Official implementation of the research paper:
"Markov-Enhanced Clustering for Long Document Summarization: Tackling the 'Lost in the Middle' Challenge with Large Language Models"
by Aziz Amari and Mohamed Achref Ben Ammar, INSAT, University of Carthage, Tunisia.
As the volume of information continues to grow across domains, the need for accurate and scalable automatic text summarization methods has become increasingly critical. While large language models (LLMs) have significantly advanced abstractive summarization, they remain challenged by the "lost in the middle" problem — the inability to effectively capture and retain critical information in lengthy documents.
This repository presents a hybrid summarization framework that integrates extractive and abstractive techniques. The method involves chunking the input text, embedding and clustering the segments, generating summaries for each cluster, and constructing a final coherent summary based on a Markov chain representation of semantic transitions between clusters. This graph-based ordering mechanism enhances logical flow and relevance in the generated summary.
The summarization pipeline consists of the following key components:
1. **Chunking**: The input document is segmented into smaller, semantically meaningful units.
2. **Embedding**: Each chunk is converted into a high-dimensional vector representation using pre-trained embedding models.
3. **Clustering**: Chunks are grouped by similarity in the embedding space to identify underlying topics.
4. **Cluster Refinement**: Outliers are removed and reclustering is performed to ensure coherent groupings.
5. **Centroid-Based Selection**: Chunks closest to each cluster's centroid are selected as representative passages.
6. **Abstractive Summarization of Clusters**: Selected chunks are summarized using an LLM to capture each cluster's key idea.
7. **Cluster Labeling**: Each cluster is assigned a semantic label based on the output summaries.
8. **Markov Transition Graph Construction**: A transition matrix is created to model the semantic flow between clusters.
9. **Ranking and Path Inference**: The most coherent sequence of clusters is determined via graph traversal (a minimal sketch of steps 3, 8, and 9 follows this list).
10. **Final Summary Generation**: The selected path is used to generate the final abstractive summary of the entire document.
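To make the clustering and ordering steps concrete, here is a minimal, self-contained sketch of how steps 3, 8, and 9 could fit together. It is illustrative only: the function name, the choice of KMeans, and the centroid-similarity transition matrix are assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of cluster ordering via a Markov-style transition matrix.
# Hypothetical code: names and algorithmic choices are assumptions, not the
# repository's implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def order_clusters(chunk_embeddings: np.ndarray, n_clusters: int = 5) -> list[int]:
    """Cluster chunk embeddings, then order the clusters by greedily
    traversing a transition matrix built from centroid similarities."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(chunk_embeddings)

    # Pairwise centroid similarity, clipped to be non-negative, with
    # self-transitions removed.
    sim = np.clip(cosine_similarity(kmeans.cluster_centers_), 0.0, None)
    np.fill_diagonal(sim, 0.0)

    # Row-normalize so each row behaves like a transition distribution.
    transition = sim / (sim.sum(axis=1, keepdims=True) + 1e-12)

    # Greedy path inference: start at the first chunk's cluster, then
    # repeatedly hop to the most probable unvisited cluster.
    path = [int(kmeans.labels_[0])]
    while len(path) < n_clusters:
        probs = transition[path[-1]].copy()
        probs[path] = 0.0  # forbid revisiting clusters
        path.append(int(probs.argmax()))
    return path
```

In the full pipeline, the inferred path would determine the order in which cluster summaries are passed to the LLM when generating the final document summary.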
To reproduce the results and run the summarization pipeline locally, follow the steps below:
This repository includes BLEURT as a submodule for evaluation purposes.
```bash
git clone --recursive https://github.com/voyverse/LargeDocSum.git
```

If you cloned the repository without `--recursive`, initialize the submodule manually:

```bash
cd LargeDocSum
git submodule update --init --recursive
```
This project uses the `nomic-embed-text` embedding model via Ollama:
- Download and install Ollama: https://ollama.com/download
- Pull the required model: `ollama pull nomic-embed-text`
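Once the model is pulled, you can sanity-check that embeddings are served locally. A minimal check using Ollama's standard `/api/embeddings` REST endpoint on its default port (adjust the URL if your setup differs):

```python
# Quick sanity check that Ollama serves nomic-embed-text embeddings.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "hello world"},
)
resp.raise_for_status()
print(f"Embedding dimension: {len(resp.json()['embedding'])}")
```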
Set up a Python environment and install the dependencies:

```bash
cd LargeDocSum
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
python -m spacy download en_core_web_lg
```
Install the BLEURT submodule and download the BLEURT-20 checkpoint:

```bash
cd bleurt
pip install .
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
```
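After unzipping the checkpoint, you can verify that it loads and scores. A short check using BLEURT's published Python API (the `BLEURT-20` path assumes you run this from the directory where you unzipped the archive):

```python
# Verify the BLEURT-20 checkpoint loads and produces a score.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(
    references=["The cat sat on the mat."],
    candidates=["A cat was sitting on the mat."],
)
print(scores)  # one float per (reference, candidate) pair
```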
Create a `.env` file in the root directory with the following entries:

```env
GPT_4O_MINI_AZURE_TARGET_URI=your_azure_uri_here
GPT_4O_MINI_AZURE_API_KEY=your_api_key_here
GPT_4O_MINI_AZURE_API_VERSION=your_api_version_here
ENVIRONMENT="local"
```
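If you want to point the pipeline at the Azure deployment instead of Ollama, the `.env` entries can be loaded with `python-dotenv` and passed to the `openai` SDK's Azure client. A minimal sketch, assuming `openai>=1.0` and `python-dotenv` are installed (how the repository itself consumes these variables may differ):

```python
# Build an Azure OpenAI client from the .env entries above.
# Sketch only: the repository may wire these variables up differently.
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()  # reads .env from the current working directory

client = AzureOpenAI(
    azure_endpoint=os.environ["GPT_4O_MINI_AZURE_TARGET_URI"],
    api_key=os.environ["GPT_4O_MINI_AZURE_API_KEY"],
    api_version=os.environ["GPT_4O_MINI_AZURE_API_VERSION"],
)
```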
Once all dependencies are set up, you can execute the summarization pipeline on your documents.
This system is designed for generating coherent summaries of long-form documents across a variety of domains such as legal texts, scientific literature, and technical manuals. It is particularly suitable for applications where both topical coverage and logical flow are critical.
Once you've installed the dependencies and set up the environment, you can easily integrate the summarization pipeline into your own scripts:
```python
from src import summarization_pipeline
from openai import OpenAI

# Load document (replace with your file loading logic)
with open("example_document.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

# Create the OpenAI-compatible client (Ollama or your Azure endpoint)
client = OpenAI(base_url='http://localhost:11434/v1', api_key="ollama")

# Run the summarization pipeline
summary_output = summarization_pipeline.pipeline(
    chunking_method="recursive",
    chunking_params={"max_length": 500, "overlap": 20},
    embed_model_name="nomic-embed-text",
    summ_model_name="gpt-4o-mini",
    doc=document_text,
    top_k=20,
    system_prompt_aggregate_summaries="You are an expert AI writing assistant...",
    system_prompt_docsummary="You are an expert writing assistant...",
    llm_instructions_doc_summary="Combine the provided chapters into a single document...",
    reference_summary=None,  # Set this if you want to compute metrics
    client=client,
    log=True
)

# Print the final summary
print(summary_output["summary"])
```
This will return a dictionary with the following keys:

- `"summary"`: the final generated summary
- `"rouge"`, `"bertscore"`: evaluation metrics (computed if a reference summary is provided)
- `"coherence"`: internal coherence score
- `"blue_rt"`: BLEURT score, computed with the bundled BLEURT submodule
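As a small follow-on to the usage example above, you can print whichever metrics were populated (assuming, as the key list suggests, that evaluation keys appear only when a reference summary was supplied):

```python
# Print whichever evaluation metrics the pipeline populated.
for key in ("rouge", "bertscore", "coherence", "blue_rt"):
    if key in summary_output:
        print(f"{key}: {summary_output[key]}")
```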
To test the interactive summarization demo:
- Install Streamlit if you haven't already: `pip install streamlit`
- Run the app: `streamlit run app/streamlit_app.py`
The Streamlit app allows you to upload documents, select chunking strategies, adjust parameters, and view summaries and metrics with a clean UI.
If you use this codebase in your research, please cite the following paper:
```bibtex
@inproceedings{amaribenammar2025markov,
  title       = {Markov-Enhanced Clustering for Long Document Summarization: Tackling the 'Lost in the Middle' Challenge with Large Language Models},
  author      = {Aziz Amari and Mohamed Achref Ben Ammar},
  booktitle   = {Proceedings of AIAI 2025 - 21st International Conference on Artificial Intelligence Applications and Innovations - Cyprus University of Technology, Limassol, Cyprus},
  year        = {2025},
  institution = {National Institute of Applied Science and Technology (INSAT), University of Carthage}
}
```
For questions, feedback, or collaboration inquiries, please contact:
- Aziz Amari: aziz.amari@insat.ucar.tn
- Mohamed Achref Ben Ammar: mohamedachref.benammar@insat.ucar.tn