VisText-RAG-DocumentQNA

This repository provides a complete implementation of a multimodal RAG system for document question answering. Its indexing pipeline processes your document corpus: it detects layout regions with a YOLO model trained on DocLayNet, extracts both text and visual elements (such as tables and figures), and stores them in a vector database using sentence embeddings for text and ColPALI embeddings for images. Its chat inference pipeline handles user queries, performs dual retrieval over the text and image embeddings, and generates context-aware answers with a vision-capable language model (LLaMA-4). This design enables accurate and explainable retrieval from visually rich documents.
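To make the dual-retrieval step concrete, here is a minimal, self-contained sketch using toy NumPy vectors in place of the real encoders and vector database. The store names, dimensions, and scoring helpers are illustrative assumptions, not the repository's actual code; the point is the contrast between single-vector cosine search for text and ColPALI-style late-interaction (MaxSim) scoring for images.

```python
import numpy as np

# Toy in-memory stores standing in for the real vector database.
# IDs and dimensions are illustrative only.
text_store = {
    "chunk-1": np.random.rand(384),        # one sentence-embedding vector per text chunk
    "chunk-2": np.random.rand(384),
}
image_store = {
    "figure-1": np.random.rand(16, 128),   # ColPALI emits one vector per image patch
    "table-1": np.random.rand(16, 128),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def text_search(query_vec, k=2):
    """Single-vector cosine retrieval over text chunks."""
    return sorted(text_store, key=lambda cid: -cosine(query_vec, text_store[cid]))[:k]

def colpali_search(query_vecs, k=2):
    """Late-interaction (MaxSim) retrieval: each query-token vector is matched
    to its most similar document patch vector, and the maxima are summed."""
    def maxsim(doc_vecs):
        return float((query_vecs @ doc_vecs.T).max(axis=1).sum())
    return sorted(image_store, key=lambda iid: -maxsim(image_store[iid]))[:k]

# Dual retrieval: the query is embedded twice (sentence encoder + ColPALI),
# both indexes are searched, and the union of hits (text chunks plus image
# crops) becomes the context passed to the vision-capable LLM.
hits = text_search(np.random.rand(384), k=1) + colpali_search(np.random.rand(8, 128), k=1)
print(hits)
```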

πŸ“ You can read the full article here:
πŸ‘‰ ColPALI Meets DocLayNet: A Vision-Aware Multimodal RAG for Document QA - Medium

πŸ–ΌοΈ Visual Examples

πŸ” Evaluation: Comparison with Common RAG Pipeline

Evaluation Result

In our evaluation, this pipeline retrieved answers more accurately than a common text-only RAG pipeline, with the largest gains on questions grounded in visually complex content such as tables and figures.


🎨 Streamlit App – Frontend Overview

Streamlit UI


βš™οΈ Reproducing the Environment

# Create Conda Environment
conda create -n multimodal_rag python=3.11
conda activate multimodal_rag

# Install Libraries
pip install -r requirements.txt

πŸš€ Running the Application

Once you've set up the environment and downloaded the required models, you can launch both the backend and frontend with the following commands:

βœ… Run the Backend Server (FastAPI)

uvicorn main:app --port 8000 
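Once the backend is up, you can also exercise it directly over HTTP. The snippet below is a hypothetical example: the route name and payload fields are assumptions for illustration, so check main.py for the actual FastAPI endpoint definitions and request schema.

```python
import requests

# Hypothetical endpoint and payload; see main.py for the real route names
# and request schema exposed by the FastAPI app.
response = requests.post(
    "http://localhost:8000/chat",
    json={"question": "What trend does Figure 3 show?"},
    timeout=120,
)
response.raise_for_status()
print(response.json())
```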

βœ… Run the Frontend (Streamlit)

streamlit run frontend.py --server.port 8001

πŸ“š Adding a Knowledge Base

You can enhance the chatbot's responses by providing your own knowledge base (PDF documents). Before indexing any documents, ensure that the backend server is running.

  1. Put your PDF files into the following directory:
document_sources/
  2. Then, run the following command to index the new documents:
python execute_indexing.py
  3. If you want to refresh the entire indexing pipeline (i.e., delete the old vectors and start fresh from document_sources/), run:
python execute_indexing.py --initialize
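For orientation, here is a condensed sketch of what one indexing pass over document_sources/ might look like, assuming pdf2image, ultralytics, and sentence-transformers are installed. The weights file name, class labels, and encoder checkpoint are placeholders rather than the repository's actual configuration; see execute_indexing.py for the real logic.

```python
from pathlib import Path

from pdf2image import convert_from_path       # renders PDF pages to PIL images
from sentence_transformers import SentenceTransformer
from ultralytics import YOLO

layout_model = YOLO("yolo-doclaynet.pt")      # placeholder: DocLayNet-trained weights
text_encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

for pdf_path in Path("document_sources").glob("*.pdf"):
    for page_no, page in enumerate(convert_from_path(str(pdf_path)), start=1):
        result = layout_model(page)[0]        # detect layout regions on the page image
        for box in result.boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            crop = page.crop((x1, y1, x2, y2))
            label = result.names[int(box.cls)]
            if label in {"Table", "Picture"}:
                # Visual element: embed the crop with ColPALI and upsert the
                # multi-vector embedding into the image index (omitted here).
                ...
            else:
                # Text region: extract/OCR the text, embed it, and upsert the
                # vector into the text index (storage omitted here).
                text = ""                     # placeholder for the extracted text
                vector = text_encoder.encode(text)
```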

πŸ“š Citation

If you use this work, please consider citing the following foundational papers:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and CΓ©line Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@inproceedings{pfitzmann2022doclaynet,
  title={DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  author={Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter},
  booktitle={Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages={3743--3751},
  year={2022}
}
@inproceedings{reimers-2020-multilingual-sentence-bert,
  title={Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2020},
  publisher={Association for Computational Linguistics},
  url={https://arxiv.org/abs/2004.09813},
}
