Local RAG pipeline we're going to build, all using open-source tools.
In our specific example, we will create TündahChat, a RAG workflow that lets a user query knowledge bases about customary marriage practices in Africa.
Note: Tündah is a web platform that publishes information on how marriages are organized in Cameroon in particular and in Africa in general.
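At a high level, the pipeline embeds text chunks, stores them in Qdrant, retrieves the chunks most relevant to a question, and asks a local LLM served by Ollama to answer from them. The sketch below illustrates that loop; it is a minimal illustration, not the repository's actual code. The embedding model, collection name, and sample chunk are placeholders, and it assumes the `sentence-transformers`, `qdrant-client`, and `ollama` Python packages plus local Qdrant and Ollama instances (started in the Run section below).

```python
# Minimal local RAG loop (illustrative sketch, not the actual Tundah-RAG code).
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
import ollama

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model (384-dim)
client = QdrantClient(host="localhost", port=6333)  # Qdrant started via Docker (see Run section)

# 1. Index: embed each text chunk and store it with its text as payload.
chunks = ["In Cameroon, the bride price is negotiated between the two families..."]
client.recreate_collection(
    collection_name="tundah",  # placeholder collection name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="tundah",
    points=[
        PointStruct(id=i, vector=embedder.encode(c).tolist(), payload={"text": c})
        for i, c in enumerate(chunks)
    ],
)

# 2. Retrieve: embed the question and fetch the closest chunks.
question = "How is the bride price negotiated in Cameroon?"
hits = client.search(
    collection_name="tundah",
    query_vector=embedder.encode(question).tolist(),
    limit=3,
)
context = "\n".join(hit.payload["text"] for hit in hits)

# 3. Generate: let the local LLM answer from the retrieved context only.
response = ollama.chat(
    model="llama2:7b-chat-q4_0",
    messages=[{"role": "user",
               "content": f"Answer from this context:\n{context}\n\nQuestion: {question}"}],
)
print(response["message"]["content"])
```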
You can also run the notebook Tundah.ipynb directly on your local machine.
- Tundah.ipynb: This notebook outlines the sequential workflow of Tundah-RAG, providing a step-by-step process.
- assets: A repository for supplementary data that supports the project.
- Pdf_path: This directory holds all the source PDFs, which serve as the information backbone for our RAG system.
- static: Contains supplementary assets related to the Streamlit interface, including logos and other media.
- Structured_file: Stores the processed results of the PDFs, refined through the Marker model for structured output.
- Transcript_path: Includes the video.json file, which contains the links and titles of YouTube videos relevant to customary marriages in Africa.
- Tundah/Classes: This folder organizes the classes designed to structure our codebase, making it reusable, maintainable, and compliant with software-engineering best practices.
- main.py: Provides a command-line interface for testing the code.
- streamlit.py: The Streamlit interface facilitates seamless user-RAG interaction, enhancing usability.
The dataset used to build the RAG system focuses on aspects of customary marriage in nine African countries: Cameroon, Kenya, Nigeria, South Africa, Zimbabwe, Tanzania, Uganda, Botswana, and Mali. Currently, the data is sourced from two main channels: cultural articles/books and YouTube videos. In the future, we plan to scrape websites and blogs that are rich in reliable, accessible information.
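Since YouTube transcripts are one of those channels, the sketch below shows one way they could be fetched with the `youtube-transcript-api` package. It is an illustration, not the project's actual ingestion code: the schema assumed for video.json (a list of objects with "title" and "url" keys) and the file path are assumptions.

```python
# Illustrative sketch: fetch transcripts for the videos listed in video.json.
# Assumed (not guaranteed) schema for video.json:
# [{"title": "...", "url": "https://www.youtube.com/watch?v=VIDEO_ID"}, ...]
import json
from urllib.parse import parse_qs, urlparse

from youtube_transcript_api import YouTubeTranscriptApi

with open("Transcript_path/video.json") as f:
    videos = json.load(f)

for video in videos:
    # Extract the video id from the "v" query parameter of the URL.
    video_id = parse_qs(urlparse(video["url"]).query)["v"][0]
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    text = " ".join(segment["text"] for segment in segments)
    print(video["title"], "->", len(text), "characters of transcript")
```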
Clone the repository and install the dependencies:
```bash
git clone https://github.com/Omer-alt/Tundah-RAG.git
cd Tundah-RAG
pip install -r requirements.txt
```
Open the project in VS Code:
```bash
code .
```
or in Jupyter Notebook:
```bash
jupyter notebook
```
Launch Docker, run Qdrant, and run Ollama:
```bash
open -a Docker
docker run -p 6333:6333 qdrant/qdrant
ollama run llama2:7b-chat-q4_0
```
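Before launching the app, it can be worth checking that both services are reachable. A minimal sanity check, assuming the `qdrant-client` and `ollama` Python packages are installed:

```python
# Quick sanity check that Qdrant and Ollama are up (illustrative sketch).
from qdrant_client import QdrantClient
import ollama

client = QdrantClient(host="localhost", port=6333)
print(client.get_collections())  # lists existing collections (possibly none yet)
print(ollama.list())             # should include llama2:7b-chat-q4_0 once pulled
```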
Run in the console:
```bash
python main.py
```
Run with the Streamlit interface:
```bash
streamlit run streamlit.py
```
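The repository's streamlit.py already implements the interface; the sketch below only illustrates the general pattern such a front end follows. The `answer_question` helper is hypothetical, standing in for the project's RAG classes:

```python
# Illustrative Streamlit front-end pattern (not the repository's streamlit.py).
import streamlit as st

def answer_question(question: str) -> str:
    # Hypothetical stand-in for the project's retrieval + generation pipeline.
    return f"(answer to: {question})"

st.title("TündahChat")
question = st.text_input("Ask about customary marriage practices in Africa")
if question:
    with st.spinner("Retrieving and generating..."):
        st.write(answer_question(question))
```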
- One significant limitation is the availability of datasets. To address this, I considered using transcripts from YouTube videos. However, another significant challenge arises: the videos deemed relevant by local communities are often in low-resource languages (for example, customary marriages in Ghana), which affects the quality of embeddings and the performance of Large Language Models (LLMs) in such contexts. This issue is highlighted in the recent paper "IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models".
- Creating a Docker Image for the RAG System
- Implementing CI/CD for the RAG System
Tutorials that can help you better understand this project:
- simple-local-rag
- Deep Dive into Retrieval Augmented Generation (RAG) - Architecture & Working of Naive and Advanced RAG Framework.
- Docker Crash Course for Absolute Beginners
- ChatGPT Prompt Engineering for Developers
- Bases de Données Vectorielles : Expérience & Conseils d'Expert (in French: "Vector Databases: Expert Experience & Advice")
⭐️ If you find this repository helpful, we’d be thrilled if you could give it a star!