Local RAG pipeline we're going to build, all using open-source tools.
In our specific example, we will create TündahChat, a RAG workflow that lets a user query knowledge bases about customary marriage practices in Africa.
Note: Tündah is a web platform that publishes information on how marriages are organized in Cameroon in particular and in Africa in general.
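At a high level, the pipeline embeds text chunks, stores them in Qdrant, retrieves the chunks most relevant to a question, and asks a local LLM served by Ollama to answer from them. The sketch below illustrates that loop; it is a minimal illustration, not the repository's actual code. The embedding model, collection name, and sample chunk are placeholders, and it assumes the `sentence-transformers`, `qdrant-client`, and `ollama` Python packages plus local Qdrant and Ollama instances (started in the Run section below).

```python
# Minimal local RAG loop (illustrative sketch, not the actual Tundah-RAG code).
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
import ollama

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model (384-dim)
client = QdrantClient(host="localhost", port=6333)  # Qdrant started via Docker (see Run section)

# 1. Index: embed each text chunk and store it with its text as payload.
chunks = ["In Cameroon, the bride price is negotiated between the two families..."]
client.recreate_collection(
    collection_name="tundah",  # placeholder collection name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="tundah",
    points=[
        PointStruct(id=i, vector=embedder.encode(c).tolist(), payload={"text": c})
        for i, c in enumerate(chunks)
    ],
)

# 2. Retrieve: embed the question and fetch the closest chunks.
question = "How is the bride price negotiated in Cameroon?"
hits = client.search(
    collection_name="tundah",
    query_vector=embedder.encode(question).tolist(),
    limit=3,
)
context = "\n".join(hit.payload["text"] for hit in hits)

# 3. Generate: let the local LLM answer from the retrieved context only.
response = ollama.chat(
    model="llama2:7b-chat-q4_0",
    messages=[{"role": "user",
               "content": f"Answer from this context:\n{context}\n\nQuestion: {question}"}],
)
print(response["message"]["content"])
```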
You can also run the notebook Tundah.ipynb directly on your local machine.
- Tundah.ipynb: This notebook outlines the sequential workflow of Tundah-RAG, providing a step-by-step process.
- assets: A repository for supplementary data that supports the project.
- Pdf_path: This directory holds all the source PDFs, which serve as the information backbone for our RAG system.
- static: Contains supplementary assets related to the Streamlit interface, including logos and other media.
- Structured_file: Stores the processed results of the PDFs, refined through the Marker model for structured output.
- Transcript_path: Includes the video.json file, which contains the links and titles of YouTube videos relevant to customary marriages in Africa.
- Tundah/Classes: This folder organizes the classes designed to structure our codebase, making it reusable, maintainable, and compliant with software-engineering best practices.
- main.py: Provides a command-line interface for testing the code.
- streamlit.py: The Streamlit interface facilitates seamless user-RAG interaction, enhancing usability.
The dataset used to build the RAG system focuses on aspects of customary marriage in nine African countries: Cameroon, Kenya, Nigeria, South Africa, Zimbabwe, Tanzania, Uganda, Botswana, and Mali. Currently, the data is sourced from two main channels: cultural articles/books and YouTube videos. In the future, we plan to scrape websites and blogs that are rich in reliable, accessible information.
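Since YouTube transcripts are one of those channels, the sketch below shows one way they could be fetched with the `youtube-transcript-api` package. It is an illustration, not the project's actual ingestion code: the schema assumed for video.json (a list of objects with "title" and "url" keys) and the file path are assumptions.

```python
# Illustrative sketch: fetch transcripts for the videos listed in video.json.
# Assumed (not guaranteed) schema for video.json:
# [{"title": "...", "url": "https://www.youtube.com/watch?v=VIDEO_ID"}, ...]
import json
from urllib.parse import parse_qs, urlparse

from youtube_transcript_api import YouTubeTranscriptApi

with open("Transcript_path/video.json") as f:
    videos = json.load(f)

for video in videos:
    # Extract the video id from the "v" query parameter of the URL.
    video_id = parse_qs(urlparse(video["url"]).query)["v"][0]
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    text = " ".join(segment["text"] for segment in segments)
    print(video["title"], "->", len(text), "characters of transcript")
```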
Clone the repository and install the dependencies:
```bash
git clone https://github.com/Omer-alt/Tundah-RAG.git
cd Tundah-RAG
pip install -r requirements.txt
```
Open the project in VS Code:
```bash
code .
```
or in Jupyter Notebook:
```bash
jupyter notebook
```
Launch Docker, run Qdrant, and run Ollama:
```bash
open -a Docker
docker run -p 6333:6333 qdrant/qdrant
ollama run llama2:7b-chat-q4_0
```
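Before launching the app, it can be worth checking that both services are reachable. A minimal sanity check, assuming the `qdrant-client` and `ollama` Python packages are installed:

```python
# Quick sanity check that Qdrant and Ollama are up (illustrative sketch).
from qdrant_client import QdrantClient
import ollama

client = QdrantClient(host="localhost", port=6333)
print(client.get_collections())  # lists existing collections (possibly none yet)
print(ollama.list())             # should include llama2:7b-chat-q4_0 once pulled
```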
Run in the console:
```bash
python main.py
```
Run with the Streamlit interface:
```bash
streamlit run streamlit.py
```
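The repository's streamlit.py already implements the interface; the sketch below only illustrates the general pattern such a front end follows. The `answer_question` helper is hypothetical, standing in for the project's RAG classes:

```python
# Illustrative Streamlit front-end pattern (not the repository's streamlit.py).
import streamlit as st

def answer_question(question: str) -> str:
    # Hypothetical stand-in for the project's retrieval + generation pipeline.
    return f"(answer to: {question})"

st.title("TündahChat")
question = st.text_input("Ask about customary marriage practices in Africa")
if question:
    with st.spinner("Retrieving and generating..."):
        st.write(answer_question(question))
```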
- One significant limitation is the availability of datasets. To address this, I considered using transcripts from YouTube videos. However, another significant challenge arises: the videos deemed relevant by local communities are often in low-resource languages (for example, customary marriages in Ghana), which affects the quality of embeddings and the performance of Large Language Models (LLMs) in such contexts. This issue is highlighted in the recent paper "IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models".
- Creating a Docker Image for the RAG System
- Implementing CI/CD for the RAG System
Tutorials that can help you better understand this project:
- simple-local-rag
- Deep Dive into Retrieval Augmented Generation (RAG) - Architecture & Working of Naive and Advanced RAG Framework.
- Docker Crash Course for Absolute Beginners
- ChatGPT Prompt Engineering for Developers
- Bases de Données Vectorielles : Expérience & Conseils d'Expert (in French: "Vector Databases: Expert Experience & Advice")
⭐️ If you find this repository helpful, we’d be thrilled if you could give it a star!