This project demonstrates a Retrieval-Augmented Generation (RAG) pipeline using LangChain, OpenAI, and ChromaDB. The goal is to answer user questions in context, using a vector database built from custom .docx lecture notes.
- Loads .docx lecture notes using Docx2txtLoader
- Splits text using markdown headers and character-based chunking
- Embeds documents with OpenAI text-embedding-ada-002
- Stores embeddings in a Chroma vector database
- Retrieves relevant context via MMR (Maximal Marginal Relevance)
- Generates contextual responses using GPT-4
- Automatically includes the source lecture title for transparency
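To make the MMR retrieval step more concrete, here is a minimal pure-Python sketch of the selection rule. It is not the notebook's code: the function names `cosine` and `mmr_select`, the parameter name `lambda_mult`, and the toy two-dimensional vectors are all assumptions chosen for illustration. MMR picks documents one at a time, rewarding similarity to the query and penalizing similarity to documents already picked.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks by relevance only; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings (hypothetical): documents 0 and 1 are near-duplicates.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]

diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With these toy vectors, the balanced setting skips the near-duplicate and returns documents 0 and 2, while the relevance-only setting returns documents 0 and 1.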
- RAG_Project.ipynb # Main notebook demonstrating the RAG pipeline
- ./intro-ds-vectorstore/ # Directory where the vectorstore is persisted
- Introduction_to_Data_and_Data_Science_2.docx # Lecture notes used as input
- Make sure to add your OpenAI key in a .env file: OPENAI_API_KEY=your_key_here
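As a rough illustration of how a key stored this way can reach the process environment, the sketch below parses `KEY=VALUE` lines using only the standard library. The helper name `load_env_file` is hypothetical and this is not necessarily how the notebook reads the file; the python-dotenv package's `load_dotenv()` is a common alternative.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from `path` and export them to os.environ.

    Variables already set in the environment are left untouched.
    Returns a dict of the pairs found in the file.
    """
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip("'\"")
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env_file()
    # Report only whether the key is present; never print the key itself.
    print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```

Keep the .env file out of version control so the key is not committed by accident.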
The notebook loads and cleans .docx lecture notes.
It splits the content into meaningful chunks using markdown headers and sentence-based logic.
The chunks are embedded and stored in a persistent Chroma vectorstore.
On receiving a user query, the retriever selects relevant chunks using MMR, which balances semantic similarity to the query against diversity among the results.
The GPT-4 model answers the question using only the retrieved context.
The output ends with the name of the relevant lecture as the source.
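Two of the steps above, splitting on markdown headers and instructing the model to cite its source, can be sketched without any external libraries. This is an illustrative approximation rather than the notebook's implementation: the helper names `split_by_headers` and `build_prompt`, the prompt wording, and the "Statistics Basics" section are assumptions.

```python
import re


def split_by_headers(markdown_text):
    """Split text into (title, body) sections at markdown '#' headers."""
    sections = []
    title, body = None, []

    def flush():
        # Keep a section only if it has a title or some non-blank text.
        if title is not None or any(s.strip() for s in body):
            sections.append((title or "Untitled", "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


def build_prompt(question, retrieved):
    """Assemble a context-only prompt from a list of (title, text) pairs."""
    context = "\n\n".join(text for _, text in retrieved)
    # dict.fromkeys removes duplicate titles while keeping their order.
    titles = ", ".join(dict.fromkeys(title for title, _ in retrieved))
    return (
        "Answer the question using only the context below.\n"
        f"End your answer with 'Resources: {titles}'.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


notes = (
    "# Introduction to Programming for Data Science\n"
    "Python is widely used.\n"
    "# Statistics Basics\n"
    "Mean and variance."
)
sections = split_by_headers(notes)
prompt = build_prompt(
    "Which programming language do data scientists use?", sections[:1]
)
```

Keeping the section title alongside each chunk is what lets the final answer name the lecture it came from.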
Sample Query:
Which programming language do data scientists use?
Sample Output:
Python is the most commonly used programming language in data science due to its rich ecosystem of libraries and community support.
Resources: Introduction to Programming for Data Science
- Add a Streamlit or Gradio UI
- Support PDF and web-based loaders
- Improve chunking strategy with overlap
- Allow multi-document indexing
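For the overlap idea listed above, a minimal sketch of fixed-size character chunking with overlap might look like the following. The function name `chunk_with_overlap` and its default sizes are assumptions; LangChain's `CharacterTextSplitter` exposes a comparable `chunk_overlap` parameter, which would likely be the practical route in the notebook.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence that
    straddles a boundary appears in full in at least one chunk when it
    is shorter than the overlap.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


example = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
# example == ["abcd", "cdef", "efgh", "ghij"]
```

Larger overlaps reduce the chance of cutting an idea in half, at the cost of storing and embedding more duplicated text.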
Aung Kaung Pyae Paing - AI/ML Enthusiast