ResolvaBot is a platform that uses large language models (LLMs) together with natural language processing (NLP) and retrieval techniques to resolve user queries with context-aware, precise information retrieval and problem-solving.
The goal of this project is to create a comprehensive system for extracting content from textbooks, indexing it with RAPTOR in a MILVUS vector database, and building a question-answering system on top of a large language model (LLM). The assessment covers data extraction, data processing, vector database creation, retrieval techniques, and natural language processing.
Textbook Selection:
- The following textbooks were selected for content extraction and processing in this project:
Introduction to Algorithms (4th Edition)
Authors: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein
Publisher: MIT Press
Link: Introduction to Algorithms (4th Edition)
Handbook of Natural Language Processing (Second Edition)
Editors: Dan Jurafsky and James H. Martin
Publisher: Chapman & Hall/CRC
Link: Handbook of Natural Language Processing (Second Edition)
System Analysis and Design
Publisher: Informatics Institute
Link: System Analysis and Design
Content Extraction:
- Extracted content from the selected textbooks, ensuring thorough coverage of all relevant text.
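A minimal sketch of the page-level extraction step, assuming the textbooks are PDFs and PyMuPDF (imported as `fitz`) is used; the file path is a placeholder.

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path):
    """Yield (page_number, text) for every page so page numbers can be kept as metadata."""
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            yield page_number, page.get_text("text")

# Placeholder path; replace with the actual textbook PDFs.
pages = list(extract_pages("textbooks/introduction_to_algorithms.pdf"))
```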
Data Chunking:
- Chunked the extracted content into short, contiguous texts of approximately 100 tokens each, preserving sentence boundaries (as sketched below).
- Note: NLTK’s `word_tokenize` is used for tokenization.
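A minimal chunking sketch under those constraints: whole sentences are grouped until a chunk reaches roughly 100 tokens, counted with NLTK’s `word_tokenize`, so sentence boundaries are never split. The chunk size comes from the requirement above; everything else is illustrative.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

def chunk_text(text, max_tokens=100):
    """Group whole sentences into chunks of roughly max_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in sent_tokenize(text):
        n_tokens = len(word_tokenize(sentence))
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```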
Embedding and Indexing:
- Embedded the chunked texts using Sentence-BERT (SBERT) to create vector representations.
- Implemented RAPTOR indexing with the following steps (a simplified sketch follows this list):
  - Clustering: Used Gaussian Mixture Models (GMMs) with soft clustering to allow nodes to belong to multiple clusters.
  - Summarization: Summarized clusters using an LLM (e.g., GPT-3.5-turbo) to create concise representations.
  - Recursive Clustering and Summarization: Re-embedded the summarized texts and recursively applied clustering and summarization to form a hierarchical tree structure.
- Stored the RAPTOR index in a MILVUS vector database, including metadata such as textbook titles and page numbers.
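A simplified, single-level sketch of this pipeline, assuming Sentence-Transformers and scikit-learn: chunks are embedded with an SBERT model, soft-clustered with a GMM, and each cluster would then be summarized (e.g., with GPT-3.5-turbo) before the summaries are re-embedded for the next level. The model name, cluster count, and membership threshold are assumptions, and the Milvus insertion step is omitted here.

```python
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT model

def build_level(chunks, n_clusters=8, threshold=0.2):
    """One RAPTOR level: embed chunks and soft-cluster them with a GMM."""
    embeddings = embedder.encode(chunks)
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(embeddings)
    probs = gmm.predict_proba(embeddings)  # soft assignments: one row per chunk
    clusters = []
    for c in range(n_clusters):
        members = [chunks[i] for i in range(len(chunks)) if probs[i, c] >= threshold]
        if members:
            clusters.append(members)
    # Each cluster would then be summarized with the LLM, the summaries
    # re-embedded for the next level, and every level written to Milvus
    # along with its textbook title and page-number metadata.
    return embeddings, clusters
```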
Query Expansion:
- Implemented query expansion techniques such as synonym expansion, stemming, and the use of external knowledge bases.
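One possible form of synonym expansion, using WordNet via NLTK; stemming or an external knowledge base could be substituted. The cap on synonyms per term is arbitrary.

```python
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

def expand_query(query, max_synonyms=2):
    """Append a few WordNet synonyms to each query term."""
    expanded = []
    for token in word_tokenize(query.lower()):
        expanded.append(token)
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(token)
            for lemma in synset.lemmas()
        } - {token}
        expanded.extend(sorted(synonyms)[:max_synonyms])
    return " ".join(expanded)
```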
Hybrid Retrieval Methods:
- Combined BM25 (Best Match 25) with BERT/bi-encoder-based retrieval methods such as Dense Passage Retrieval (DPR) and Semantic Passage Retrieval (SPIDER).
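A minimal hybrid-scoring sketch: BM25 scores (rank_bm25 is used here as a lightweight stand-in for a Pyserini index) are normalized and fused with bi-encoder cosine similarities. The fusion weight `alpha` and the model name are assumptions.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def hybrid_search(query, chunks, alpha=0.5, top_k=5):
    """Fuse lexical (BM25) and dense (bi-encoder) relevance scores."""
    # Lexical scores, normalized to [0, 1]
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    lexical = np.array(bm25.get_scores(query.lower().split()))
    lexical = lexical / (lexical.max() + 1e-9)

    # Dense bi-encoder similarities
    model = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb = model.encode(query, convert_to_tensor=True)
    c_emb = model.encode(chunks, convert_to_tensor=True)
    dense = util.cos_sim(q_emb, c_emb).cpu().numpy().ravel()

    fused = alpha * lexical + (1 - alpha) * dense
    return [chunks[i] for i in fused.argsort()[::-1][:top_k]]
```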
Re-ranking:
- Re-ranked the retrieved passages by relevance and similarity using appropriate ranking algorithms.
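One way to implement this step is with a cross-encoder from Sentence-Transformers, scoring each (query, passage) pair and sorting by score; the specific model name here is an assumption.

```python
from sentence_transformers import CrossEncoder

def rerank(query, chunks, top_k=3):
    """Re-rank candidate chunks with a cross-encoder relevance model."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```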
LLM Integration:
- Used an LLM (e.g., OpenAI’s GPT-3.5-turbo) to generate accurate and relevant answers based on the retrieved data.
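A hedged sketch of the generation step with the OpenAI Python client (v1.x API); the prompt wording is illustrative.

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_answer(query, context):
    """Ask GPT-3.5-turbo to answer the query using only the retrieved context."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```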
Fallback to Wikipedia:
- If no relevant context is available, the system falls back to Wikipedia to answer the question, with proper handling of Wikipedia API requests and errors.
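A fallback sketch using the Wikipedia-API package (`wikipediaapi`), one library that requires the `user_agent` mentioned in the setup notes below; the choice of library and the user-agent string are assumptions.

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="ResolvaBot/1.0 (contact: your_email@example.com)",  # placeholder
    language="en",
)

def wikipedia_fallback(query):
    """Return a Wikipedia summary when no relevant textbook context is found."""
    try:
        page = wiki.page(query)
        if page.exists():
            return page.summary
        return "No relevant context found on Wikipedia."
    except Exception as exc:
        return f"Wikipedia lookup failed: {exc}"
```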
UI Development:
- Developed a user interface using Streamlit to demonstrate the system’s functionality.
- The interface allows users to input queries and view retrieved answers along with the corresponding textbook titles and page numbers.
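A minimal Streamlit sketch of that interface; `retrieve_and_answer` is a hypothetical stand-in for the retrieval-plus-LLM pipeline described above, stubbed here so the snippet runs on its own.

```python
import streamlit as st

def retrieve_and_answer(query):
    """Hypothetical stub for the RAPTOR retrieval + LLM answer pipeline."""
    return "Answer goes here.", [{"title": "Introduction to Algorithms", "page": 1}]

st.title("ResolvaBot")
query = st.text_input("Ask a question about the indexed textbooks")

if query:
    answer, sources = retrieve_and_answer(query)
    st.write(answer)
    st.caption("Sources: " + ", ".join(f"{s['title']} (p. {s['page']})" for s in sources))
```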
Prerequisites:
- Python 3.7 or later
- Git

Installation:
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
- Create a `.env` file in the root directory and add your OpenAI API key: `OPENAI_API_KEY=your_openai_api_key`
- For the Wikipedia API, ensure your `user_agent` is set appropriately in your code.
- Run the script to create the index from the textbooks:
python "Raptor indexing.py"
- Start the Streamlit app interface:
- Make sure you replace the directory and file paths with your own before running.
streamlit run Resolvabotapp.py
- Upload Textbooks: Use the Streamlit interface to upload textbooks for indexing.
- Query the System: Input queries into the interface to retrieve and view answers.
- View Results: Examine the results displayed, including relevant context and answers.
- Appropriateness of textbooks and completeness of content extraction.
- Effectiveness of data chunking and RAPTOR indexing processes.
- Quality of retrieval techniques, including query expansion and hybrid methods.
- Relevance and accuracy of re-ranking algorithms.
- Accuracy of LLM-generated answers.
- Overall system performance and efficiency.
- (Optional) User interface design and user experience.
- RAPTOR Paper: RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
- MILVUS Documentation: MILVUS Documentation
- Relevant Python Libraries:
- NLTK: Natural Language Toolkit
- Gensim: Gensim Documentation
- Transformers: Transformers Documentation
- PyMuPDF: PyMuPDF Documentation
- Pyserini: Pyserini Documentation
- Sentence-Transformers: Sentence-Transformers Documentation