ResolvaBot is a platform that uses large language models (LLMs) together with natural language processing (NLP) and retrieval techniques to resolve user queries with context-aware, precise information retrieval and problem-solving.
The goal of this project is to create a comprehensive system for extracting content from textbooks, indexing it with RAPTOR in a MILVUS vector database, and building a question-answering system on top of a large language model (LLM). The assessment covers data extraction, data processing, vector database creation, retrieval techniques, and natural language processing.
Textbook Selection:
- The following textbooks were selected for content extraction and processing in this project:
Introduction to Algorithms (4th Edition)
Authors: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein
Publisher: MIT Press
Link: Introduction to Algorithms (4th Edition)
Handbook of Natural Language Processing (Second Edition)
Editors: Dan Jurafsky and James H. Martin
Publisher: Chapman & Hall/CRC
Link: Handbook of Natural Language Processing (Second Edition)
System Analysis and Design
Publisher: Informatics Institute
Link: System Analysis and Design
Content Extraction:
- Extracted content from the selected textbooks, ensuring thorough coverage of all relevant text.
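A minimal sketch of the page-level extraction step, assuming the textbooks are PDFs and PyMuPDF (imported as `fitz`) is used; the file path is a placeholder.

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path):
    """Yield (page_number, text) for every page so page numbers can be kept as metadata."""
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            yield page_number, page.get_text("text")

# Placeholder path; replace with the actual textbook PDFs.
pages = list(extract_pages("textbooks/introduction_to_algorithms.pdf"))
```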
Data Chunking:
- Chunked the extracted content into short, contiguous texts of approximately 100 tokens each, preserving sentence boundaries (as sketched below).
- Note: NLTK’s `word_tokenize` is used for tokenization.
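A minimal chunking sketch under those constraints: whole sentences are grouped until a chunk reaches roughly 100 tokens, counted with NLTK’s `word_tokenize`, so sentence boundaries are never split. The chunk size comes from the requirement above; everything else is illustrative.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

def chunk_text(text, max_tokens=100):
    """Group whole sentences into chunks of roughly max_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in sent_tokenize(text):
        n_tokens = len(word_tokenize(sentence))
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```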
Embedding and Indexing:
- Embedded the chunked texts using Sentence-BERT (SBERT) to create vector representations.
- Implemented RAPTOR indexing with the following steps (a simplified sketch follows this list):
  - Clustering: Used Gaussian Mixture Models (GMMs) with soft clustering to allow nodes to belong to multiple clusters.
  - Summarization: Summarized clusters using an LLM (e.g., GPT-3.5-turbo) to create concise representations.
  - Recursive Clustering and Summarization: Re-embedded the summarized texts and recursively applied clustering and summarization to form a hierarchical tree structure.
- Stored the RAPTOR index in a MILVUS vector database, including metadata such as textbook titles and page numbers.
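A simplified, single-level sketch of this pipeline, assuming Sentence-Transformers and scikit-learn: chunks are embedded with an SBERT model, soft-clustered with a GMM, and each cluster would then be summarized (e.g., with GPT-3.5-turbo) before the summaries are re-embedded for the next level. The model name, cluster count, and membership threshold are assumptions, and the Milvus insertion step is omitted here.

```python
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT model

def build_level(chunks, n_clusters=8, threshold=0.2):
    """One RAPTOR level: embed chunks and soft-cluster them with a GMM."""
    embeddings = embedder.encode(chunks)
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(embeddings)
    probs = gmm.predict_proba(embeddings)  # soft assignments: one row per chunk
    clusters = []
    for c in range(n_clusters):
        members = [chunks[i] for i in range(len(chunks)) if probs[i, c] >= threshold]
        if members:
            clusters.append(members)
    # Each cluster would then be summarized with the LLM, the summaries
    # re-embedded for the next level, and every level written to Milvus
    # along with its textbook title and page-number metadata.
    return embeddings, clusters
```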
Query Expansion:
- Implemented query expansion techniques such as synonym expansion, stemming, and the use of external knowledge bases.
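One possible form of synonym expansion, using WordNet via NLTK; stemming or an external knowledge base could be substituted. The cap on synonyms per term is arbitrary.

```python
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

def expand_query(query, max_synonyms=2):
    """Append a few WordNet synonyms to each query term."""
    expanded = []
    for token in word_tokenize(query.lower()):
        expanded.append(token)
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(token)
            for lemma in synset.lemmas()
        } - {token}
        expanded.extend(sorted(synonyms)[:max_synonyms])
    return " ".join(expanded)
```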
Hybrid Retrieval Methods:
- Combined BM25 (Best Match 25) with BERT/bi-encoder-based retrieval methods such as Dense Passage Retrieval (DPR) and Semantic Passage Retrieval (SPIDER).
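A minimal hybrid-scoring sketch: BM25 scores (rank_bm25 is used here as a lightweight stand-in for a Pyserini index) are normalized and fused with bi-encoder cosine similarities. The fusion weight `alpha` and the model name are assumptions.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def hybrid_search(query, chunks, alpha=0.5, top_k=5):
    """Fuse lexical (BM25) and dense (bi-encoder) relevance scores."""
    # Lexical scores, normalized to [0, 1]
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    lexical = np.array(bm25.get_scores(query.lower().split()))
    lexical = lexical / (lexical.max() + 1e-9)

    # Dense bi-encoder similarities
    model = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb = model.encode(query, convert_to_tensor=True)
    c_emb = model.encode(chunks, convert_to_tensor=True)
    dense = util.cos_sim(q_emb, c_emb).cpu().numpy().ravel()

    fused = alpha * lexical + (1 - alpha) * dense
    return [chunks[i] for i in fused.argsort()[::-1][:top_k]]
```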
Re-ranking:
- Re-ranked the retrieved passages by relevance and similarity using appropriate ranking algorithms.
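One way to implement this step is with a cross-encoder from Sentence-Transformers, scoring each (query, passage) pair and sorting by score; the specific model name here is an assumption.

```python
from sentence_transformers import CrossEncoder

def rerank(query, chunks, top_k=3):
    """Re-rank candidate chunks with a cross-encoder relevance model."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```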
LLM Integration:
- Used an LLM (e.g., OpenAI’s GPT-3.5-turbo) to generate accurate and relevant answers based on the retrieved data.
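A hedged sketch of the generation step with the OpenAI Python client (v1.x API); the prompt wording is illustrative.

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_answer(query, context):
    """Ask GPT-3.5-turbo to answer the query using only the retrieved context."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```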
Fallback to Wikipedia:
- If no relevant context is available, the system falls back to Wikipedia to answer the question, with proper handling of Wikipedia API requests and errors.
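A fallback sketch using the Wikipedia-API package (`wikipediaapi`), one library that requires the `user_agent` mentioned in the setup notes below; the choice of library and the user-agent string are assumptions.

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="ResolvaBot/1.0 (contact: your_email@example.com)",  # placeholder
    language="en",
)

def wikipedia_fallback(query):
    """Return a Wikipedia summary when no relevant textbook context is found."""
    try:
        page = wiki.page(query)
        if page.exists():
            return page.summary
        return "No relevant context found on Wikipedia."
    except Exception as exc:
        return f"Wikipedia lookup failed: {exc}"
```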
UI Development:
- Developed a user interface using Streamlit to demonstrate the system’s functionality.
- The interface allows users to input queries and view retrieved answers along with the corresponding textbook titles and page numbers.
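A minimal Streamlit sketch of that interface; `retrieve_and_answer` is a hypothetical stand-in for the retrieval-plus-LLM pipeline described above, stubbed here so the snippet runs on its own.

```python
import streamlit as st

def retrieve_and_answer(query):
    """Hypothetical stub for the RAPTOR retrieval + LLM answer pipeline."""
    return "Answer goes here.", [{"title": "Introduction to Algorithms", "page": 1}]

st.title("ResolvaBot")
query = st.text_input("Ask a question about the indexed textbooks")

if query:
    answer, sources = retrieve_and_answer(query)
    st.write(answer)
    st.caption("Sources: " + ", ".join(f"{s['title']} (p. {s['page']})" for s in sources))
```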
Prerequisites:
- Python 3.7 or later
- Git

Installation:
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
- Create a `.env` file in the root directory and add your OpenAI API key: `OPENAI_API_KEY=your_openai_api_key`
- For the Wikipedia API, ensure your `user_agent` is set appropriately in your code.
- Run the script to create the index from the textbooks:
python "Raptor indexing.py"
- Start the Streamlit app interface:
- Make sure you replace the directory and file paths with your own before running.
streamlit run Resolvabotapp.py
- Upload Textbooks: Use the Streamlit interface to upload textbooks for indexing.
- Query the System: Input queries into the interface to retrieve and view answers.
- View Results: Examine the results displayed, including relevant context and answers.
- Appropriateness of textbooks and completeness of content extraction.
- Effectiveness of data chunking and RAPTOR indexing processes.
- Quality of retrieval techniques, including query expansion and hybrid methods.
- Relevance and accuracy of re-ranking algorithms.
- Accuracy of LLM-generated answers.
- Overall system performance and efficiency.
- (Optional) User interface design and user experience.
- RAPTOR Paper: RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
- MILVUS Documentation: MILVUS Documentation
- Relevant Python Libraries:
- NLTK: Natural Language Toolkit
- Gensim: Gensim Documentation
- Transformers: Transformers Documentation
- PyMuPDF: PyMuPDF Documentation
- Pyserini: Pyserini Documentation
- Sentence-Transformers: Sentence-Transformers Documentation