ResolvaBot-LLM: A User-Query-Resolving Bot for Textbooks

A platform that uses large language models (LLMs) with retrieval-augmented natural language processing (NLP) to resolve user queries over textbook content, producing context-aware and precise answers.

Objective

The goal of this project is to build a complete system that extracts content from textbooks, indexes it with RAPTOR in a Milvus vector database, and answers questions with a large language model (LLM). The work covers data extraction, data processing, vector database creation, retrieval techniques, and natural language processing.

Task Description

1. Textbook Selection and Content Extraction

  1. Textbook Selection:

    • The following textbooks were selected for content extraction and processing in this project:
    1. Introduction to Algorithms (4th Edition)
      Authors: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein
      Publisher: MIT Press
      Link: Introduction to Algorithms (4th Edition)

    2. Handbook of Natural Language Processing (Second Edition)
      Editors: Dan Jurafsky and James H. Martin
      Publisher: Chapman & Hall/CRC
      Link: Handbook of Natural Language Processing (Second Edition)

    3. System Analysis and Design
      Publisher: Informatics Institute
      Link: System Analysis and Design

  2. Content Extraction:

    • Extracted content from the selected textbooks, ensuring thorough coverage of all relevant text.
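As an illustration, per-page extraction from a PDF textbook might look like the sketch below. It assumes the pypdf package (the project may use a different extractor); `clean_text` is a hypothetical helper that normalises the whitespace PDF extraction tends to leave behind.

```python
def clean_text(raw):
    """Collapse runs of whitespace and drop empty lines left by PDF extraction."""
    lines = (line.strip() for line in raw.splitlines())
    return " ".join(" ".join(line.split()) for line in lines if line)

def extract_pages(pdf_path):
    """Return (page_number, cleaned_text) pairs for every page of a PDF."""
    from pypdf import PdfReader  # imported here so clean_text works without pypdf
    reader = PdfReader(pdf_path)
    return [(i + 1, clean_text(page.extract_text() or ""))
            for i, page in enumerate(reader.pages)]
```

Keeping the page number alongside each page's text makes it possible to attach page metadata to chunks during indexing later.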

2. Data Chunking and RAPTOR Indexing

  1. Data Chunking:

    • Chunked the extracted content into short, contiguous passages of approximately 100 tokens each, preserving sentence boundaries.
    • Note: NLTK’s word_tokenize is used for tokenization.
  2. Embedding and Indexing:

    • Embedded the chunked texts using Sentence-BERT (SBERT) to create vector representations.
    • Implemented RAPTOR indexing with the following steps:
      • Clustering: Used Gaussian Mixture Models (GMMs) with soft clustering so that nodes can belong to multiple clusters.
      • Summarization: Summarized each cluster with an LLM (e.g., GPT-3.5-turbo) to create a concise representation.
      • Recursive Clustering and Summarization: Re-embedded the summaries and recursively applied clustering and summarization to form a hierarchical tree structure.
    • Stored the RAPTOR index in a Milvus vector database, including metadata such as textbook titles and page numbers.
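The chunking and recursive RAPTOR steps above can be sketched as follows. This is a minimal stand-in, not the project's implementation: a whitespace tokenizer replaces NLTK's word_tokenize, and the clustering and summarization functions are pluggable (in the project they would be GMM soft clustering over SBERT embeddings and GPT-3.5-turbo summarization); Milvus storage is omitted.

```python
import re

def chunk_text(text, max_tokens=100):
    """Split text into contiguous chunks of at most max_tokens words,
    never breaking a sentence across chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # whitespace stand-in for nltk.word_tokenize
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_raptor_tree(leaves, cluster_fn, summarize_fn, min_nodes=2):
    """Recursively cluster nodes and summarize each cluster, collecting
    one list of nodes per level until few enough nodes remain."""
    levels, nodes = [leaves], leaves
    while len(nodes) > min_nodes:
        clusters = cluster_fn(nodes)
        if len(clusters) >= len(nodes):
            break  # no progress; avoid looping forever
        nodes = [summarize_fn(members) for members in clusters]
        levels.append(nodes)
    return levels
```

Each entry of the returned levels list corresponds to one layer of the RAPTOR tree, from leaf chunks up to top-level summaries.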

3. Retrieval Techniques

  1. Query Expansion:

    • Implemented query expansion techniques such as synonym expansion, stemming, and lookups in external knowledge bases.
  2. Hybrid Retrieval Methods:

    • Combined BM25 (Best Match 25) with BERT/bi-encoder-based retrieval methods such as Dense Passage Retrieval (DPR) and Semantic Passage Retrieval (SPIDER).
  3. Re-ranking:

    • Re-ranked the retrieved passages by relevance and similarity using appropriate ranking algorithms.
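Taken together, the three retrieval steps can be sketched as below. The synonym map is a hypothetical stand-in for WordNet-style expansion, BM25 is implemented inline to keep the example self-contained, the dense scores are assumed to come from an SBERT bi-encoder, and a weighted min-max fusion stands in for the project's re-ranking stage.

```python
import math
from collections import Counter

# Hypothetical synonym map; the project could use WordNet or another KB.
SYNONYMS = {"algorithm": ["procedure", "method"], "sort": ["order"]}

def expand_query(tokens):
    """Append known synonyms to the query tokens."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, []))
    return expanded

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 over whitespace-tokenized, lowercased documents."""
    tokenized = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    df = Counter(term for d in tokenized for term in set(d))
    n_docs = len(docs)
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm_tf = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm_tf
        scores.append(score)
    return scores

def hybrid_rank(query, docs, dense_scores, alpha=0.5):
    """Rank docs by a weighted sum of min-max-normalized BM25 and dense scores."""
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]
    sparse = minmax(bm25_scores(query, docs))
    dense = minmax(dense_scores)
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

A cross-encoder could replace the weighted fusion for the final re-ranking pass; the fusion weight alpha trades off lexical against semantic evidence.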

4. Question Answering

  1. LLM Integration:

    • Used an LLM (e.g., OpenAI’s GPT-3.5-turbo) to generate accurate and relevant answers from the retrieved context.
  2. Fallback to Wikipedia:

    • If no relevant context is retrieved, the system falls back to Wikipedia to answer the question, with proper handling of Wikipedia API requests and errors.
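A sketch of the answering step with the Wikipedia fallback. It assumes the openai>=1.0 Python client (older versions use openai.ChatCompletion.create) and Wikipedia's REST summary endpoint; the User-Agent string is a placeholder, and using the raw question as the Wikipedia page title is a simplification.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def build_prompt(context, question):
    """Assemble a grounded QA prompt from retrieved context."""
    return ("Answer the question using only the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

def wiki_summary_url(title):
    """Wikipedia REST 'summary' endpoint URL for a page title."""
    return ("https://en.wikipedia.org/api/rest_v1/page/summary/"
            + urllib.parse.quote(title.replace(" ", "_")))

def wikipedia_summary(title, user_agent="ResolvaBot/0.1 (you@example.com)"):
    """Fetch a page summary, returning '' on any network or API failure."""
    req = urllib.request.Request(wiki_summary_url(title),
                                 headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp).get("extract", "")
    except (urllib.error.URLError, ValueError):
        return ""

def answer_question(question, context="", model="gpt-3.5-turbo"):
    if not context:                 # fallback when retrieval finds nothing
        context = wikipedia_summary(question)
    # Imported here so the helpers above work without the openai package.
    from openai import OpenAI
    client = OpenAI()               # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": build_prompt(context, question)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```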

5. User Interface (Optional)

  1. UI Development:
    • Developed a user interface with Streamlit to demonstrate the system’s functionality.
    • The interface lets users enter queries and view the retrieved answers along with the corresponding textbook titles and page numbers.

Streamlit interface (screenshot: ResolvaBot-LLM)

Installation and Setup

Prerequisites

  • Python 3.7 or later
  • Git

Clone the Repository

git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM

Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install Dependencies

pip install -r requirements.txt

Setup Environment Variables

  • Create a .env file in the root directory and add your OpenAI API key:
OPENAI_API_KEY=your_openai_api_key
  • For Wikipedia API, ensure your user_agent is set appropriately in your code.
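The .env file is presumably loaded with a library such as python-dotenv; as an illustration, a minimal stdlib-only loader could look like this.

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; '#' comments and blanks ignored."""
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                # setdefault keeps any value already exported by the shell
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env file is fine; variables may come from the shell
```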

Initialize the Index

  1. Run the indexing script to build the index from the textbooks (quote the filename, since it contains a space):
python "Raptor indexing.py"

Run the Application

  1. Start the Streamlit app (update any hard-coded directory or file paths in the script first):
streamlit run Resolvabotapp.py

Usage

  1. Upload Textbooks: Use the Streamlit interface to upload textbooks for indexing.
  2. Query the System: Input queries into the interface to retrieve and view answers.
  3. View Results: Examine the results displayed, including relevant context and answers.

Evaluation Criteria

  • Appropriateness of textbooks and completeness of content extraction.
  • Effectiveness of data chunking and RAPTOR indexing processes.
  • Quality of retrieval techniques, including query expansion and hybrid methods.
  • Relevance and accuracy of re-ranking algorithms.
  • Accuracy of LLM-generated answers.
  • Overall system performance and efficiency.
  • (Optional) User interface design and user experience.
