The RooCode Data Query Application is a Python-based tool designed to provide a chat-based interface for querying a specialized knowledge base. It utilizes Streamlit for its professional, dark-themed user interface, allowing users to interact intuitively with data. The knowledge base is built from information scraped from Reddit (specifically the "RooCode" community or user-configured subreddits) and potentially other text sources like GitHub repository data. This application leverages local Large Language Models (LLMs) served via Ollama and employs ChromaDB as a vector store to enable Retrieval Augmented Generation (RAG), delivering informed and contextually relevant answers.
- Chat-Based Querying: Interact with your data using natural language questions.
- Configurable Reddit Data Scraping: Tailor data collection from specific subreddits with adjustable post limits.
- Vector Store Ingestion: A robust pipeline processes text data and populates a ChromaDB vector store for efficient semantic search.
- Local LLM Integration: Seamlessly connects with local LLMs hosted by Ollama.
- LLM Selection: Dynamically choose from available Ollama models directly within the UI.
- Polished UI: A dark-themed, professional interface built with Streamlit, featuring a sidebar for controls and clear chat display.
- Chat Management: Easily clear chat history for a fresh start.
- Performance Metrics: Displays response time for each AI-generated answer.
- Core: Python 3.7+
- Web UI: Streamlit
- LLM Integration: Ollama, `ollama` Python library
- RAG & Orchestration: LangChain
- Vector Database: ChromaDB (via `langchain_chroma`)
- Embeddings: HuggingFace Sentence Transformers (via `langchain_huggingface`, e.g., `sentence-transformers/all-MiniLM-L6-v2`)
- Reddit Scraping: PRAW (Python Reddit API Wrapper)
- Configuration Database: TinyDB (for storing agent/model settings)
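The settings store is not documented in detail here, but as a rough sketch of how TinyDB could persist the selected model, assuming a hypothetical `settings.json` file and an illustrative schema:

```python
from tinydb import TinyDB, Query

# Hypothetical settings store; the schema actually used by the app may differ.
db = TinyDB("settings.json")
Agent = Query()

# Save or update the model selected for an agent.
db.upsert({"agent": "default", "model": "llama3:8b"}, Agent.agent == "default")

# Read the setting back, e.g., when the app starts.
record = db.get(Agent.agent == "default")
print(record["model"])  # -> "llama3:8b"
```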
- Python: Version 3.7 or higher.
- Ollama: The Ollama service must be installed, running, and accessible.
- Ensure desired models (e.g., Llama3, Qwen2.5) are downloaded:
  ```bash
  ollama pull llama3:8b
  ollama pull qwen2.5:1.5b
  # Add any other models you wish to use
  ```
- Git: Required for cloning the repository.
- Clone the Repository:
  ```bash
  git clone <your_repository_url>   # Example: git clone https://github.com/yourusername/roocode-data-query.git
  cd <repository_directory>        # Example: cd roocode-data-query
  ```
- Install Dependencies: It's highly recommended to use a virtual environment:
  ```bash
  python -m venv venv
  source venv/bin/activate   # On Windows use `venv\Scripts\activate`
  pip install -r requirements.txt
  ```
- Configure Reddit API Access:
  - Copy the example Reddit configuration file:
    ```bash
    cp reddit.config.example.json reddit.config.json
    ```
  - Edit `reddit.config.json` with your own Reddit API `client_id`, `client_secret`, and `user_agent`.
    - To obtain these credentials, visit Reddit's app preferences page and create a new application (select the "script" type). Your `client_id` appears under the "personal use script" section, and the `client_secret` is provided there as well. The `user_agent` can be any descriptive string (e.g., "RooCodeQueryApp/0.1 by YourUsername").
  - Optionally, customize `subreddit` (e.g., "learnpython", "LocalLLaMA"), `post_limit`, and `output_file` within `reddit.config.json`. It is recommended to keep `output_file` as `"reddit_data.txt"`, since that is the default filename expected by `ingest.py`. A sample configuration is shown below.
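For reference, a filled-in `reddit.config.json` might look like the following. The field names come from the options listed above; the values are placeholders, and `reddit.config.example.json` remains the authoritative template:

```json
{
  "client_id": "YOUR_CLIENT_ID",
  "client_secret": "YOUR_CLIENT_SECRET",
  "user_agent": "RooCodeQueryApp/0.1 by YourUsername",
  "subreddit": "RooCode",
  "post_limit": 100,
  "output_file": "reddit_data.txt"
}
```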
- Prepare GitHub Data (Optional):
  - If you have relevant GitHub data (e.g., code snippets, documentation excerpts, issue discussions), save it as plain text in a file named `github_data.txt` in the root directory of the project.
  - The `ingest.py` script will automatically look for this file and process its content if it exists.
Make sure your Ollama service is running before starting the application.
- Step 1: Scrape Reddit Data:
  - Ensure `reddit.config.json` is correctly configured.
  - Run the Reddit scraper script from the project's root directory:
    ```bash
    python scrape_reddit.py
    ```
  - This will create or update the `reddit_data.txt` file (or the filename specified in your config) with the fetched data. A sketch of the scraping logic appears after this step.
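For orientation, the scraping step boils down to something like the PRAW sketch below. The config keys match those described in the setup section, but details such as which listing is fetched (`hot`, `new`, etc.) and how posts are formatted are assumptions about `scrape_reddit.py`, not a copy of it:

```python
import json

import praw

# Load credentials and scraping parameters from the config file.
with open("reddit.config.json") as f:
    cfg = json.load(f)

reddit = praw.Reddit(
    client_id=cfg["client_id"],
    client_secret=cfg["client_secret"],
    user_agent=cfg["user_agent"],
)

# Fetch posts from the configured subreddit and dump them as plain text.
with open(cfg.get("output_file", "reddit_data.txt"), "w", encoding="utf-8") as out:
    for post in reddit.subreddit(cfg.get("subreddit", "RooCode")).hot(limit=cfg.get("post_limit", 100)):
        out.write(f"{post.title}\n{post.selftext}\n\n")
```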
- Step 2: Ingest Data into Vector Store:
  - Run the ingestion script from the project's root directory:
    ```bash
    python ingest.py
    ```
  - This script processes the text from `reddit_data.txt` (and `github_data.txt` if it exists) and populates or updates the local ChromaDB vector store located in the `./chroma_db` directory. A sketch of the ingestion pipeline appears after this step.
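Conceptually, ingestion splits the raw text into chunks, embeds them, and persists them to ChromaDB. The sketch below uses the libraries named in the tech stack; the chunk sizes and splitter settings are illustrative assumptions rather than the exact values in `ingest.py`:

```python
from pathlib import Path

from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read whichever source files exist (github_data.txt is optional).
sources = [Path("reddit_data.txt"), Path("github_data.txt")]
texts = [p.read_text(encoding="utf-8") for p in sources if p.exists()]

# Split the raw text into overlapping chunks for retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents(texts)

# Embed the chunks and persist them to the local ChromaDB store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
```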
- Step 3: Launch the Streamlit Application:
  - Run the Streamlit app from the project's root directory:
    ```bash
    streamlit run app.py
    ```
  - Streamlit will typically print a local URL (e.g., `http://localhost:8501`). Open this URL in your web browser to interact with the RooCode Data Query application. A sketch of what happens on each query appears after this step.
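Under the hood, each chat turn is retrieval followed by generation, roughly as below. The model name and prompt wording are illustrative; `app.py` may structure its chain differently:

```python
import ollama
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Open the persisted vector store with the same embedding model used at ingest time.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

question = "How do I configure the Reddit scraper?"
docs = vectordb.similarity_search(question, k=4)  # top matching chunks
context = "\n\n".join(d.page_content for d in docs)

# Ask a local Ollama model to answer using the retrieved context.
response = ollama.chat(
    model="llama3:8b",
    messages=[{"role": "user", "content": f"Answer using this context:\n\n{context}\n\nQuestion: {question}"}],
)
print(response["message"]["content"])
```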
This project includes comprehensive documentation to help you understand its architecture, setup, and usage:
- Product Requirements Document (`PRODUCT_REQUIREMENTS.md`): Detailed overview of the application's goals, features, and target audience.
- Backend Documentation (`BACKEND_DOCUMENTATION.md`): In-depth information about the backend components, data flow, and key libraries.
- Frontend Documentation (`FRONTEND_DOCUMENTATION.md`): Description of the UI structure, styling, and user interaction logic.
- User Flow Documentation (`USER_FLOW_DOCUMENTATION.md`): Step-by-step guide through typical user journeys within the application.
- Automation Guide (`AUTOMATION_GUIDE.md`): Instructions for setting up automated daily data updates and notifications.
- Ollama Connection Issues:
  - Ensure the Ollama service is running and accessible on your system (see the connectivity check after this list).
  - Verify that the models you intend to use (e.g., `llama3:8b`) have been pulled using `ollama pull <model_name>`.
  - Check that the model names in the Streamlit UI match those available in Ollama. Use the "Refresh Models" button in the UI.
- `reddit.config.json` Errors:
  - Ensure the file `reddit.config.json` exists in the root directory.
  - Double-check that your Reddit API `client_id`, `client_secret`, and `user_agent` are correct and that the JSON structure is valid.
- Python Dependencies & `pip install` Issues:
  - Make sure you are using a compatible Python version (3.7+).
  - If `pip install -r requirements.txt` fails, check your internet connection and try upgrading pip (`pip install --upgrade pip`).
  - Ensure you have activated your virtual environment if you are using one.
- "No relevant context found" or Unsatisfactory Answers:
- This may indicate that the scraped data or
github_data.txt
does not contain information relevant to your query. Consider expanding your data sources or refining your scraping parameters inreddit.config.json
. - Ensure the
ingest.py
script ran successfully after updating data sources. - Experiment with different LLMs available through Ollama, as some may perform better on certain types of queries.
- This may indicate that the scraped data or
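As a quick sanity check for the connection issues above, the `ollama` Python library can confirm that the service is reachable and list the locally available models:

```python
import ollama

try:
    models = ollama.list()["models"]
except Exception as exc:
    print(f"Cannot reach Ollama: {exc}")
else:
    # The key is "model" in recent client versions, "name" in older ones.
    print([m.get("model") or m.get("name") for m in models])
```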
Refer to the LICENSE file for licensing information regarding this project. (If a LICENSE file is not present, you might consider adding one, e.g., the MIT License.)