RooCode Data Query Application

Overview

The RooCode Data Query Application is a Python-based tool that provides a chat-based interface for querying a specialized knowledge base. It uses Streamlit for a professional, dark-themed user interface, allowing users to interact with the data intuitively. The knowledge base is built from information scraped from Reddit (specifically the "RooCode" community or user-configured subreddits) and, optionally, from other text sources such as GitHub repository data. The application leverages local Large Language Models (LLMs) served via Ollama and uses ChromaDB as a vector store to enable Retrieval Augmented Generation (RAG), delivering informed and contextually relevant answers.
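
For orientation, the query path can be sketched in a few lines of Python. This is a minimal illustration of the RAG flow described above, not the application's actual code; it assumes the ChromaDB store is persisted in ./chroma_db and uses the embedding model and libraries listed under Tech Stack below.

    # Minimal RAG sketch (illustrative; not the app's actual code)
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_chroma import Chroma
    import ollama

    # Open the persisted vector store with the same embedding model used at ingest time
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

    # Retrieve the chunks most relevant to the user's question
    question = "How do I configure the Reddit scraper?"
    docs = db.similarity_search(question, k=4)
    context = "\n\n".join(d.page_content for d in docs)

    # Ask a local Ollama model to answer from the retrieved context
    reply = ollama.chat(model="llama3:8b", messages=[
        {"role": "user", "content": f"Answer using this context:\n{context}\n\nQuestion: {question}"},
    ])
    print(reply["message"]["content"])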

Features

  • Chat-Based Querying: Interact with your data using natural language questions.
  • Configurable Reddit Data Scraping: Tailor data collection from specific subreddits with adjustable post limits.
  • Vector Store Ingestion: A robust pipeline processes text data and populates a ChromaDB vector store for efficient semantic search.
  • Local LLM Integration: Seamlessly connects with local LLMs hosted by Ollama.
  • LLM Selection: Dynamically choose from available Ollama models directly within the UI.
  • Polished UI: A dark-themed, professional interface built with Streamlit, featuring a sidebar for controls and clear chat display.
  • Chat Management: Easily clear chat history for a fresh start.
  • Performance Metrics: Displays response time for each AI-generated answer.

Tech Stack

  • Core: Python 3.7+
  • Web UI: Streamlit
  • LLM Integration: Ollama, ollama Python library
  • RAG & Orchestration: Langchain
  • Vector Database: ChromaDB (via langchain_chroma)
  • Embeddings: HuggingFace Sentence Transformers (via langchain_huggingface, e.g., sentence-transformers/all-MiniLM-L6-v2)
  • Reddit Scraping: PRAW (Python Reddit API Wrapper)
  • Configuration Database: TinyDB (for storing agent/model settings)
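
Based on this stack, a requirements.txt for the project plausibly contains packages along the following lines (the repository's actual file and version pins may differ):

    streamlit
    praw
    ollama
    langchain
    langchain-chroma
    langchain-huggingface
    chromadb
    sentence-transformers
    tinydb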

Prerequisites

  • Python: Version 3.7 or higher.
  • Ollama: The Ollama service must be installed, running, and accessible.
    • Ensure desired models (e.g., Llama3, Qwen2.5) are downloaded:
      ollama pull llama3:8b
      ollama pull qwen2.5:1.5b 
      # Add any other models you wish to use
  • Git: Required for cloning the repository.

Installation & Setup

  1. Clone the Repository:

    git clone <your_repository_url> 
    # Example: git clone https://github.com/yourusername/roocode-data-query.git
    cd <repository_directory> 
    # Example: cd roocode-data-query
  2. Install Dependencies: It's highly recommended to use a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
    pip install -r requirements.txt
  3. Configure Reddit API Access:

    • Copy the example Reddit configuration file:
      cp reddit.config.example.json reddit.config.json
    • Edit reddit.config.json with your own Reddit API client_id, client_secret, and user_agent (an example configuration appears after this list).
      • To obtain these credentials, visit Reddit's app preferences page (https://www.reddit.com/prefs/apps) and create a new application of the "script" type. Your client_id appears under "personal use script", and the client_secret is shown alongside it. The user_agent can be any descriptive string (e.g., "RooCodeQueryApp/0.1 by YourUsername").
    • Optionally, customize subreddit (e.g., "learnpython", "LocalLLaMA"), post_limit, and output_file in reddit.config.json. Keeping output_file as "reddit_data.txt" is recommended, since that is the default filename expected by ingest.py.
  4. Prepare GitHub Data (Optional):

    • If you have relevant GitHub data (e.g., code snippets, documentation excerpts, issue discussions), save this information as plain text in a file named github_data.txt in the root directory of the project.
    • The ingest.py script will automatically look for this file and process its content if it exists.
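
For reference, a filled-in reddit.config.json might look like the following. The field names come from the steps above; the credential values are placeholders, and the exact structure of the file shipped with the repository may differ slightly.

    {
      "client_id": "YOUR_CLIENT_ID",
      "client_secret": "YOUR_CLIENT_SECRET",
      "user_agent": "RooCodeQueryApp/0.1 by YourUsername",
      "subreddit": "RooCode",
      "post_limit": 100,
      "output_file": "reddit_data.txt"
    }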

Running the Application

Make sure your Ollama service is running before starting the application.

  1. Scrape Reddit Data:

    • Ensure reddit.config.json is correctly configured.
    • Run the Reddit scraper script from the project's root directory:
      python scrape_reddit.py
    • This creates or updates the reddit_data.txt file (or the filename specified in your config) with the fetched data. A minimal sketch of this step appears after this list.
  2. Ingest Data into the Vector Store:

    • Run the ingestion script from the project's root directory:
      python ingest.py
    • This script processes the text from reddit_data.txt (and github_data.txt, if present) and populates or updates the local ChromaDB vector store in the ./chroma_db directory. A sketch of the ingestion step also follows this list.
  3. Launch the Streamlit Application:

    • Run the Streamlit app from the project's root directory:
      streamlit run app.py
    • Streamlit will typically provide a local URL (e.g., http://localhost:8501). Open this URL in your web browser to interact with the RooCode Data Query application.
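
As a rough illustration of Step 1, the core PRAW calls in a scraper like scrape_reddit.py look roughly like this. This is a hedged sketch, not the script's actual code; the configuration keys match the example shown in Installation & Setup.

    # Illustrative sketch of the scraping step (not the actual scrape_reddit.py)
    import json
    import praw

    with open("reddit.config.json") as f:
        cfg = json.load(f)

    reddit = praw.Reddit(
        client_id=cfg["client_id"],
        client_secret=cfg["client_secret"],
        user_agent=cfg["user_agent"],
    )

    # Fetch posts from the configured subreddit and write them out as plain text
    with open(cfg.get("output_file", "reddit_data.txt"), "w", encoding="utf-8") as out:
        for post in reddit.subreddit(cfg["subreddit"]).hot(limit=cfg["post_limit"]):
            out.write(f"{post.title}\n{post.selftext}\n\n")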
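
Step 2 conceptually amounts to splitting the scraped text into chunks, embedding them, and persisting them to ChromaDB. A minimal sketch under those assumptions (chunk sizes and the exact calls in ingest.py may differ):

    # Illustrative sketch of the ingestion step (not the actual ingest.py)
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_chroma import Chroma

    with open("reddit_data.txt", encoding="utf-8") as f:
        text = f.read()

    # Split the raw text into overlapping chunks for semantic retrieval
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(text)

    # Embed the chunks and persist them to the local ChromaDB store
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    Chroma.from_texts(chunks, embeddings, persist_directory="./chroma_db")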

Project Documentation

This project includes comprehensive documentation to help you understand its architecture, setup, and usage.

Troubleshooting

  • Ollama Connection Issues:
    • Ensure the Ollama service is running and accessible on your system (two quick checks appear after this list).
    • Verify that the models you intend to use (e.g., llama3:8b) have been pulled using ollama pull <model_name>.
    • Check that the model names in the Streamlit UI match those available in Ollama. Use the "Refresh Models" button in the UI.
  • reddit.config.json Errors:
    • Ensure the file reddit.config.json exists in the root directory.
    • Double-check that your Reddit API client_id, client_secret, and user_agent are correct and that the JSON structure is valid.
  • Python Dependencies & pip install Issues:
    • Make sure you are using a compatible Python version (3.7+).
    • If pip install -r requirements.txt fails, check your internet connection. Try upgrading pip (pip install --upgrade pip).
    • Ensure you have activated your virtual environment if you are using one.
  • "No relevant context found" or Unsatisfactory Answers:
    • This may indicate that the scraped data or github_data.txt does not contain information relevant to your query. Consider expanding your data sources or refining your scraping parameters in reddit.config.json.
    • Ensure the ingest.py script ran successfully after updating data sources.
    • Experiment with different LLMs available through Ollama, as some may perform better on certain types of queries.
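
For the Ollama issues above, two quick checks from a terminal (Ollama's API listens on port 11434 by default):

    ollama list                            # lists the models available locally
    curl http://localhost:11434/api/tags   # confirms the Ollama API is reachable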

License

Refer to the LICENSE file for licensing information. If a LICENSE file is not present, consider adding one (e.g., the MIT License).
