This application is a Streamlit-based interactive Q&A system that uses Retrieval-Augmented Generation (RAG) to answer questions about the contents of a CSV file uploaded by the user. The uploaded CSV is indexed into a Chroma vector database, with its content embedded by a SentenceTransformers model for efficient and scalable retrieval. Once the data is indexed, users can query the dataset and receive responses powered by a language model.
- Upload CSV Files: Upload CSV files containing the Twitter/X data that you want to query.
- Data Storage with VectorDB: CSV data is converted into embeddings and stored in a vector database (ChromaDB) for fast retrieval.
- Interactive Question-Answering: Users can ask questions about the CSV data via an interactive chat interface, and the AI provides relevant answers based on the stored embeddings.
- Seamless User Experience: The chat interface is designed to feel like a natural conversation, with message bubbles for both user and AI responses.
- Persistent Chat History: The app keeps track of the entire conversation for a continuous and dynamic user experience.
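The retrieval step behind these features can be illustrated with a minimal, dependency-free sketch: documents are embedded as vectors and the closest ones to a query vector are returned. The real app uses SentenceTransformers embeddings and ChromaDB; the toy 3-dimensional vectors and the helper names `cosine` and `top_k` below are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """index: list of (doc_text, embedding) pairs; returns k most similar docs."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

# Toy "embeddings" standing in for all-MiniLM-L6-v2 output.
index = [
    ("tweet about python", [0.9, 0.1, 0.0]),
    ("tweet about cooking", [0.0, 0.9, 0.1]),
    ("tweet about streamlit", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], index, k=2))
```

In the app, ChromaDB performs this nearest-neighbor search over the stored CSV embeddings instead of a linear scan.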
To run the project locally, follow these steps:
- Python 3.10+
- Streamlit
- LangChain
- Ollama
- DeepSeek-R1 (served via Ollama)
- ChromaDB
- Clone the repository:

  ```bash
  git clone https://github.com/vishukla/streamlit-qa-csv-rag.git
  cd streamlit-qa-csv-rag
  ```
- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use: venv\Scripts\activate
  ```
- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Once the environment is set up, you can run the Streamlit app:

```bash
streamlit run app.py
```

This will launch a local Streamlit server, and the app will be accessible in your web browser at http://localhost:8501.
- Upload a CSV file containing the tweets by clicking the "Upload CSV to Vector DB" tab.
- The CSV file should have the following columns:
  - `body`: The textual content to be indexed.
  - `id`: A unique identifier for each row (e.g., tweet ID, document ID).
  - `hashtags`, `author`, `mentions`: Metadata associated with each row.
- The data will be indexed into the Chroma vector store, and you'll see a preview of the CSV data.
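Preparing rows with these columns for indexing can be sketched with the standard library alone. The column names match the ones above; the helper name `rows_to_documents` is illustrative, not the app's actual function.

```python
import csv
import io

def rows_to_documents(csv_text):
    """Split CSV rows into texts to embed, per-row metadata, and ids."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs, metadatas, ids = [], [], []
    for row in reader:
        docs.append(row["body"])  # text that gets embedded
        ids.append(row["id"])     # unique identifier per row
        metadatas.append({k: row[k] for k in ("hashtags", "author", "mentions")})
    return docs, metadatas, ids

sample = "id,body,hashtags,author,mentions\n1,hello world,#hi,@alice,@bob\n"
docs, metas, ids = rows_to_documents(sample)
print(docs, ids, metas[0]["author"])
```

Lists shaped like these (documents, metadatas, ids) are the typical inputs when adding records to a Chroma collection.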
- Go to the "Question Answering" tab to ask questions related to the data you uploaded.
- The app uses a Retrieval-Augmented Generation (RAG) approach to find the most relevant information in the CSV and generate an AI-powered answer.
- Once the CSV is uploaded and the database is created, users can type queries into an interactive chat interface.
- Each user query and AI response is displayed as a message bubble.
- The interface supports continuous conversation, keeping track of the chat history.
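The chat-history mechanism can be sketched without Streamlit: in the app, a message list lives in `st.session_state` and is rendered with Streamlit's chat elements, while here a plain dict stands in for session state.

```python
# A plain dict stands in for st.session_state in this sketch.
session_state = {"messages": []}

def add_message(role, content):
    """Append one chat turn; the app would re-render all turns on each run."""
    session_state["messages"].append({"role": role, "content": content})

add_message("user", "Which author tweets about Python?")
add_message("assistant", "Based on the indexed tweets, @alice does.")
print(len(session_state["messages"]))
```

Keeping the full message list in session state is what lets the conversation persist across Streamlit's script reruns.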
This app uses the `all-MiniLM-L6-v2` model from Sentence-Transformers to create embeddings. You can change the embedding model by modifying the `EMBEDDING_MODEL` variable in the `app.py` script:

```python
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Change to another model if desired
```
The vector store is persisted in the `./chroma_db` directory. You can change this directory by modifying the `PERSIST_DIRECTORY` variable in the script:

```python
PERSIST_DIRECTORY = "./chroma_db"  # Modify path if desired
```
The current setup uses DeepSeek-R1's `deepseek-r1:14b` served via Ollama. Change this by updating the `LLM_MODEL` variable in the script:

```python
LLM_MODEL = "deepseek-r1:14b"  # Update as needed
```
- ChromaDB stores the embeddings of the CSV data.
- SentenceTransformer (`all-MiniLM-L6-v2`) is used to create embeddings for the textual data in the CSV.
- The app uses the RAG approach to combine the information retrieval from the vector database with generative language models to answer user queries.
- The DeepSeek-R1 model, served via Ollama, generates answers based on the retrieved context.
- The user interacts with the application through a chat interface in Streamlit, which updates dynamically as the conversation progresses.
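The "combine retrieval with generation" step above amounts to stuffing the retrieved documents into a prompt that is then sent to the LLM. The template wording and the helper name `build_prompt` below are illustrative, not the app's actual prompt.

```python
# Illustrative RAG prompt template; the app's real prompt may differ.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(retrieved_docs, question):
    """Join retrieved documents into a context block and fill the template."""
    context = "\n---\n".join(retrieved_docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(["tweet A", "tweet B"], "What is tweet A about?")
print(prompt)
```

In the app, a prompt assembled this way would be passed to `deepseek-r1:14b` via Ollama, so the model answers from the retrieved tweets rather than from its parametric memory alone.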
- Issue with CSV Format: Ensure your CSV contains the correct columns (e.g., `body`, `id`, `hashtags`, `author`, `mentions`).
- Model Errors: If you're using an external language model such as OpenAI, ensure your API keys and environment variables are set correctly and that the LangChain model is initialized in the script. Alternatively, if you're using another model served by Ollama, make sure to update the `LLM_MODEL` variable in the script.
- Chroma Errors: If you encounter issues with Chroma, ensure the persistent directory is accessible, and try clearing the `./chroma_db` folder.