A hierarchical multi-agent system to mimic the job duties of a data scientist, built using LangGraph and vanna.ai.
Medium article: AI Data Scientist: An Attempt to Replace My Own Job
- Coder Agent: Generates and executes Python code, returning results.
- Data Analyst Agent: Answers questions about data using text-to-SQL.
- Slides Generator Agent: Creates PowerPoint presentations.
The agents are managed by a supervisor to fulfill user requests, with short-term memory persistence.
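As a rough illustration of the supervisor pattern described above, here is a minimal plain-Python sketch. It is not the repository's LangGraph implementation: the agent functions, the keyword-based `route` rule, and the `history` list are all hypothetical stand-ins. In the real system the supervisor would typically let an LLM choose the next agent.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the three agents; each takes a request and returns text.
def coder_agent(request: str) -> str:
    return f"[coder] would generate and run Python for: {request}"

def data_analyst_agent(request: str) -> str:
    return f"[data_analyst] would translate to SQL: {request}"

def slides_agent(request: str) -> str:
    return f"[slides] would build a presentation for: {request}"

AGENTS: Dict[str, Callable[[str], str]] = {
    "coder": coder_agent,
    "data_analyst": data_analyst_agent,
    "slides": slides_agent,
}

def route(request: str) -> str:
    """Pick an agent by keyword. A real supervisor would ask an LLM to decide."""
    text = request.lower()
    if "slide" in text or "powerpoint" in text:
        return "slides"
    if "plot" in text or "code" in text or "python" in text:
        return "coder"
    return "data_analyst"

def supervise(request: str, history: List[dict]) -> str:
    """Route the request, call the chosen agent, and record the turn."""
    name = route(request)
    answer = AGENTS[name](request)
    history.append({"request": request, "agent": name, "answer": answer})
    return answer

history: List[dict] = []  # in-memory stand-in for short-term memory
print(supervise("What were total sales last month?", history))
print(supervise("Make slides summarizing the result", history))
```

Keeping the routing decision separate from the agents means each agent can be swapped or tested on its own.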
Visualizing the system:
- Clone the repository
git clone https://github.com/cintiaching/ai-data-scientist
cd ai-data-scientist
- Install dependencies using uv
uv sync
source .venv/bin/activate
- Create a .env file from .env.example and set the required environment variables.
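To confirm that every variable listed in .env.example has a value in .env, a small check like the following may help. This is a sketch that assumes both files use simple KEY=VALUE lines; the helper names `parse_env` and `missing_keys` are illustrative and not part of the project.

```python
from pathlib import Path
from typing import Dict, List

def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_keys(example_text: str, env_text: str) -> List[str]:
    """Return keys listed in the example that are absent or empty in the env file."""
    expected = parse_env(example_text)
    actual = parse_env(env_text)
    return [key for key in expected if not actual.get(key)]

if __name__ == "__main__":
    example, env = Path(".env.example"), Path(".env")
    if example.exists() and env.exists():
        missing = missing_keys(example.read_text(), env.read_text())
        print("Missing or empty:", missing if missing else "none")
    else:
        print("Expected both .env.example and .env in the current directory")
```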
This project stores its data in an SQLite database. By default, it pulls data from dataceo/sales-and-customer-data on Kaggle, which is downloaded automatically from Kaggle Hub when you run ingest_data.py. You can modify the script to ingest data of your choice.
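If you want to ingest your own data, the core step is loading rows into an SQLite table. The sketch below uses only the standard library and is not the contents of ingest_data.py: the `ingest_csv` helper, the `sales` table name, and the sample columns are hypothetical, and every column is stored as text for simplicity.

```python
import csv
import io
import sqlite3

def ingest_csv(conn: sqlite3.Connection, table: str, csv_file) -> int:
    """Create `table` from the CSV header and insert every row as text.

    The table name is interpolated into SQL, so it must come from trusted code.
    Returns the number of rows inserted.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # Skip malformed rows whose length does not match the header.
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)

# Hypothetical sample data standing in for a downloaded CSV file.
sample = io.StringIO("invoice_no,category,price\nI001,Shoes,59.9\nI002,Books,12.5\n")
conn = sqlite3.connect(":memory:")  # use a file path to persist the database
count = ingest_csv(conn, "sales", sample)
print(count, conn.execute('SELECT category FROM "sales"').fetchall())
```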
Since it uses Vanna.ai, training is required for the agent to understand your data, similar to how a data scientist
learns about their dataset. For more details, check Vanna.ai Training Documentation.
Modify train.py to incorporate your domain knowledge and use case.
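As a sketch of what such training data might look like, the snippet below groups three kinds of input that Vanna's `train` method accepts as keyword arguments: `ddl`, `documentation`, and `question` with `sql`. The table, columns, and business rule are hypothetical, and `apply_training` is an illustrative helper rather than code from train.py. The `DryRun` class only records calls so the example runs without a configured Vanna instance; pass your real instance instead.

```python
from typing import Any, Dict, List

# Hypothetical training items for an illustrative "sales" table.
TRAINING_ITEMS: List[Dict[str, str]] = [
    {"ddl": "CREATE TABLE sales (invoice_no TEXT, category TEXT, price REAL, invoice_date TEXT)"},
    {"documentation": "Revenue means the sum of price over all rows in sales."},
    {
        "question": "What is the total revenue per category?",
        "sql": "SELECT category, SUM(price) AS revenue FROM sales GROUP BY category",
    },
]

def apply_training(vn: Any, items: List[Dict[str, str]]) -> int:
    """Pass each item to vn.train(...) as keyword arguments; returns the number applied."""
    for item in items:
        vn.train(**item)
    return len(items)

class DryRun:
    """Stand-in that records calls instead of contacting Vanna."""
    def __init__(self) -> None:
        self.calls: List[Dict[str, str]] = []

    def train(self, **kwargs: str) -> None:
        self.calls.append(kwargs)

recorder = DryRun()
print(apply_training(recorder, TRAINING_ITEMS), "items")
```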
- Navigate to agents/llm.py to configure the LLM settings.
- Ensure the necessary environment variables are set for LLM configuration.
- Model Recommendation: Use a capable LLM for code generation. For options, visit the Chatbot Arena Benchmark.
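One possible way to read LLM settings from environment variables is sketched below. The variable names `LLM_MODEL`, `LLM_API_KEY`, and `LLM_TEMPERATURE` and the `LLMSettings` class are assumptions made for illustration; check agents/llm.py and .env.example for the names the project actually uses.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMSettings:
    model: str
    api_key: str
    temperature: float

def load_llm_settings(env=os.environ) -> LLMSettings:
    """Read settings from a mapping; the variable names here are hypothetical."""
    missing = [name for name in ("LLM_MODEL", "LLM_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return LLMSettings(
        model=env["LLM_MODEL"],
        api_key=env["LLM_API_KEY"],
        temperature=float(env.get("LLM_TEMPERATURE", "0")),
    )

# Passing a plain dict keeps the example independent of the real environment.
settings = load_llm_settings({"LLM_MODEL": "example-model", "LLM_API_KEY": "placeholder"})
print(settings.model, settings.temperature)
```

Failing early with the list of missing names tends to be easier to debug than an authentication error raised later by the LLM client.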
Example code:
python main.py
To start the local Streamlit UI, run:
python -m streamlit run app.py