An autonomous data scientist agent that uses the ReAct (Reasoning and Acting) paradigm to perform data analysis tasks. The agent can ingest data, clean it, perform exploratory data analysis (EDA), generate visualizations, and compile reports.
- ReAct Paradigm: Alternates between reasoning ("Thought"), acting ("Action"), and observing results ("Observation"); see the loop sketch after this list
- Data Ingestion: Reads data from Excel files and SQL databases
- Data Cleaning: Handles missing values, removes duplicates, converts data types
- Exploratory Data Analysis: Computes statistics, identifies correlations, analyzes features
- Visualization: Generates various types of plots (line, bar, scatter, histogram, boxplot, heatmap)
- Reporting: Creates HTML or PDF reports with summaries and visualizations
- LLM Integration: Uses an LLM (e.g., GPT-4) for natural language reasoning
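To make the ReAct loop concrete, here is a minimal sketch of the Thought/Action/Observation cycle. All names in it (`react_loop`, `llm.complete`, `parse_action`, the `Action:`/`Final Answer:` conventions) are illustrative assumptions; the project's actual loop lives in `main_controller.py` and differs in detail.

```python
import json
import re

def parse_action(response: str):
    """Hypothetical parser for a line like: Action: read_excel {"path": "data/sales.xlsx"}"""
    match = re.search(r'Action:\s*(\w+)\s*(\{.*\})?', response)
    name = match.group(1)
    kwargs = json.loads(match.group(2)) if match.group(2) else {}
    return name, kwargs

def react_loop(task: str, llm, tools: dict, max_iterations: int = 10) -> str:
    """Minimal Thought/Action/Observation loop (illustrative only)."""
    history = [f"Task: {task}"]
    for _ in range(max_iterations):
        # Thought + Action: ask the LLM for the next step given the history so far.
        response = llm.complete("\n".join(history))      # llm.complete is a stand-in API
        history.append(response)
        if "Final Answer:" in response:
            return response                              # the agent decided it is done
        name, kwargs = parse_action(response)
        observation = tools[name](**kwargs)              # run the chosen tool function
        history.append(f"Observation: {observation}")    # feed the result back into the context
    return "Stopped: maximum iterations reached"
```

Each iteration appends the observation to the running context, so the next "Thought" is conditioned on everything the agent has seen so far.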
- Clone this repository:
  git clone <repository-url>
  cd data-scientist-agent
- Install dependencies:
  pip install -r requirements.txt
- Set up your environment variables:
  - Create a `.env` file with your OpenAI API key:
    OPENAI_API_KEY=your_api_key_here
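To check that the key is being picked up, a minimal sketch is shown below. It assumes the python-dotenv package; whether `llm_interface.py` actually loads the key this way is an assumption, not documented behavior.

```python
# Sketch only: assumes python-dotenv; llm_interface.py may load the key differently.
import os
from dotenv import load_dotenv

load_dotenv()                                # reads variables from .env in the working directory
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```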
data-scientist-agent/
│
├── main_controller.py      # Main controller implementing the ReAct loop
├── llm_interface.py        # Interface for LLM API calls
├── logging_module.py       # Logging module for tracking Thoughts, Actions, Observations
├── tools/
│   ├── __init__.py         # Package initialization
│   ├── data_ingestion.py   # Functions for reading data
│   ├── data_cleaning.py    # Functions for cleaning data
│   ├── eda.py              # Functions for exploratory data analysis
│   ├── visualization.py    # Functions for generating plots
│   └── reporting.py        # Functions for generating reports
├── logs/                   # Log files (created at runtime)
├── output/                 # Output files (created at runtime)
│   ├── plots/              # Generated plots
│   └── reports/            # Generated reports
└── requirements.txt        # Project dependencies
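To give a feel for what the `tools/` modules contain, here is a hedged sketch of what the helpers in `tools/data_ingestion.py` might look like. The function names and signatures are assumptions for illustration, not the project's actual API.

```python
# Hypothetical shapes for the helpers in tools/data_ingestion.py; the real
# function names and signatures may differ.
import sqlite3
import pandas as pd

def read_excel(path: str, sheet_name=0) -> pd.DataFrame:
    """Load one worksheet of an Excel file into a DataFrame."""
    return pd.read_excel(path, sheet_name=sheet_name)

def read_sql(db_path: str, query: str) -> pd.DataFrame:
    """Run a query against a SQLite database and return the result as a DataFrame."""
    conn = sqlite3.connect(db_path)
    try:
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()
```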
Run the agent with a task description:
python main_controller.py "Analyze sales data from data/sales.xlsx and generate a report with visualizations"
Optional arguments:
- `--api-key`: Specify the API key (alternative to using a `.env` file)
- `--model`: Specify the model to use (default: gpt-4)
- `--max-iterations`: Maximum number of iterations (default: 10)
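For reference, the flags above could be wired up with argparse roughly as follows. This is a sketch of a matching CLI, not the actual contents of `main_controller.py`.

```python
# Sketch of a CLI matching the documented flags; argument handling in
# main_controller.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Run the data scientist agent on a task")
parser.add_argument("task", help="Natural-language description of the analysis task")
parser.add_argument("--api-key", default=None, help="OpenAI API key (overrides the .env file)")
parser.add_argument("--model", default="gpt-4", help="LLM model to use")
parser.add_argument("--max-iterations", type=int, default=10, help="Maximum ReAct iterations")
args = parser.parse_args()
# args.api_key, args.model, and args.max_iterations are then passed to the agent
```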
- Basic data analysis:
  python main_controller.py "Load data/sales.xlsx, clean it, and generate summary statistics"
- Complete analysis with visualizations:
  python main_controller.py "Analyze Q4 sales data from data/sales_q4.xlsx, identify trends, and generate a report with visualizations of the top-selling products"
- SQL database analysis:
  python main_controller.py "Connect to the SQLite database at data/customers.db, analyze customer demographics, and create a PDF report with visualizations"
The agent is designed to be modular and extensible:
- Add new tool functions to the appropriate files in the `tools/` directory
- The main controller will automatically detect and make these functions available to the agent
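One plausible way such auto-detection could work is by inspecting the public functions of each `tools/` module. The sketch below is an assumption about the mechanism, not the controller's actual detection logic.

```python
# Illustrative auto-discovery of tool functions; the controller's real
# detection logic may differ.
import inspect
from tools import data_ingestion, data_cleaning, eda, visualization, reporting

def collect_tools(*modules) -> dict:
    """Map every public function defined in the given modules to its name."""
    tools = {}
    for module in modules:
        for name, func in inspect.getmembers(module, inspect.isfunction):
            # Keep only functions defined in the module itself, skipping private helpers.
            if func.__module__ == module.__name__ and not name.startswith("_"):
                tools[name] = func
    return tools

TOOLS = collect_tools(data_ingestion, data_cleaning, eda, visualization, reporting)
```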
All agent operations are logged with a timestamp, the thought process, the actions taken, and the resulting observations. This provides traceability and helps diagnose issues.
Logs are stored in two formats:
- Text logs in the `logs/` directory
- Structured JSON logs for machine processing
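As an illustration of the dual format, a helper along the following lines would produce both outputs. The file names and the helper itself are assumptions; see `logging_module.py` for the real implementation.

```python
# Sketch of writing each ReAct step to both a text log and a JSON-lines log.
import json
from datetime import datetime
from pathlib import Path

LOG_DIR = Path("logs")
LOG_DIR.mkdir(exist_ok=True)

def log_step(thought: str, action: str, observation: str) -> None:
    entry = {
        "timestamp": datetime.now().isoformat(),
        "thought": thought,
        "action": action,
        "observation": observation,
    }
    # Human-readable text log
    with open(LOG_DIR / "agent.log", "a") as text_log:
        text_log.write(
            f"[{entry['timestamp']}] Thought: {thought}\n"
            f"  Action: {action}\n"
            f"  Observation: {observation}\n"
        )
    # Structured log: one JSON object per line, easy to parse programmatically
    with open(LOG_DIR / "agent.jsonl", "a") as json_log:
        json_log.write(json.dumps(entry) + "\n")
```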