DataTalk is a comprehensive NL2SQL (Natural Language to SQL) system that allows users to upload CSV files and query them using natural language. The system consists of two main components: data ingestion and an intelligent agent for natural language query processing.
The repository is organized into two distinct parts:
- Ingestion System (`ingestion/`, `new_frontend/`, `serve_frontend.py`) - Handles CSV file uploads and data processing
- Agent System (`agent/`) - Provides natural language to SQL query capabilities
The ingestion system handles CSV file upload, processing, and database creation.
- `serve_frontend.py` - Flask web server that serves the frontend interface
- `new_frontend/` - Web interface for file uploads
- `ingestion/` - Core data ingestion logic
- Web Interface: Upload CSV files through a web interface
- Automatic Schema Detection: Intelligently detects column types (dates, numbers, text, etc.; see the sketch after this list)
- PostgreSQL Integration: Creates optimized database schemas
- Batch Processing: Handles large CSV files efficiently
- Data Type Conversion: Converts various formats (percentages, dates, timestamps) to PostgreSQL types
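A simplified sketch of this kind of type inference, assuming pandas and a hypothetical helper name (the actual logic lives in `ingestion/` and handles more formats and edge cases):

```python
# Simplified, hypothetical sketch of column-type inference; the real
# implementation in ingestion/ covers more formats and edge cases.
import pandas as pd

def infer_pg_type(series: pd.Series) -> str:
    sample = series.dropna().astype(str).head(1000)
    if len(sample) and sample.str.endswith("%").all():
        return "NUMERIC"  # percentages like "12.5%" can be stripped and scaled on load
    if pd.to_numeric(sample, errors="coerce").notna().all():
        return "NUMERIC"
    if pd.to_datetime(sample, errors="coerce").notna().all():
        return "TIMESTAMP"
    return "TEXT"
```

A column whose sampled values all parse as numbers maps to `NUMERIC`, all-parseable dates map to `TIMESTAMP`, and everything else falls back to `TEXT`.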
```bash
python serve_frontend.py
```

The web interface will be available at http://localhost:5678.
The agent system provides natural language query capabilities using advanced NL2SQL techniques.
- `agent/datatalk_runner.py` - Main agent runner
- `agent/kraken/` - Core agent implementation
  - `agent.py` - Main agent logic
  - `state.py` - State management
  - `sql_utils.py` - SQL utilities
  - `prompts/` - LLM prompts
- Natural Language Processing: Converts English queries to SQL (see the sketch after this list)
- Context Awareness: Maintains conversation history
- SQL Generation: Creates optimized SQL queries
- Result Explanation: Provides human-readable explanations of results
- Error Handling: Robust error handling and recovery
- Cost Tracking: Monitors LLM usage costs
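The general pattern behind the first two points can be sketched with the OpenAI Python client. This is an illustration only, not DataTalk's actual implementation: the schema, model, and prompt below are assumptions, and the real agent in `agent/kraken/` layers state management, validation, and cost tracking on top.

```python
# Minimal sketch of conversation-aware NL2SQL using the OpenAI client.
# Illustrative only: the table schema and prompt are assumptions; the real
# agent logic lives in agent/kraken/agent.py.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{
    "role": "system",
    "content": "Translate the user's questions into PostgreSQL queries for "
               "the table orders(order_id, region, total, order_date). "
               "Return only SQL."
}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    sql = reply.choices[0].message.content
    history.append({"role": "assistant", "content": sql})  # keep context for follow-ups
    return sql
```

Because the running `history` is resent on every call, a follow-up question like "now only for 2023" can be resolved against the previous one.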
- `agent/llm_config.yaml` - LLM configuration (OpenAI API settings)
- Python 3.8+
- PostgreSQL database
- OpenAI API key (configured in environment or config file)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd datatalk
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt  # if available
  ```

- Configure the database connection and API keys in `agent/llm_config.yaml`. Refer to the chainlite repository (https://github.com/stanford-oval/chainlite) for details.
- Start the ingestion server:

  ```bash
  python serve_frontend.py
  ```

- Upload CSV files through the web interface at http://localhost:5678

- Query your data using the agent:

  ```bash
  python agent/datatalk_runner.py
  ```
- Data Upload: Use the web interface to upload CSV files
- Automatic Processing: The system automatically:
  - Analyzes column types
  - Creates PostgreSQL tables
  - Generates lookup tables
  - Creates database schemas
- Natural Language Queries: Use the agent to ask questions about your data in plain English
- SQL Generation: The agent converts your questions to SQL queries (see the example after this list)
- Results: Get both the SQL query and human-readable explanations
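For example, a question like "What was the average order value per month in 2023?" might produce SQL along these lines (the table and column names here are hypothetical):

```sql
SELECT date_trunc('month', order_date) AS month,
       AVG(order_total)                AS avg_order_value
FROM orders
WHERE order_date >= '2023-01-01'
  AND order_date <  '2024-01-01'
GROUP BY month
ORDER BY month;
```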
```text
datatalk/
├── agent/                     # NL2SQL Agent System
│   ├── datatalk_runner.py     # Main agent entry point
│   ├── kraken/                # Core agent implementation
│   │   ├── agent.py           # Agent logic
│   │   ├── state.py           # State management
│   │   ├── sql_utils.py       # SQL utilities
│   │   ├── utils.py           # Utility functions
│   │   └── prompts/           # LLM prompts
│   └── llm_config.yaml        # LLM configuration
├── ingestion/                 # Data Ingestion System
│   ├── ingestion.py           # Core ingestion logic
│   ├── ingestion_createdb.py  # Database creation
│   ├── ingestion_tools.py     # Ingestion utilities
│   └── from_db/               # Database import tools
├── new_frontend/              # Web Interface
│   ├── templates/             # HTML templates
│   ├── css/                   # Stylesheets
│   ├── js/                    # JavaScript files
│   └── img/                   # Images
└── serve_frontend.py          # Flask web server
```
- Support for large CSV files (up to 10GB)
- Automatic data type detection and conversion
- PostgreSQL schema generation
- Batch processing for performance (sketched after this list)
- Declaration file support for custom schema definitions
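A minimal sketch of chunked loading, assuming pandas and psycopg2 with an illustrative table name, DSN, and chunk size (the real batch logic lives in `ingestion/`):

```python
# Hypothetical sketch of batch-loading a large CSV via COPY; the table name,
# DSN, and chunk size are assumptions, not DataTalk's actual values.
import io

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=datatalk user=creator_role")  # assumed DSN
with conn, conn.cursor() as cur:
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        buf = io.StringIO(chunk.to_csv(index=False, header=False))
        cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", buf)
```

Streaming fixed-size chunks through `COPY` keeps memory use bounded regardless of file size, which is what makes multi-gigabyte CSVs tractable.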
- Convert English questions to SQL queries
- Conversation history and context awareness
- Multiple query types supported
- Result explanation and limitations
- Suggested follow-up questions
- Cost tracking and monitoring (see the sketch after this list)
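As a rough illustration of cost tracking, a per-call cost can be derived from token counts. The prices below are illustrative placeholders, not DataTalk's configuration:

```python
# Hypothetical per-call cost estimate from token usage; model prices change
# over time, so treat these numbers as placeholders (USD per 1M tokens).
PRICE_PER_1M = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICE_PER_1M[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000
```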
- Drag-and-drop file upload
- Real-time processing feedback
- Database management
- File validation and error handling
The system uses PostgreSQL with two user roles (a sample setup is sketched after this list):
- `creator_role` - For creating tables and schemas
- `select_user` - For read-only access
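A hypothetical role setup might look like the following; the privileges, passwords, and database name are assumptions to adapt to your deployment:

```sql
-- Illustrative only: adjust privileges, passwords, and the database name.
CREATE ROLE creator_role LOGIN PASSWORD 'change-me';
GRANT CREATE ON DATABASE datatalk TO creator_role;

CREATE ROLE select_user LOGIN PASSWORD 'change-me';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO select_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO select_user;
```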
Configure OpenAI API settings in `agent/llm_config.yaml`:
- API key and base URL
- Model selection (GPT-4o, GPT-4o-mini)
- Prompt logging settings
The system is designed to be modular and extensible. Key areas for contribution:
- Additional data source connectors
- Enhanced NL2SQL capabilities
- Improved web interface
- Performance optimizations
[License information would go here]