This AI-powered system is designed to extract, validate, structure, and store large volumes of unstructured data efficiently. It uses semantic search with vectorized storage to enable fast and intelligent information retrieval, making it ideal for Retrieval-Augmented Generation (RAG) applications.
- Automated Data Extraction: Extracts data in chunks using intelligent crawling techniques.
- Structured Storage: Stores extracted data in a format enriched with metadata for easy retrieval.
- Semantic Search: Integrates vector search using Supabase to enable context-aware information lookup.
- Data Validation: Ensures consistency and accuracy of extracted data using PydanticAI.
- RAG-Ready: Supports downstream AI tasks like document-based question answering and summarization.
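Chunked extraction like the above can be sketched as a simple sliding-window splitter. The chunk size and overlap values below are illustrative assumptions, not the project's actual defaults:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks suitable for embedding and retrieval."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than the chunk size so adjacent chunks overlap,
        # preserving context across chunk boundaries.
        start += chunk_size - overlap
    return chunks
```

Overlapping windows help a retriever match queries whose answer spans a chunk boundary, at the cost of some duplicated storage.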
- Backend: Python
- Data Extraction: Crawl4AI
- Data Validation & Structuring: PydanticAI
- Storage: Supabase (with vector column)
- AI Integration: Retrieval-Augmented Generation (RAG), Semantic Search
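The validation step can be sketched with a plain Pydantic model (PydanticAI builds on the same models). The `Document` schema and its fields here are hypothetical examples, not the project's actual schema:

```python
from pydantic import BaseModel, Field, ValidationError

class Document(BaseModel):
    # Hypothetical schema for a crawled chunk; the real project defines its own models.
    url: str
    title: str = Field(min_length=1)  # reject empty titles
    content: str
    metadata: dict[str, str] = Field(default_factory=dict)

raw = {"url": "https://example.com", "title": "Example", "content": "..."}
doc = Document.model_validate(raw)  # raises ValidationError on malformed input
```

Running crawler output through a model like this guarantees every stored record has the fields downstream retrieval expects.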
- Python 3.9+
- A Supabase account
- Clone the repository:

  ```bash
  git clone https://github.com/e-d-i-n-i/ai-data-extraction.git
  cd ai-data-extraction
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Windows: venv\Scripts\activate
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your environment variables in a `.env` file:

  ```env
  SUPABASE_URL=your_supabase_url
  SUPABASE_KEY=your_supabase_api_key
  ```

- Run the system:

  ```bash
  python main.py
  ```
- Configure data sources in the system.
- Run the extractor to crawl and fetch unstructured data.
- Validate and structure data using PydanticAI.
- Store structured data in Supabase with vector embeddings.
- Perform semantic search or integrate with RAG pipelines for intelligent applications.
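In production the similarity search runs inside Supabase (e.g. via pgvector), but the core ranking step can be sketched in plain Python. The embeddings below are toy two-dimensional vectors, not real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, rows, top_k=3):
    # rows: (chunk_text, embedding) pairs, as they would come back from the vector column
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A RAG pipeline would embed the user's question, call a search like this to fetch the `top_k` most similar chunks, and pass them to the language model as context.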
We welcome your contributions!
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License.
For questions or suggestions, contact Edini Amare at [edini.amare.gw@gmail.com](mailto:edini.amare.gw@gmail.com) or visit [www.edini.dev](https://www.edini.dev).