This AI-powered system is designed to extract, validate, structure, and store large volumes of unstructured data efficiently. It uses semantic search with vectorized storage to enable fast and intelligent information retrieval, making it ideal for Retrieval-Augmented Generation (RAG) applications.
- Automated Data Extraction: Extracts data in chunks using intelligent crawling techniques.
- Structured Storage: Stores extracted data in a format enriched with metadata for easy retrieval.
- Semantic Search: Integrates vector search using Supabase to enable context-aware information lookup.
- Data Validation: Ensures consistency and accuracy of extracted data using PydanticAI.
- RAG-Ready: Supports downstream AI tasks like document-based question answering and summarization.
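Chunked extraction like the above can be sketched as a simple sliding-window splitter. The chunk size and overlap values below are illustrative assumptions, not the project's actual defaults:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks suitable for embedding and retrieval."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than the chunk size so adjacent chunks overlap,
        # preserving context across chunk boundaries.
        start += chunk_size - overlap
    return chunks
```

Overlapping windows help a retriever match queries whose answer spans a chunk boundary, at the cost of some duplicated storage.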
- Backend: Python
- Data Extraction: Crawl4AI
- Data Validation & Structuring: PydanticAI
- Storage: Supabase (with vector column)
- AI Integration: Retrieval-Augmented Generation (RAG), Semantic Search
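The validation step can be sketched with a plain Pydantic model (PydanticAI builds on the same models). The `Document` schema and its fields here are hypothetical examples, not the project's actual schema:

```python
from pydantic import BaseModel, Field, ValidationError

class Document(BaseModel):
    # Hypothetical schema for a crawled chunk; the real project defines its own models.
    url: str
    title: str = Field(min_length=1)  # reject empty titles
    content: str
    metadata: dict[str, str] = Field(default_factory=dict)

raw = {"url": "https://example.com", "title": "Example", "content": "..."}
doc = Document.model_validate(raw)  # raises ValidationError on malformed input
```

Running crawler output through a model like this guarantees every stored record has the fields downstream retrieval expects.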
- Python 3.9+
- A Supabase account
- Clone the repository:

  ```bash
  git clone https://github.com/e-d-i-n-i/ai-data-extraction.git
  cd ai-data-extraction
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Windows: venv\Scripts\activate
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your environment variables in a `.env` file:

  ```env
  SUPABASE_URL=your_supabase_url
  SUPABASE_KEY=your_supabase_api_key
  ```

- Run the system:

  ```bash
  python main.py
  ```
- Configure data sources in the system.
- Run the extractor to crawl and fetch unstructured data.
- Validate and structure data using PydanticAI.
- Store structured data in Supabase with vector embeddings.
- Perform semantic search or integrate with RAG pipelines for intelligent applications.
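In production the similarity search runs inside Supabase (e.g. via pgvector), but the core ranking step can be sketched in plain Python. The embeddings below are toy two-dimensional vectors, not real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, rows, top_k=3):
    # rows: (chunk_text, embedding) pairs, as they would come back from the vector column
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A RAG pipeline would embed the user's question, call a search like this to fetch the `top_k` most similar chunks, and pass them to the language model as context.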
We welcome your contributions!
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License.
For questions or suggestions, contact Edini Amare at [edini.amare.gw@gmail.com](mailto:edini.amare.gw@gmail.com) or visit [www.edini.dev](https://www.edini.dev).