A powerful documentation crawler and knowledge base system that processes and stores documentation from ethdocker.com and provides advanced semantic search over it.
- 🕷️ Asynchronous web crawling with parallel processing
- 🧠 Semantic chunking with context preservation
- 🔍 Advanced vector search using OpenAI embeddings
- 📚 Version control and document history tracking
- 🔗 Hierarchical document structure with linked chunks
- 🏷️ Automatic keyword extraction and categorization
- ⚡ High-performance PostgreSQL storage with pgvector
- 🔄 Intelligent conflict resolution and version management
- 💬 Interactive Streamlit chat interface with ETHDocker expert
- 🚀 RESTful API endpoint for ETHDocker expert integration
- 📝 Conversation history tracking and management
- Fetches and processes documentation from ethdocker.com
- Implements semantic chunking and versioning
- Handles document storage and updates
- Implements the ETHDocker expert agent
- Provides semantic search and document retrieval
- Features:
- RAG-based document retrieval
- Context-aware responses
- Section hierarchy navigation
- Version history tracking
- Keyword-based filtering
- Tool-based architecture for extensibility
- Interactive web interface for the expert system
- Real-time streaming responses
- Tool usage transparency
- Conversation management
- RESTful API for ETHDocker expert integration
- Features:
- Bearer token authentication
- Conversation history management
- Error handling and logging
- Client information tracking
- Health check endpoint
- CORS support
- Supabase integration for message storage
- Python 3.8+
- PostgreSQL with pgvector extension
- Supabase account (for hosted database)
- OpenAI API key
- Clone the repository and set up a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
- Copy the environment template and fill in your credentials:

```bash
cp .env.example .env
```
- Configure your `.env` file with:

```env
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
API_BEARER_TOKEN=your_api_token
LLM_MODEL=gpt-4-turbo-preview  # or your preferred model
PORT=8000                      # Optional, defaults to 8000
```
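The scripts are expected to read these values at startup. A minimal sketch of that pattern, assuming `python-dotenv` is used (a common choice, not confirmed by this repository):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # pull variables from .env into the process environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]              # required
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4-turbo-preview")  # optional, with default
PORT = int(os.getenv("PORT", "8000"))                      # optional, defaults to 8000
```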
- Set up the database schemas:

```bash
# Using psql or your preferred PostgreSQL client
psql -d your_database -f site_pages.sql
psql -d your_database -f ethdocker_messages.sql
```
Run the crawler to fetch and process documentation:

```bash
python crawl_ethdocker_ai_docs.py
```
The crawler will:
- Fetch URLs from the ethdocker.com sitemap
- Process documents in parallel with controlled concurrency
- Split content into semantic chunks with context preservation
- Generate embeddings and extract metadata
- Store processed content with version control
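To make the chunk-and-embed step concrete, here is a minimal sketch; the chunking heuristic and embedding model are illustrative assumptions, not the crawler's actual implementation:

```python
from typing import List

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Split text on paragraph boundaries, keeping chunks under
    max_chars where paragraph boundaries allow."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def embed(chunk: str) -> List[float]:
    """Generate an embedding vector for one chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model, not confirmed
        input=chunk,
    )
    return response.data[0].embedding
```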
Launch the Streamlit-based chat interface:

```bash
streamlit run streamlit.py
```
Features:
- 🤖 Interactive conversations with ETHDocker expert
- 📚 Real-time access to ETHDocker documentation
- 🔍 Semantic search capabilities
- 🔧 Transparent tool usage with expandable details
- 💾 Conversation history management
- ℹ️ Quick access to key information via sidebar
- 🧹 Clear chat history functionality
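For orientation, the core chat loop in a Streamlit app of this kind typically follows the pattern below; this is a generic sketch, not the actual contents of `streamlit.py`:

```python
import streamlit as st

st.title("ETHDocker Expert")

# Persist conversation history across Streamlit reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask about ETHDocker..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        answer = "..."  # call the ETHDocker expert agent here
        st.markdown(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})
```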
Start the API server:

```bash
python ethdocker_endpoint.py
```
The API will be available at `http://localhost:8000/api/ethdocker-expert`.
Example API request:

```bash
curl -X POST http://localhost:8000/api/ethdocker-expert \
  -H "Authorization: Bearer your_api_token" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the hardware requirements for ETHDocker?",
    "user_id": "user123",
    "request_id": "req123",
    "session_id": "session123"
  }'
```
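The same request can be issued from Python; a small `requests`-based sketch:

```python
import requests

response = requests.post(
    "http://localhost:8000/api/ethdocker-expert",
    headers={"Authorization": "Bearer your_api_token"},
    json={
        "query": "What are the hardware requirements for ETHDocker?",
        "user_id": "user123",
        "request_id": "req123",
        "session_id": "session123",
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```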
Health check endpoint:

```bash
curl http://localhost:8000/api/health
```
- Vector similarity search using pgvector
- Full-text search capabilities
- Document version history
- Hierarchical document structure
- Keyword-based filtering
- Metadata-based querying
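A similarity query against pgvector typically looks like the sketch below; the table and column names are assumed to follow `site_pages.sql` and may differ, and direct Postgres access via `psycopg` is likewise an assumption:

```python
import psycopg  # assumes psycopg 3 and direct database access

query_embedding = embed("hardware requirements")  # see the embedding sketch above

with psycopg.connect("postgresql://user:password@host/db") as conn:
    rows = conn.execute(
        """
        SELECT url, title, content,
               1 - (embedding <=> %s::vector) AS similarity
        FROM site_pages                    -- table name assumed from site_pages.sql
        ORDER BY embedding <=> %s::vector  -- cosine distance via pgvector
        LIMIT 5
        """,
        (str(query_embedding), str(query_embedding)),
    ).fetchall()
```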
- Session-based message storage
- JSON message format
- Timestamp tracking
- User and request tracking
- Client information storage
- Error message handling
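A stored message record plausibly has a shape like the following; the field names are illustrative assumptions and should be checked against `ethdocker_messages.sql`:

```python
message = {
    "session_id": "session123",  # groups messages into one conversation
    "message": {                 # JSON message payload
        "type": "human",         # or "ai" for assistant replies
        "content": "What are the hardware requirements for ETHDocker?",
    },
    "user_id": "user123",        # user tracking
    "request_id": "req123",      # request tracking
    "created_at": "2024-01-01T00:00:00Z",  # timestamp tracking
}
```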
- Crawling: Asynchronous crawling with rate limiting and error handling
- Chunking: Smart text splitting with semantic boundary detection
- Enrichment:
- Title and summary generation using GPT-4
- Keyword extraction
- Section hierarchy tracking
- Embedding generation
- Storage:
- Conflict resolution
- Version management
- Linked chunk references
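Conflict resolution at the storage layer usually comes down to an upsert keyed on chunk identity. A sketch using the Supabase Python client, with assumed column names:

```python
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

record = {
    "url": "https://ethdocker.com/Usage/Prerequisites",
    "chunk_number": 0,          # position within the document (assumed column)
    "title": "Prerequisites",
    "content": "...",
    "embedding": [0.0] * 1536,  # vector from the embedding step
}

# Insert, or update the existing row if this url/chunk pair already exists
supabase.table("site_pages").upsert(record, on_conflict="url,chunk_number").execute()
```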
- Authentication:
- Bearer token validation
- Row-level security in Supabase
- Conversation Management:
- Session-based history
- Message persistence
- Error tracking
- Response Handling:
- Streaming support
- Error recovery
- Client feedback
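The bearer-token check can be expressed as a FastAPI dependency; a minimal sketch, not the exact code in `ethdocker_endpoint.py`:

```python
import os

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
security = HTTPBearer()

def verify_token(
    credentials: HTTPAuthorizationCredentials = Depends(security),
) -> None:
    # Compare against the token configured in .env
    if credentials.credentials != os.environ["API_BEARER_TOKEN"]:
        raise HTTPException(status_code=401, detail="Invalid bearer token")

@app.post("/api/ethdocker-expert", dependencies=[Depends(verify_token)])
async def ethdocker_expert(payload: dict) -> dict:
    # Run the agent, persist the conversation, and return the answer here
    return {"response": "..."}
```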
- Parallel processing with controlled concurrency
- Efficient database indexing
- Caching and retry mechanisms
- Batch operations for better throughput
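Controlled concurrency is most simply implemented with an `asyncio.Semaphore`; a minimal sketch of the pattern (the crawler's actual concurrency limit is not specified here):

```python
import asyncio
from typing import List

async def process_url(url: str) -> None:
    ...  # fetch, chunk, embed, and store one document

async def crawl(urls: List[str], max_concurrent: int = 5) -> None:
    semaphore = asyncio.Semaphore(max_concurrent)  # cap in-flight tasks

    async def bounded(url: str) -> None:
        async with semaphore:
            await process_url(url)

    await asyncio.gather(*(bounded(u) for u in urls))
```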
The system includes:
- Automatic retries with exponential backoff
- Comprehensive logging
- Transaction management
- Conflict resolution
- Failure recovery
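Retries with exponential backoff follow a standard pattern; a minimal sketch with assumed retry parameters:

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Run coro_factory(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            await asyncio.sleep(delay)
```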
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
- OpenAI for embedding and GPT-4 APIs
- Supabase for hosted PostgreSQL
- pgvector for vector similarity search
- Streamlit for the interactive interface
- FastAPI for the REST API endpoint