A powerful documentation crawler and knowledge base system that processes and stores documentation from ethdocker.com and provides advanced semantic search over it.
- 🕷️ Asynchronous web crawling with parallel processing
- 🧠 Semantic chunking with context preservation
- 🔍 Advanced vector search using OpenAI embeddings
- 📚 Version control and document history tracking
- 🔗 Hierarchical document structure with linked chunks
- 🏷️ Automatic keyword extraction and categorization
- ⚡ High-performance PostgreSQL storage with pgvector
- 🔄 Intelligent conflict resolution and version management
- 💬 Interactive Streamlit chat interface with ETHDocker expert
- 🚀 RESTful API endpoint for ETHDocker expert integration
- 📝 Conversation history tracking and management
- Fetches and processes documentation from ethdocker.com
- Implements semantic chunking and versioning
- Handles document storage and updates
- Implements the ETHDocker expert agent
- Provides semantic search and document retrieval
- Features:
- RAG-based document retrieval
- Context-aware responses
- Section hierarchy navigation
- Version history tracking
- Keyword-based filtering
- Tool-based architecture for extensibility
- Interactive web interface for the expert system
- Real-time streaming responses
- Tool usage transparency
- Conversation management
- RESTful API for ETHDocker expert integration
- Features:
- Bearer token authentication
- Conversation history management
- Error handling and logging
- Client information tracking
- Health check endpoint
- CORS support
- Supabase integration for message storage
- Python 3.8+
- PostgreSQL with pgvector extension
- Supabase account (for hosted database)
- OpenAI API key
- Clone the repository and set up a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
- Copy the environment template and fill in your credentials:

```bash
cp .env.example .env
```
- Configure your `.env` file with:

```env
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
API_BEARER_TOKEN=your_api_token
LLM_MODEL=gpt-4-turbo-preview  # or your preferred model
PORT=8000                      # Optional, defaults to 8000
```
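The scripts are expected to read these values at startup. A minimal sketch of that pattern, assuming `python-dotenv` is used (a common choice, not confirmed by this repository):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # pull variables from .env into the process environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]              # required
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4-turbo-preview")  # optional, with default
PORT = int(os.getenv("PORT", "8000"))                      # optional, defaults to 8000
```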
- Set up the database schemas:

```bash
# Using psql or your preferred PostgreSQL client
psql -d your_database -f site_pages.sql
psql -d your_database -f ethdocker_messages.sql
```
Run the crawler to fetch and process documentation:

```bash
python crawl_ethdocker_ai_docs.py
```
The crawler will:
- Fetch URLs from the ethdocker.com sitemap
- Process documents in parallel with controlled concurrency
- Split content into semantic chunks with context preservation
- Generate embeddings and extract metadata
- Store processed content with version control
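To make the chunk-and-embed step concrete, here is a minimal sketch; the chunking heuristic and embedding model are illustrative assumptions, not the crawler's actual implementation:

```python
from typing import List

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Split text on paragraph boundaries, keeping chunks under
    max_chars where paragraph boundaries allow."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def embed(chunk: str) -> List[float]:
    """Generate an embedding vector for one chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model, not confirmed
        input=chunk,
    )
    return response.data[0].embedding
```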
Launch the Streamlit-based chat interface:

```bash
streamlit run streamlit.py
```
Features:
- 🤖 Interactive conversations with ETHDocker expert
- 📚 Real-time access to ETHDocker documentation
- 🔍 Semantic search capabilities
- 🔧 Transparent tool usage with expandable details
- 💾 Conversation history management
- ℹ️ Quick access to key information via sidebar
- 🧹 Clear chat history functionality
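For orientation, the core chat loop in a Streamlit app of this kind typically follows the pattern below; this is a generic sketch, not the actual contents of `streamlit.py`:

```python
import streamlit as st

st.title("ETHDocker Expert")

# Persist conversation history across Streamlit reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask about ETHDocker..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        answer = "..."  # call the ETHDocker expert agent here
        st.markdown(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})
```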
Start the API server:

```bash
python ethdocker_endpoint.py
```
The API will be available at `http://localhost:8000/api/ethdocker-expert`.
Example API request:

```bash
curl -X POST http://localhost:8000/api/ethdocker-expert \
  -H "Authorization: Bearer your_api_token" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the hardware requirements for ETHDocker?",
    "user_id": "user123",
    "request_id": "req123",
    "session_id": "session123"
  }'
```
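The same request can be issued from Python; a small `requests`-based sketch:

```python
import requests

response = requests.post(
    "http://localhost:8000/api/ethdocker-expert",
    headers={"Authorization": "Bearer your_api_token"},
    json={
        "query": "What are the hardware requirements for ETHDocker?",
        "user_id": "user123",
        "request_id": "req123",
        "session_id": "session123",
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```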
Health check endpoint:

```bash
curl http://localhost:8000/api/health
```
- Vector similarity search using pgvector
- Full-text search capabilities
- Document version history
- Hierarchical document structure
- Keyword-based filtering
- Metadata-based querying
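A similarity query against pgvector typically looks like the sketch below; the table and column names are assumed to follow `site_pages.sql` and may differ, and direct Postgres access via `psycopg` is likewise an assumption:

```python
import psycopg  # assumes psycopg 3 and direct database access

query_embedding = embed("hardware requirements")  # see the embedding sketch above

with psycopg.connect("postgresql://user:password@host/db") as conn:
    rows = conn.execute(
        """
        SELECT url, title, content,
               1 - (embedding <=> %s::vector) AS similarity
        FROM site_pages                    -- table name assumed from site_pages.sql
        ORDER BY embedding <=> %s::vector  -- cosine distance via pgvector
        LIMIT 5
        """,
        (str(query_embedding), str(query_embedding)),
    ).fetchall()
```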
- Session-based message storage
- JSON message format
- Timestamp tracking
- User and request tracking
- Client information storage
- Error message handling
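A stored message record plausibly has a shape like the following; the field names are illustrative assumptions and should be checked against `ethdocker_messages.sql`:

```python
message = {
    "session_id": "session123",  # groups messages into one conversation
    "message": {                 # JSON message payload
        "type": "human",         # or "ai" for assistant replies
        "content": "What are the hardware requirements for ETHDocker?",
    },
    "user_id": "user123",        # user tracking
    "request_id": "req123",      # request tracking
    "created_at": "2024-01-01T00:00:00Z",  # timestamp tracking
}
```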
- Crawling: Asynchronous crawling with rate limiting and error handling
- Chunking: Smart text splitting with semantic boundary detection
- Enrichment:
- Title and summary generation using GPT-4
- Keyword extraction
- Section hierarchy tracking
- Embedding generation
- Storage:
- Conflict resolution
- Version management
- Linked chunk references
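Conflict resolution at the storage layer usually comes down to an upsert keyed on chunk identity. A sketch using the Supabase Python client, with assumed column names:

```python
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

record = {
    "url": "https://ethdocker.com/Usage/Prerequisites",
    "chunk_number": 0,          # position within the document (assumed column)
    "title": "Prerequisites",
    "content": "...",
    "embedding": [0.0] * 1536,  # vector from the embedding step
}

# Insert, or update the existing row if this url/chunk pair already exists
supabase.table("site_pages").upsert(record, on_conflict="url,chunk_number").execute()
```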
- Authentication:
- Bearer token validation
- Row-level security in Supabase
- Conversation Management:
- Session-based history
- Message persistence
- Error tracking
- Response Handling:
- Streaming support
- Error recovery
- Client feedback
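The bearer-token check can be expressed as a FastAPI dependency; a minimal sketch, not the exact code in `ethdocker_endpoint.py`:

```python
import os

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
security = HTTPBearer()

def verify_token(
    credentials: HTTPAuthorizationCredentials = Depends(security),
) -> None:
    # Compare against the token configured in .env
    if credentials.credentials != os.environ["API_BEARER_TOKEN"]:
        raise HTTPException(status_code=401, detail="Invalid bearer token")

@app.post("/api/ethdocker-expert", dependencies=[Depends(verify_token)])
async def ethdocker_expert(payload: dict) -> dict:
    # Run the agent, persist the conversation, and return the answer here
    return {"response": "..."}
```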
- Parallel processing with controlled concurrency
- Efficient database indexing
- Caching and retry mechanisms
- Batch operations for better throughput
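Controlled concurrency is most simply implemented with an `asyncio.Semaphore`; a minimal sketch of the pattern (the crawler's actual concurrency limit is not specified here):

```python
import asyncio
from typing import List

async def process_url(url: str) -> None:
    ...  # fetch, chunk, embed, and store one document

async def crawl(urls: List[str], max_concurrent: int = 5) -> None:
    semaphore = asyncio.Semaphore(max_concurrent)  # cap in-flight tasks

    async def bounded(url: str) -> None:
        async with semaphore:
            await process_url(url)

    await asyncio.gather(*(bounded(u) for u in urls))
```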
The system includes:
- Automatic retries with exponential backoff
- Comprehensive logging
- Transaction management
- Conflict resolution
- Failure recovery
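Retries with exponential backoff follow a standard pattern; a minimal sketch with assumed retry parameters:

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Run coro_factory(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            await asyncio.sleep(delay)
```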
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
- OpenAI for embedding and GPT-4 APIs
- Supabase for hosted PostgreSQL
- pgvector for vector similarity search
- Streamlit for the interactive interface
- FastAPI for the REST API endpoint