A complete pipeline for crawling, processing, indexing, and searching content with semantic capabilities.
Playwright Crawler → Content Processor → Elasticsearch + Embeddings → FastAPI Search Service
- JavaScript-rendered content support via Playwright
- Structured content extraction (sections, subsections)
- Media and link tracking
- Rate limiting and robots.txt compliance
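The crawler itself is not reproduced here, but the ideas above (JavaScript rendering via Playwright plus simple rate limiting) can be sketched roughly as follows; the start URL, delay, and output shape are illustrative assumptions, and robots.txt handling is omitted for brevity:

```python
import asyncio
from playwright.async_api import async_playwright

START_URL = "https://example.com"  # placeholder, not the project's actual target
CRAWL_DELAY_SECONDS = 1.0          # simple rate limiting between page loads

async def crawl(urls: list[str]) -> list[dict]:
    results = []
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        for url in urls:
            # Wait for the network to settle so JavaScript-rendered content is present
            await page.goto(url, wait_until="networkidle")
            results.append({
                "url": url,
                "title": await page.title(),
                "html": await page.content(),
            })
            await asyncio.sleep(CRAWL_DELAY_SECONDS)  # be polite between requests
        await browser.close()
    return results

if __name__ == "__main__":
    pages = asyncio.run(crawl([START_URL]))
    print(f"Fetched {len(pages)} page(s)")
```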
- Text extraction and cleaning
- Section hierarchy preservation
- Statistics generation:
- Word and sentence counts
- Internal/external link counts
- Media asset tracking
- Section structure analysis
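As a rough illustration of the statistics above, a processor can derive them directly from the crawled HTML; the use of BeautifulSoup and the exact field names here are assumptions, not the project's actual implementation:

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup  # assumed parser, for illustration only

def compute_stats(html: str, page_url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)

    words = text.split()
    # Very rough sentence split on terminal punctuation
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]

    # Count internal vs. external links relative to the page's host
    page_host = urlparse(page_url).netloc
    internal, external = 0, 0
    for a in soup.find_all("a", href=True):
        host = urlparse(a["href"]).netloc
        if not host or host == page_host:
            internal += 1
        else:
            external += 1

    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "internal_links": internal,
        "external_links": external,
        "media_assets": len(soup.find_all(["img", "video", "audio"])),
        "section_count": len(soup.find_all(["h1", "h2", "h3"])),
    }
```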
- Hybrid storage with Elasticsearch
- Vector embeddings (SBERT all-MiniLM-L6-v2)
- Text chunking for optimal embedding
- Content statistics and metadata
- Automatic backup/restore capabilities
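A minimal sketch of the chunking-and-embedding step, assuming an index named `documents` and a chunk size of roughly 200 words (both placeholders); the real ingestion code may organize fields differently:

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
es = Elasticsearch("http://localhost:9200")

INDEX = "documents"   # placeholder index name
MAX_CHUNKS = 5        # mirrors the ~5-chunks-per-document limit noted under Limitations
CHUNK_WORDS = 200     # illustrative chunk size

def chunk_text(text: str) -> list[str]:
    words = text.split()
    chunks = [" ".join(words[i:i + CHUNK_WORDS]) for i in range(0, len(words), CHUNK_WORDS)]
    return chunks[:MAX_CHUNKS]

def index_document(doc_id: str, title: str, text: str, stats: dict) -> None:
    chunks = chunk_text(text)
    embeddings = model.encode(chunks)  # one 384-dim vector per chunk
    es.index(index=INDEX, id=doc_id, document={
        "title": title,
        "content": text,
        # Document-level vector (mean of chunk vectors) for whole-document similarity
        "embedding": embeddings.mean(axis=0).tolist(),
        "chunks": [
            {"text": c, "embedding": e.tolist()}
            for c, e in zip(chunks, embeddings)
        ],
        "stats": stats,
    })
```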
- Hybrid search combining:
- BM25 text similarity
- Vector similarity (cosine)
- Smart result ranking
- Auto-suggestions
- Performance metrics
- OpenAPI documentation
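One common way to combine BM25 with cosine similarity in Elasticsearch is a `script_score` query over a `dense_vector` field. The sketch below assumes a document-level `embedding` field (mapped as `dense_vector`) like the one in the indexing sketch above; it is not necessarily how the service ranks results:

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
es = Elasticsearch("http://localhost:9200")

def hybrid_search(query: str, size: int = 10) -> list[dict]:
    query_vector = model.encode(query).tolist()
    resp = es.search(
        index="documents",  # placeholder index name
        size=size,
        query={
            "script_score": {
                # BM25 relevance on title and content, title boosted
                "query": {"multi_match": {"query": query, "fields": ["title^2", "content"]}},
                "script": {
                    # Add cosine similarity; it ranges [-1, 1], so +1.0 keeps scores positive
                    "source": "_score + cosineSimilarity(params.qv, 'embedding') + 1.0",
                    "params": {"qv": query_vector},
                },
            }
        },
    )
    return [hit["_source"] | {"_score": hit["_score"]} for hit in resp["hits"]["hits"]]
```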
- Python 3.12+
- Docker and Docker Compose
- ~1GB free memory
- Poetry for dependency management
# Clone the repository
git clone https://github.com/etvincen/boredapi
cd boredapi
# Configure environment
cp .env.example .env
# Edit .env with your settings
Follow the Docker installation instructions for your platform
# Download and install Poetry
curl -sSL https://install.python-poetry.org | python3 -
# Configure Poetry
poetry config virtualenvs.path              # shows the current virtualenv location
poetry config virtualenvs.in-project true   # create the virtual environment inside the project
# Install dependencies and generate lock file
poetry install
# Activate the virtual environment
source .venv/bin/activate
# Install Playwright browser
poetry run playwright install chromium
# 1. Start Elasticsearch and Kibana
docker-compose -f docker/docker-compose.dev.yml up -d elasticsearch kibana
# 2. Run the crawler
poetry run python -m src.cli.crawler_cli
# 3. Ingest data into Elasticsearch
poetry run python -m src.cli.ingest_data
# 4. Start the API service locally
poetry run uvicorn src.api.main:app --host 0.0.0.0 --port 8000 --reload
- Check Elasticsearch: http://localhost:9200
- Check Kibana: http://localhost:5601
- Check API docs: http://localhost:8000/docs
The API runs at http://localhost:8000 with these endpoints:
curl "http://localhost:8000/search?q=Comment%20pr%C3%A9parer%20ses%20obs%C3%A8ques&size=3"
Parameters:
- `q`: Search query (required)
- `size`: Number of results (default: 10)
- `min_score`: Minimum score threshold
- `include_stats`: Include document statistics
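The same search can be issued from Python; the use of `requests` and the response shape shown here are assumptions for illustration:

```python
import requests

resp = requests.get(
    "http://localhost:8000/search",
    params={
        "q": "Comment préparer ses obsèques",
        "size": 3,
        "min_score": 0.5,       # optional score threshold
        "include_stats": True,  # include document statistics
    },
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json().get("results", []):  # response shape is an assumption
    print(hit)
```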
curl "http://localhost:8000/suggest?q=obs&size=5"
- OpenAPI docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Elasticsearch: 400MB memory
- Search API: 384MB memory
- Kibana: 384MB memory
- Crawling: ~1-2 pages/second
- Indexing: ~5-10 documents/second
- Search latency: ~200-500ms
- Tested with:
- ~40 pages
- ~5000-8000 words per page
- ~200KB average document size
# Test embeddings
poetry run python src/nlp/test_embeddings.py
# Test search
poetry run python src/nlp/test_vector_search.py
- Elasticsearch: http://localhost:9200
- Kibana: http://localhost:5601
Current known limitations:
- Single-node Elasticsearch (development setup)
- Limited to ~5 chunks per document for embeddings
- Basic auto-suggestions (title-based only)
- No authentication on the search API
- Memory-optimized for small-medium content sets
Potential enhancements:
- Multi-node Elasticsearch setup
- Advanced query understanding
- Search result highlighting
- User feedback integration
- Authentication and rate limiting
- Content type facets and filters