A robust API for extracting structured event information from web pages using Google's Gemini LLM technology.
This API extracts event information (title, description, date/time, location) from web pages by combining web scraping with Google's Gemini LLM. It handles both static websites and JavaScript-rendered pages.
- Flexible Content Fetching
  - Basic HTTP content fetching for static sites
  - Advanced browser-based scraping using Playwright for dynamic content
  - Configurable timeouts and retry mechanisms
- Intelligent Processing
  - Smart HTML preprocessing and cleaning
  - Markdown conversion for better LLM processing
  - Google Gemini LLM integration for accurate information extraction
- Production-Ready Architecture
  - FastAPI-based REST API with automatic validation
  - Comprehensive error handling and logging
  - Rate limiting and security features
  - Docker containerization with multi-stage builds
  - Health checks and monitoring
  - Configurable via environment variables
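To make the HTML preprocessing step concrete, here is a minimal sketch of one way to strip non-content markup before handing text to an LLM. It uses only the standard library; the names `TextExtractor` and `clean_html` are hypothetical and are not taken from this project's code, which may clean pages differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips script/style/noscript content."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace so the LLM sees compact text.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `clean_html(page_source)` returns one line of text per text node, which is usually a much smaller prompt than the raw markup.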
.
├── app/ # Main application code
│ ├── api/ # FastAPI routes and models
│ │ ├── models.py # Pydantic models for validation
│ │ └── routes.py # API endpoint definitions
│ ├── core/ # Core business logic
│ │ ├── fetchers.py # Content fetching strategies
│ │ ├── llm.py # Gemini LLM integration
│ │ └── parser.py # Response parsing and validation
│ └── main.py # Application entry point
├── tests/ # Test suite
│ ├── conftest.py # Test configuration
│ └── test_*.py # Test modules
├── docker/ # Docker configuration
├── scripts/ # Utility scripts
├── .env.example # Example environment variables
├── .env.prod.example # Production environment template
├── Dockerfile # Development container
├── Dockerfile.prod # Production container
├── docker-compose.yml # Container orchestration
├── requirements.txt # Production dependencies
└── requirements-dev.txt # Development dependencies
- Python 3.9+
- Docker and Docker Compose
- Google Gemini API key
- Clone and set up:
git clone https://github.com/yourusername/cohere-event-scraper.git
cd cohere-event-scraper
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
pip install -r requirements-dev.txt # For development
- Configure environment:
cp .env.example .env
# Edit .env with your settings
- Run development server:
uvicorn app.main:app --reload
# Start development environment
docker-compose up app-dev
# Run tests
docker-compose run --rm app-dev pytest
# Code quality checks
docker-compose run --rm app-dev pre-commit run --all-files
- Configure production settings:
cp .env.prod.example .env.prod
# Edit .env.prod with production values
- Deploy with Docker:
docker-compose -f docker-compose.yml up -d app-prod
Extract event information from a webpage.
{
"url": "https://example.com/event-page",
"gemini_api_key": "your_api_key",
"use_playwright": false,
"custom_instructions": "Optional instructions for the LLM",
"timeout": 30000
}
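As a rough illustration of sending the request body above from Python, the sketch below builds the POST request with the standard library. The base URL and the `/extract` path are placeholders assumed for this example; substitute the route actually defined in `app/api/routes.py`. The helper name `build_extract_request` is hypothetical.

```python
import json
import urllib.request


def build_extract_request(base_url, page_url, gemini_api_key,
                          use_playwright=False, custom_instructions=None,
                          timeout=30000):
    """Build (but do not send) a POST request carrying the JSON body.

    Assumption: the endpoint lives at <base_url>/extract. This path is a
    placeholder for illustration only.
    """
    payload = {
        "url": page_url,
        "gemini_api_key": gemini_api_key,
        "use_playwright": use_playwright,
        "timeout": timeout,
    }
    if custom_instructions:
        payload["custom_instructions"] = custom_instructions
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen(...)` to perform the call against a running server.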
{
"title": "Sample Event Title",
"description": "Detailed event description...",
"start_datetime": "2024-07-15T10:00:00Z",
"end_datetime": "2024-07-15T12:00:00Z",
"location": "123 Main St, Example City",
"metadata": {
"confidence_score": 0.95,
"extraction_method": "gemini_llm",
"processing_time_ms": 1234
}
}
{
"error": "error_type",
"message": "Human-readable error message",
"details": {
"technical_details": "Additional error context",
"request_id": "unique_request_identifier"
}
}
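A client has to tell the success shape from the error shape shown above. The following is a minimal sketch of one way to do that; `EventResult`, `ExtractionError`, and `parse_response` are hypothetical names for this example, and the field handling assumes the two JSON layouts exactly as documented here.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EventResult:
    title: str
    description: str
    start: Optional[datetime]
    end: Optional[datetime]
    location: str
    confidence: Optional[float]


class ExtractionError(Exception):
    """Raised when the body follows the documented error shape."""

    def __init__(self, error, message, request_id=None):
        super().__init__(f"{error}: {message}")
        self.error = error
        self.request_id = request_id


def _parse_dt(value):
    if not value:
        return None
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_response(body: str) -> EventResult:
    data = json.loads(body)
    if "error" in data:
        details = data.get("details") or {}
        raise ExtractionError(data["error"], data.get("message", ""),
                              details.get("request_id"))
    metadata = data.get("metadata") or {}
    return EventResult(
        title=data.get("title", ""),
        description=data.get("description", ""),
        start=_parse_dt(data.get("start_datetime")),
        end=_parse_dt(data.get("end_datetime")),
        location=data.get("location", ""),
        confidence=metadata.get("confidence_score"),
    )
```

Keeping `request_id` on the exception makes it easier to correlate a client-side failure with the server logs.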
Status codes:
- 400: Bad Request (invalid input)
- 401: Unauthorized (invalid API key)
- 422: Validation Error (malformed request)
- 429: Rate Limit Exceeded
- 500: Internal Server Error
- 503: Service Unavailable
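Of the status codes listed, 429 and 503 are the ones a client can reasonably retry. The sketch below shows one possible client-side approach using exponential backoff with jitter. The function name, the default delays, and the choice of which codes to retry are assumptions for illustration, not documented behavior of this API.

```python
import random
import time

# Assumption: only these statuses are treated as transient.
RETRYABLE_STATUSES = {429, 503}


def call_with_retries(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status.

    send is any zero-argument callable returning (status_code, body).
    The last attempt's result is returned even if it is still retryable.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
            return status, body
        # Exponential backoff: 0.5s, 1s, 2s, ... plus up to 50% jitter.
        delay = base_delay * (2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.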
- Store API keys in environment variables
- Rotate keys regularly
- Use separate keys for development/production
- Monitor key usage for suspicious activity
- Per-client rate limits
- Configurable limits and windows
- IP-based and API key-based limiting
- Automatic blocking of abusive clients
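As an illustration of per-client limits with a configurable limit and window, here is a minimal in-memory sliding-window sketch. The class name is hypothetical and this is not the project's actual limiter; a multi-worker deployment would typically need shared storage instead of a per-process dictionary.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

The `client_id` can be an IP address or an API key, which covers both limiting modes mentioned above; a request for which `allow` returns `False` would map to a 429 response.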
- Strict URL validation and sanitization
- Content size limits
- Content type verification
- Request payload validation
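Strict URL validation matters for a scraper because the server fetches whatever URL the client supplies. The sketch below shows one possible set of checks using the standard library. The function name, the length limit, and the exact rejection rules are assumptions for this example. It inspects only the literal host, so a hostname that resolves to a private address would still pass and needs a separate check at fetch time.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def validate_target_url(url: str, max_length: int = 2048) -> str:
    """Return the normalized URL or raise ValueError if it looks unsafe."""
    url = url.strip()
    if len(url) > max_length:
        raise ValueError("URL is too long")
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("only http and https URLs are allowed")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost":
        raise ValueError("localhost is not allowed")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a hostname rather than a literal IP address
    if ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local):
        raise ValueError("private or loopback addresses are not allowed")
    return parts.geturl()
```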
- CORS configuration
- HTTPS enforcement
- Content Security Policy
- XSS Protection
- HSTS configuration
- Sanitized error messages
- No sensitive data in responses
- Detailed internal logging
- Request tracing
- Endpoint: /health
- Checks:
  - API availability
  - Dependencies status
  - Resource usage
  - Response times
- Prometheus endpoint: :9090/metrics
- Key metrics:
  - Request rates and latencies
  - Error rates
  - Resource utilization
  - Cache hit rates
- Structured JSON logging
- Configurable log levels
- Request/response logging
- Error tracking integration
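Structured JSON logging with a configurable level can be sketched with the standard `logging` module alone. The formatter class, the logger name `event_scraper`, and the `request_id` field are assumptions for this example; the project's real logging setup may use different names or a third-party library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # request_id is only present when passed via extra={...}.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(level="INFO", stream=None):
    """Hypothetical setup helper; stream defaults to stderr."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("event_scraper")
    logger.handlers = [handler]
    logger.setLevel(level.upper())
    logger.propagate = False
    return logger
```

A call such as `logger.info("fetched page", extra={"request_id": "abc123"})` then emits one JSON line per event, which makes request tracing a matter of filtering on `request_id`.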
- Rate Limiting
Issue: "Rate limit exceeded"
Solutions:
- Check current rate limits in configuration
- Implement request batching
- Consider upgrading limits
- Content Extraction
Issue: "Failed to extract content"
Solutions:
- Verify URL accessibility
- Check JavaScript rendering requirements
- Adjust timeout settings
- Validate HTML structure
- LLM Processing
Issue: "LLM processing failed"
Solutions:
- Verify API key validity
- Check input content size
- Review content format
- Adjust retry settings
- Response Times
- Enable caching
- Optimize content fetching
- Configure timeouts
- Use connection pooling
- Resource Usage
- Monitor memory usage
- Adjust worker counts
- Configure resource limits
- Implement request queuing
- Error Rates
- Implement circuit breakers
- Add retry mechanisms
- Monitor error patterns
- Adjust validation rules
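For the circuit-breaker suggestion above, the following is a minimal sketch of the pattern: after a number of consecutive failures, calls are skipped for a cooldown period, and then a single trial call decides whether to resume. The class names, thresholds, and timeout are assumptions for illustration and are not part of this project.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens it.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping outbound calls (for example the page fetch or the LLM request) in `breaker.call(...)` keeps a failing dependency from tying up workers with requests that are likely to fail.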
Please see CONTRIBUTING.md for guidelines on how to contribute to this project.
This project is licensed under the MIT License - see the LICENSE file for details.
This API is configured for easy deployment to Render using Docker. Follow these steps:
- Fork or clone this repository to your GitHub account
- Create a new Web Service on Render
- Connect your GitHub repository
- Select "Docker" as the environment
- Choose the "starter" (free) or "standard" plan
- Set the following environment variables in the Render dashboard:
  - GEMINI_API_KEY: Your Google Gemini API key
  - Other environment variables are set automatically via render.yaml
The service will automatically:
- Build using the production Dockerfile
- Run health checks at /health
- Scale with multiple workers
- Handle HTTPS and domain configuration
- Rate Limiting: The API includes built-in rate limiting (100 requests per minute by default)
- Security:
  - All endpoints use HTTPS
  - CORS is configured (update ALLOWED_ORIGINS for your domains)
  - Security headers are enabled
  - API key authentication can be enabled via the X-API-Key header
- Performance:
  - Uses multiple worker processes
  - Caches Playwright browser instances
  - Implements retry logic for failed requests
  - Optimized Docker image size
- Monitoring:
  - Health check endpoint at /health
  - Prometheus metrics at /metrics
  - Detailed logging with configurable levels
  - Request tracing for debugging
- Scaling:
  - Stateless design allows horizontal scaling
  - Configure workers count based on CPU cores
  - Adjust rate limits as needed
  - Monitor resource usage through Render dashboard
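For sizing the worker count from CPU cores, a common starting point is the heuristic from the Gunicorn documentation, (2 x cores) + 1. The sketch below applies it with an optional cap; the function name and the cap parameter are assumptions for this example. Because each Playwright browser instance uses significant memory, the right value on a small plan may be well below what the formula suggests.

```python
import os


def recommended_workers(cpu_count=None, max_workers=None):
    """Return (2 * cores) + 1, optionally capped at max_workers."""
    cores = cpu_count or os.cpu_count() or 1
    workers = 2 * cores + 1
    if max_workers is not None:
        workers = min(workers, max_workers)
    return max(1, workers)
```

The result can be passed to the server's worker setting, for example through uvicorn's `--workers` option.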
Common deployment issues and solutions:
- Playwright Issues:
  - Ensure browser dependencies are installed
  - Check browser cache permissions
  - Verify timeout settings
- Memory Usage:
  - Monitor worker memory consumption
  - Adjust worker count if needed
  - Consider upgrading plan for more resources
- Rate Limiting:
  - Check logs for rate limit errors
  - Adjust limits in environment variables
  - Implement client-side retry logic
- API Integration:
  - Verify CORS settings
  - Check API key configuration
  - Monitor request/response patterns
For more detailed logs and metrics, check the Render dashboard or enable debug logging by setting LOG_LEVEL=debug.