Cohere Event Scraper API

An API that extracts structured event information from web pages using Google's Gemini LLM.

Overview

This API extracts event information (title, description, date/time, location) from web pages by combining web scraping with Google's Gemini LLM. It handles both static websites and JavaScript-rendered pages.

Key Features

Flexible Content Fetching
- Basic HTTP content fetching for static sites
- Advanced browser-based scraping using Playwright for dynamic content
- Configurable timeouts and retry mechanisms

Intelligent Processing
- Smart HTML preprocessing and cleaning
- Markdown conversion for better LLM processing
- Google Gemini LLM integration for accurate information extraction

Production-Ready Architecture
- FastAPI-based REST API with automatic validation
- Comprehensive error handling and logging
- Rate limiting and security features
- Docker containerization with multi-stage builds
- Health checks and monitoring
- Configurable via environment variables
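To make the "HTML preprocessing and cleaning" feature concrete, here is a minimal sketch of the general idea using only the Python standard library. It is not the project's actual implementation in app/core; the names TextExtractor and clean_html, and the choice of skipped tags, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips tags that rarely carry event details."""

    # Assumption: these tags are treated as noise for event extraction.
    SKIPPED_TAGS = {"script", "style", "noscript", "svg"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self) -> str:
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Reducing the page to its visible text before calling the LLM keeps the prompt small; a real pipeline would likely also preserve headings and links during Markdown conversion.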

Project Structure

.
├── app/                  # Main application code
│   ├── api/             # FastAPI routes and models
│   │   ├── models.py    # Pydantic models for validation
│   │   └── routes.py    # API endpoint definitions
│   ├── core/            # Core business logic
│   │   ├── fetchers.py  # Content fetching strategies
│   │   ├── llm.py      # Gemini LLM integration
│   │   └── parser.py    # Response parsing and validation
│   └── main.py         # Application entry point
├── tests/              # Test suite
│   ├── conftest.py    # Test configuration
│   └── test_*.py      # Test modules
├── docker/            # Docker configuration
├── scripts/          # Utility scripts
├── .env.example      # Example environment variables
├── .env.prod.example # Production environment template
├── Dockerfile        # Development container
├── Dockerfile.prod   # Production container
├── docker-compose.yml # Container orchestration
├── requirements.txt   # Production dependencies
└── requirements-dev.txt # Development dependencies

Quick Start

Prerequisites

Python 3.9+
Docker and Docker Compose
Google Gemini API key

Local Development

Clone and setup:

git clone https://github.com/yourusername/cohere-event-scraper.git
cd cohere-event-scraper
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt
pip install -r requirements-dev.txt  # For development

Configure environment:

cp .env.example .env
# Edit .env with your settings

Run development server:

uvicorn app.main:app --reload

Docker Development

# Start development environment
docker-compose up app-dev

# Run tests
docker-compose run --rm app-dev pytest

# Code quality checks
docker-compose run --rm app-dev pre-commit run --all-files

Production Deployment

Configure production settings:

cp .env.prod.example .env.prod
# Edit .env.prod with production values

Deploy with Docker:

docker-compose -f docker-compose.yml up -d app-prod

API Reference

POST /api/scrape

Extract event information from a webpage.

Request

{
  "url": "https://example.com/event-page",
  "gemini_api_key": "your_api_key",
  "use_playwright": false,
  "custom_instructions": "Optional instructions for the LLM",
  "timeout": 30000
}
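A client for this endpoint might look like the sketch below, which uses only the standard library. The base URL assumes the development server from Quick Start on uvicorn's default port 8000. The function names are hypothetical, and the sketch treats timeout as milliseconds, which the value 30000 suggests but the request schema does not state.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local uvicorn default port


def build_scrape_payload(url, gemini_api_key, use_playwright=False,
                         custom_instructions=None, timeout=30000):
    """Build the JSON body for POST /api/scrape."""
    payload = {
        "url": url,
        "gemini_api_key": gemini_api_key,
        "use_playwright": use_playwright,
        "timeout": timeout,
    }
    # custom_instructions is optional, so send it only when provided.
    if custom_instructions:
        payload["custom_instructions"] = custom_instructions
    return payload


def scrape_event(payload, base_url=BASE_URL):
    """Send the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/api/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Assumption: payload timeout is in milliseconds; allow a 10 s margin
    # so the client does not give up before the server does.
    client_timeout = payload["timeout"] / 1000 + 10
    with urllib.request.urlopen(request, timeout=client_timeout) as response:
        return json.load(response)
```

Usage: scrape_event(build_scrape_payload("https://example.com/event-page", "your_api_key")). Set use_playwright=True for pages that render their content with JavaScript.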

Response

{
  "title": "Sample Event Title",
  "description": "Detailed event description...",
  "start_datetime": "2024-07-15T10:00:00Z",
  "end_datetime": "2024-07-15T12:00:00Z",
  "location": "123 Main St, Example City",
  "metadata": {
    "confidence_score": 0.95,
    "extraction_method": "gemini_llm",
    "processing_time_ms": 1234
  }
}
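On the client side, the response above can be turned into typed values. The following is a sketch, not part of the project: the Event dataclass and parse_event are hypothetical names, and it assumes every field other than title and start_datetime may be absent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Event:
    title: str
    description: str
    start: datetime
    end: Optional[datetime]
    location: str
    confidence: Optional[float]


def parse_iso_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-07-15T10:00:00Z."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def parse_event(body: dict) -> Event:
    """Convert the decoded JSON response into an Event."""
    end = body.get("end_datetime")
    return Event(
        title=body["title"],
        description=body.get("description", ""),
        start=parse_iso_utc(body["start_datetime"]),
        end=parse_iso_utc(end) if end else None,
        location=body.get("location", ""),
        confidence=body.get("metadata", {}).get("confidence_score"),
    )
```

Parsing into timezone-aware datetime objects makes it easy to compute durations or convert to a local timezone before display.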

Error Responses

{
  "error": "error_type",
  "message": "Human-readable error message",
  "details": {
    "technical_details": "Additional error context",
    "request_id": "unique_request_identifier"
  }
}

Status codes:

400: Bad Request (invalid input)
401: Unauthorized (invalid API key)
422: Validation Error (malformed request)
429: Rate Limit Exceeded
500: Internal Server Error
503: Service Unavailable
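Of the status codes above, 429 and 503 usually indicate a temporary condition, so a client may retry them with exponential backoff. The sketch below illustrates that policy. It is an assumption about sensible client behavior, not something the API mandates, and the names and default delays are hypothetical.

```python
import time

# Assumption: only these statuses are worth retrying; 4xx input errors are not.
RETRYABLE_STATUS = {429, 503}


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay in seconds before the next try: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(send, max_attempts=4, sleep=time.sleep):
    """Call send() until it returns a non-retryable status.

    send is any callable returning a (status_code, body) tuple.
    """
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUS or last_attempt:
            return status, body
        sleep(backoff_delay(attempt))
```

Passing sleep as a parameter keeps the function easy to test without real waiting.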

Security Best Practices

API Key Management

Store API keys in environment variables
Rotate keys regularly
Use separate keys for development/production
Monitor key usage for suspicious activity

Rate Limiting

Per-client rate limits
Configurable limits and windows
IP-based and API key-based limiting
Automatic blocking of abusive clients
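Per-client limiting with a configurable limit and window can be sketched as a sliding-window counter. This is an illustration of the technique, not the project's limiter. The class name is hypothetical, and the default of 100 requests per 60 seconds mirrors the default mentioned under Production Considerations.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` requests per client within a rolling window."""

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The client_id can be an IP address or an API key, which covers both limiting modes listed above. This in-memory version only works within a single worker process; multiple workers would need a shared store.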

Input Validation

Strict URL validation and sanitization
Content size limits
Content type verification
Request payload validation
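Strict URL validation matters for a scraper because the server fetches whatever URL the caller supplies. A minimal sketch of such a check follows. The function name and the exact rules are assumptions; the project's own validation may differ.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def validate_target_url(url: str) -> str:
    """Return the URL if it looks safe to fetch, otherwise raise ValueError."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("only http and https URLs are allowed")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost":
        raise ValueError("local addresses are not allowed")
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # The host is a name rather than an IP literal; accept it.
        return parts.geturl()
    if address.is_private or address.is_loopback or address.is_link_local:
        raise ValueError("private or local addresses are not allowed")
    return parts.geturl()
```

This check only inspects the URL text. A hostname that resolves to a private address would still pass, so a hardened deployment would also verify the resolved IP before connecting.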

Security Headers

CORS configuration
HTTPS enforcement
Content Security Policy
XSS Protection
HSTS configuration

Error Handling

Sanitized error messages
No sensitive data in responses
Detailed internal logging
Request tracing

Monitoring

Health Checks

Endpoint: /health
Checks:
- API availability
- Dependencies status
- Resource usage
- Response times

Metrics

Prometheus endpoint: :9090/metrics
Key metrics:
- Request rates and latencies
- Error rates
- Resource utilization
- Cache hit rates

Logging

Structured JSON logging
Configurable log levels
Request/response logging
Error tracking integration
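Structured JSON logging with a configurable level can be achieved with the standard logging module alone. The sketch below shows one way to do it. The JsonFormatter class and the field names are assumptions, not the project's actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumption: callers attach a request ID via logging's `extra` dict.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(level="INFO"):
    """Route all log output through the JSON formatter at the given level."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level.upper())
```

Usage: call configure_logging("debug") once at startup, then log with logging.getLogger("app").info("scraped %s", url, extra={"request_id": "abc123"}).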

Troubleshooting Guide

Common Issues

Rate Limiting

Issue: "Rate limit exceeded"
Solution:
- Check current rate limits in configuration
- Implement request batching
- Consider raising the configured limits

Content Extraction

Issue: "Failed to extract content"
Solutions:
- Verify URL accessibility
- Check JavaScript rendering requirements
- Adjust timeout settings
- Validate HTML structure

LLM Processing

Issue: "LLM processing failed"
Solutions:
- Verify API key validity
- Check input content size
- Review content format
- Adjust retry settings

Performance Optimization

Response Times

Enable caching
Optimize content fetching
Configure timeouts
Use connection pooling

Resource Usage

Monitor memory usage
Adjust worker counts
Configure resource limits
Implement request queuing

Error Rates

Implement circuit breakers
Add retry mechanisms
Monitor error patterns
Adjust validation rules
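A circuit breaker stops calling a failing dependency (for example the LLM API or a target website) for a cool-down period instead of retrying on every request. The following is a minimal sketch of the pattern. The class, its thresholds, and its method names are hypothetical and not taken from the project.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a trial call through.
        return self.clock() - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self.clock()
```

Callers check allow_request() before each outbound call and report the outcome with record_success() or record_failure(). When the breaker is open, the API can respond with 503 immediately.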

Contributing

Please see CONTRIBUTING.md for guidelines on how to contribute to this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Deployment

Deploying to Render

This API is configured for easy deployment to Render using Docker. Follow these steps:

Fork or clone this repository to your GitHub account
Create a new Web Service on Render
Connect your GitHub repository
Select "Docker" as the environment
Choose the "starter" (free) or "standard" plan
Set the following environment variables in the Render dashboard:
- GEMINI_API_KEY: Your Google Gemini API key
- Other environment variables are automatically set via render.yaml

The service will automatically:

Build using the production Dockerfile
Run health checks at /health
Scale with multiple workers
Handle HTTPS and domain configuration

Production Considerations

Rate Limiting: The API includes built-in rate limiting (100 requests per minute by default)
Security:
- All endpoints use HTTPS
- CORS is configured (update ALLOWED_ORIGINS for your domains)
- Security headers are enabled
- API key authentication can be enabled via X-API-Key header
Performance:
- Uses multiple worker processes
- Caches Playwright browser instances
- Implements retry logic for failed requests
- Optimized Docker image size
Monitoring:
- Health check endpoint at /health
- Prometheus metrics at /metrics
- Detailed logging with configurable levels
- Request tracing for debugging
Scaling:
- Stateless design allows horizontal scaling
- Configure the worker count based on available CPU cores
- Adjust rate limits as needed
- Monitor resource usage through Render dashboard
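The settings mentioned in this README (GEMINI_API_KEY, ALLOWED_ORIGINS, LOG_LEVEL) could be loaded from the environment roughly as follows. This is a sketch under assumptions: RATE_LIMIT_PER_MINUTE is a hypothetical variable name for the 100 requests per minute default, ALLOWED_ORIGINS is assumed to be a comma-separated list, and the real names and formats are defined in .env.example and render.yaml.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    allowed_origins: tuple
    log_level: str
    rate_limit_per_minute: int


def load_settings(env=None):
    """Read settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    # Assumption: origins are comma-separated, e.g. "https://a.com,https://b.com".
    origins = tuple(
        origin.strip()
        for origin in env.get("ALLOWED_ORIGINS", "").split(",")
        if origin.strip()
    )
    return Settings(
        gemini_api_key=api_key,
        allowed_origins=origins,
        log_level=env.get("LOG_LEVEL", "info").lower(),
        # Hypothetical variable name; the default matches the documented limit.
        rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "100")),
    )
```

Failing fast when GEMINI_API_KEY is missing surfaces a misconfigured deployment at startup instead of on the first request.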

Troubleshooting

Common deployment issues and solutions:

Playwright Issues:
- Ensure browser dependencies are installed
- Check browser cache permissions
- Verify timeout settings
Memory Usage:
- Monitor worker memory consumption
- Adjust worker count if needed
- Consider upgrading plan for more resources
Rate Limiting:
- Check logs for rate limit errors
- Adjust limits in environment variables
- Implement client-side retry logic
API Integration:
- Verify CORS settings
- Check API key configuration
- Monitor request/response patterns

For more detailed logs and metrics, check the Render dashboard or enable debug logging by setting LOG_LEVEL=debug.
