A Zig-based RSS feed parser with local caching and HTML content extraction for the Munich bike reporting platform.
- Fetches RSS feed from https://meldeplattform-rad.muenchenunterwegs.de/bms/rss
- Parses RSS XML to extract item metadata
- Parallel processing with configurable worker threads (4 workers by default)
- Fetches individual HTML pages for each RSS item concurrently
- Local file-based caching with 7-day (168-hour) expiration
- Extracts meaningful content from HTML pages
- Image extraction and base64 encoding from .bms-attachments sections
- Outputs structured JSON data with embedded images
- Retry logic with exponential backoff for network requests
- Memory-safe implementation using Zig's explicit memory management
- Performance optimizations with thread-safe operations and load balancing
├── build.zig            # Zig build configuration
├── src/
│   ├── main.zig         # Entry point and orchestration
│   ├── rss_parser.zig   # RSS XML parsing
│   ├── http_client.zig  # HTTP client with retry logic
│   ├── cache.zig        # File-based caching system
│   ├── item_fetcher.zig # HTML content fetching and parsing
│   └── json_output.zig  # JSON serialization
└── cache/               # Cache directory (created at runtime)
- Zig 0.11.0 or later
zig build
Sequential version (original):
zig build run
# or
./zig-out/bin/rss-cache-parser [OPTIONS]
Parallel version (optimized):
zig build run-parallel
# or
./zig-out/bin/rss-cache-parser-parallel [OPTIONS]
Both versions support the following command-line options:
# Show help
./zig-out/bin/rss-cache-parser-parallel --help
./zig-out/bin/rss-cache-parser-parallel -h
# Save JSON output to file
./zig-out/bin/rss-cache-parser-parallel --output munich_reports.json
./zig-out/bin/rss-cache-parser-parallel -o data.json
# Default behavior (output to stdout)
./zig-out/bin/rss-cache-parser-parallel
To benchmark the parser, run the included script:
python3 benchmark.py
- RSS Parsing: Fetches and parses the RSS feed to extract item URLs and metadata
- Caching Strategy:
  - Uses SHA-256 hash of the URL as the cache filename
  - Stores cached data as JSON with a timestamp
  - Checks file modification time against the 7-day expiration
- HTML Processing:
  - Fetches individual item HTML pages
  - Extracts meaningful content (meta description, title, paragraphs)
  - Extracts images from the .bms-attachments CSS class (see the sketch after this list)
  - Base64-encodes images from the imbo.werdenktwas.de domain
  - Falls back to text content extraction if structured data is unavailable
- Output: Generates structured JSON with all processed items
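The attachment lookup mentioned above can be done with plain substring scanning over the fetched HTML. A minimal sketch; the helper name and the exact markers are assumptions, and the real logic in item_fetcher.zig may differ:

```zig
const std = @import("std");

/// Hypothetical helper: locate the .bms-attachments block and return the
/// first imbo.werdenktwas.de image URL found inside it.
fn firstAttachmentUrl(html: []const u8) ?[]const u8 {
    const section = std.mem.indexOf(u8, html, "bms-attachments") orelse return null;
    const rest = html[section..];
    const start = std.mem.indexOf(u8, rest, "https://imbo.werdenktwas.de/") orelse return null;
    const tail = rest[start..];
    // A URL inside a src attribute ends at the closing quote.
    const end = std.mem.indexOfScalar(u8, tail, '"') orelse return null;
    return tail[0..end];
}

test "finds the image URL inside an attachments block" {
    const html =
        \\<div class="bms-attachments">
        \\  <img src="https://imbo.werdenktwas.de/users/prod-wdw/images/abc.jpg">
        \\</div>
    ;
    try std.testing.expectEqualStrings(
        "https://imbo.werdenktwas.de/users/prod-wdw/images/abc.jpg",
        firstAttachmentUrl(html).?,
    );
}
```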
Each cached item is stored as JSON:
{
"timestamp": 1672531200,
"url": "https://example.com/item/123",
"html_content": "<html>...</html>",
"title": "Item Title",
"pub_date": "Wed, 02 Jul 2025 18:32:19 +0000"
}
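A minimal sketch of the caching scheme described above: the SHA-256 hex digest of the URL names the cache file, and the file's modification time decides whether the entry is still fresh. The function names are illustrative, the real code in cache.zig may differ, and standard-library details vary slightly between Zig versions:

```zig
const std = @import("std");
const Sha256 = std.crypto.hash.sha2.Sha256;

/// Cache key: hex-encoded SHA-256 digest of the item URL.
fn cacheKey(url: []const u8) [Sha256.digest_length * 2]u8 {
    var digest: [Sha256.digest_length]u8 = undefined;
    Sha256.hash(url, &digest, .{});
    return std.fmt.bytesToHex(digest, .lower);
}

/// A cache entry is stale once the file's modification time is older than
/// the configured expiry (168 hours / 7 days by default).
fn isStale(dir: std.fs.Dir, name: []const u8, expiry_hours: u64) !bool {
    const st = dir.statFile(name) catch |err| switch (err) {
        error.FileNotFound => return true, // nothing cached yet
        else => return err,
    };
    const age_ns = std.time.nanoTimestamp() - st.mtime;
    return age_ns > @as(i128, expiry_hours) * std.time.ns_per_hour;
}
```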
The final JSON output contains an array of processed items:
[
{
"id": 2067877,
"title": "Vergessene Bake",
"url": "https://meldeplattform-rad.muenchenunterwegs.de/bms/2067877",
"pub_date": "Wed, 09 Jul 2025 19:19:59 +0000",
"creation_date": "09.07.2025",
"address": "Adolf-Kolping-Straße 10, 80336 München",
"borough": "Ludwigsvorstadt-Isarvorstadt",
"description": "Hier steht seit Monaten eine vergessene Bake...",
"images": [
{
"url": "https://imbo.werdenktwas.de/users/prod-wdw/images/5rvUOwm6OLqEqHrel1ynxoA_8v5XfK9T.jpg?...",
"base64_data": "/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAEBAQEBAQEBAQEBAQEBAQIBAQEBAQIBAQECAgICAgICAgIDAwQDAwMDAwICAwQDAwQEBAQEAgMFBQQEBQQEBAT/..."
}
],
"cached": false,
"html_length": 27750
}
]
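Each entry in images carries the original URL plus the base64-encoded bytes. Since the project uses only the Zig standard library, the encoding step presumably relies on std.base64; a minimal, hypothetical sketch:

```zig
const std = @import("std");

/// Hypothetical helper: base64-encode fetched image bytes for embedding
/// in the JSON output. Assumes the standard (padded) base64 alphabet.
fn encodeImage(allocator: std.mem.Allocator, image_bytes: []const u8) ![]const u8 {
    const encoder = std.base64.standard.Encoder;
    const buf = try allocator.alloc(u8, encoder.calcSize(image_bytes.len));
    return encoder.encode(buf, image_bytes);
}

test "JPEG header encodes to the /9j/ prefix seen in the output" {
    const data = "\xff\xd8\xff\xe0"; // first four bytes of a JPEG file
    const encoded = try encodeImage(std.testing.allocator, data);
    defer std.testing.allocator.free(encoded);
    try std.testing.expectEqualStrings("/9j/4A==", encoded);
}
```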
- Network failures: Automatic retry with exponential backoff, up to 3 attempts (see the sketch after this list)
- Parse failures: Logged and skipped, processing continues with other items
- Cache failures: Falls back to direct fetch
- Graceful degradation: Returns partial results if some items fail
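A sketch of that backoff pattern, assuming a 1-second base delay that doubles after each failed attempt; the actual delays and error handling in http_client.zig may differ:

```zig
const std = @import("std");

/// Hypothetical wrapper: retry a fetch up to 3 times, sleeping 1 s, then 2 s
/// between attempts (exponential backoff). Returns the last error on failure.
fn fetchWithRetry(
    comptime fetchFn: fn ([]const u8) anyerror![]u8,
    url: []const u8,
) anyerror![]u8 {
    const max_attempts: usize = 3;
    var attempt: usize = 0;
    while (true) {
        return fetchFn(url) catch |err| {
            attempt += 1;
            if (attempt >= max_attempts) return err;
            // 1 s, 2 s, 4 s, ... doubling with each failed attempt
            const delay_s = std.math.shl(u64, 1, attempt - 1);
            std.time.sleep(delay_s * std.time.ns_per_s);
            continue;
        };
    }
}
```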
- Parallel Processing: Multi-threaded execution with configurable worker count (see the sketch after this list)
- Memory Management: Uses arena allocators for temporary data, explicit cleanup
- Caching: Reduces network load with 7-day cache expiration
- Thread Safety: Mutex-protected operations for concurrent access
- Load Balancing: Optimal work distribution across worker threads
- Resource Limits: Configurable limits for HTTP response sizes
- Performance Monitoring: Built-in timing and throughput metrics
- Image Processing: Concurrent image fetching and base64 encoding
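A stripped-down sketch of the worker fan-out: each thread takes an interleaved share of the item URLs, which keeps the load roughly balanced. This is illustrative only; the real scheduling, mutex protection, and arena allocators in the parallel build are more involved.

```zig
const std = @import("std");

/// Each worker processes an interleaved share of the item URLs.
fn worker(urls: []const []const u8, worker_id: usize, worker_count: usize) void {
    var i: usize = worker_id;
    while (i < urls.len) : (i += worker_count) {
        // In the real program: check the cache, fetch the HTML page,
        // extract content and images, and store the result.
        std.debug.print("worker {d}: {s}\n", .{ worker_id, urls[i] });
    }
}

pub fn main() !void {
    // Placeholder URLs; in practice these come from the parsed RSS feed.
    const urls = [_][]const u8{ "https://example.org/item/1", "https://example.org/item/2" };
    const url_slice: []const []const u8 = urls[0..];

    const worker_count: usize = 4; // default worker count
    var threads: [worker_count]std.Thread = undefined;
    for (&threads, 0..) |*t, id| {
        t.* = try std.Thread.spawn(.{}, worker, .{ url_slice, id, worker_count });
    }
    for (threads) |t| t.join();
}
```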
Parallel processing provides significant speedup for RSS feeds with many items:
- 2x speedup with 2 workers on typical workloads
- 3-4x speedup with 4 workers on I/O-bound operations
- Automatic fallback to sequential processing for small workloads
- Efficient scaling up to the number of available CPU cores
# Build the Docker image
docker build -t meldeplattform-scraper .
# Run with output to stdout
docker run --rm meldeplattform-scraper
# Save output to file (with volume mount)
docker run --rm -v $(pwd)/output:/app/output meldeplattform-scraper --output /app/output/munich_reports.json
# Run with docker-compose
docker-compose up
# Pull from GitHub Container Registry
docker pull ghcr.io/[username]/meldeplattform-scraper:latest
GitHub Actions automatically builds and publishes Docker images to GitHub Container Registry on every push to the main branch.
Key constants can be modified in the source files:
- CACHE_EXPIRY_HOURS in cache.zig (default: 168 hours / 7 days)
- RSS_URL in main.zig
- HTTP timeout and retry settings in http_client.zig
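For example, a longer cache lifetime might be configured like this (assumed shape of the constant; check cache.zig for the exact declaration):

```zig
// cache.zig
pub const CACHE_EXPIRY_HOURS: u64 = 168; // 7 days; lower for fresher data, raise to hit the network less
```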
| Option | Short | Description | Example |
|---|---|---|---|
| --help | -h | Show usage information | ./scraper -h |
| --output <file> | -o <file> | Save JSON to file instead of stdout | ./scraper -o data.json |
- Zig standard library only
- No external dependencies required
- Uses built-in HTTP client, JSON parser, and crypto functions
This project is licensed under the Open Software License 3.0 (OSL-3.0). See the LICENSE file for details.