RSS Cache Parser

A Zig-based RSS feed parser with local caching and HTML content extraction for the Munich bike reporting platform.

Features

  • Fetches RSS feed from https://meldeplattform-rad.muenchenunterwegs.de/bms/rss
  • Parses RSS XML to extract item metadata
  • Parallel processing with configurable worker threads (4 workers by default)
  • Fetches individual HTML pages for each RSS item concurrently
  • Local file-based caching with 7-day (168-hour) expiration
  • Extracts meaningful content from HTML pages
  • Image extraction and base64 encoding from .bms-attachments sections
  • Outputs structured JSON data with embedded images
  • Retry logic with exponential backoff for network requests
  • Memory-safe implementation using Zig's explicit memory management
  • Performance optimizations with thread-safe operations and load balancing

Project Structure

├── build.zig              # Zig build configuration
├── src/
│   ├── main.zig          # Entry point and orchestration
│   ├── rss_parser.zig    # RSS XML parsing
│   ├── http_client.zig   # HTTP client with retry logic
│   ├── cache.zig         # File-based caching system
│   ├── item_fetcher.zig  # HTML content fetching and parsing
│   └── json_output.zig   # JSON serialization
└── cache/                # Cache directory (created at runtime)

Installation & Usage

Prerequisites

  • Zig 0.11.0 or later

Build

zig build

Run

Sequential version (original):

zig build run
# or
./zig-out/bin/rss-cache-parser [OPTIONS]

Parallel version (optimized):

zig build run-parallel
# or
./zig-out/bin/rss-cache-parser-parallel [OPTIONS]

CLI Options

Both versions support the following command-line options:

# Show help
./zig-out/bin/rss-cache-parser-parallel --help
./zig-out/bin/rss-cache-parser-parallel -h

# Save JSON output to file
./zig-out/bin/rss-cache-parser-parallel --output munich_reports.json
./zig-out/bin/rss-cache-parser-parallel -o data.json

# Default behavior (output to stdout)
./zig-out/bin/rss-cache-parser-parallel

Performance Benchmark

Run the included benchmark script:

python3 benchmark.py

How It Works

  1. RSS Parsing: Fetches and parses the RSS feed to extract item URLs and metadata
  2. Caching Strategy (see the Zig sketch after this list):
    • Uses the SHA-256 hash of the item URL as the cache filename
    • Stores cached data as JSON with a timestamp
    • Checks the file modification time against the 7-day expiration
  3. HTML Processing:
    • Fetches individual item HTML pages
    • Extracts meaningful content (meta description, title, paragraphs)
    • Extracts images from the .bms-attachments CSS class
    • Base64-encodes images from the imbo.werdenktwas.de domain
    • Falls back to plain text extraction if structured data is unavailable
  4. Output: Generates structured JSON with all processed items
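
The cache-key and image-encoding steps above can be sketched in a few lines of Zig. This is a minimal illustration assuming the Zig 0.11 std APIs; the helper names (cacheKeyForUrl, isFresh, encodeImageBase64) are illustrative, not the project's actual functions:

const std = @import("std");

const cache_expiry_hours: u64 = 168; // 7 days, mirroring CACHE_EXPIRY_HOURS in cache.zig

/// Cache filename: hex-encoded SHA-256 digest of the item URL.
fn cacheKeyForUrl(url: []const u8) [64]u8 {
    var digest: [32]u8 = undefined;
    std.crypto.hash.sha2.Sha256.hash(url, &digest, .{});
    var name: [64]u8 = undefined;
    _ = std.fmt.bufPrint(&name, "{}", .{std.fmt.fmtSliceHexLower(&digest)}) catch unreachable;
    return name;
}

/// Freshness check: compare the cache file's mtime against the expiry window.
fn isFresh(file: std.fs.File) !bool {
    const stat = try file.stat();
    const age_ns: i128 = std.time.nanoTimestamp() - stat.mtime;
    return age_ns < cache_expiry_hours * std.time.ns_per_hour;
}

/// Base64-encode fetched image bytes for embedding in the JSON output.
fn encodeImageBase64(allocator: std.mem.Allocator, image_bytes: []const u8) ![]u8 {
    const enc = std.base64.standard.Encoder;
    const buf = try allocator.alloc(u8, enc.calcSize(image_bytes.len));
    _ = enc.encode(buf, image_bytes);
    return buf;
}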

Cache Structure

Each cached item is stored as JSON:

{
  "timestamp": 1672531200,
  "url": "https://example.com/item/123",
  "html_content": "<html>...</html>",
  "title": "Item Title",
  "pub_date": "Wed, 02 Jul 2025 18:32:19 +0000"
}
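
For reference, a Zig struct mirroring this layout could be parsed with std.json (a sketch assuming Zig 0.11; CacheEntry and loadEntry are illustrative names, not the project's actual types):

const std = @import("std");

/// Struct matching the cache entry layout shown above.
const CacheEntry = struct {
    timestamp: i64,
    url: []const u8,
    html_content: []const u8,
    title: []const u8,
    pub_date: []const u8,
};

/// Parse a cache file's bytes; the caller frees the result with deinit().
fn loadEntry(allocator: std.mem.Allocator, bytes: []const u8) !std.json.Parsed(CacheEntry) {
    return std.json.parseFromSlice(CacheEntry, allocator, bytes, .{});
}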

Output Format

The final JSON output contains an array of processed items:

[
  {
    "id": 2067877,
    "title": "Vergessene Bake",
    "url": "https://meldeplattform-rad.muenchenunterwegs.de/bms/2067877",
    "pub_date": "Wed, 09 Jul 2025 19:19:59 +0000",
    "creation_date": "09.07.2025",
    "address": "Adolf-Kolping-Straße 10, 80336 München",
    "borough": "Ludwigsvorstadt-Isarvorstadt",
    "description": "Hier steht seit Monaten eine vergessene Bake...",
    "images": [
      {
        "url": "https://imbo.werdenktwas.de/users/prod-wdw/images/5rvUOwm6OLqEqHrel1ynxoA_8v5XfK9T.jpg?...",
        "base64_data": "/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAEBAQEBAQEBAQEBAQEBAQIBAQEBAQIBAQECAgICAgICAgIDAwQDAwMDAwICAwQDAwQEBAQEAgMFBQQEBQQEBAT/..."
      }
    ],
    "cached": false,
    "html_length": 27750
  }
]

Error Handling

  • Network failures: Automatic retry with exponential backoff (up to 3 attempts; see the sketch below)
  • Parse failures: Logged and skipped, processing continues with other items
  • Cache failures: Falls back to direct fetch
  • Graceful degradation: Returns partial results if some items fail
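
The retry behavior can be pictured with a short sketch. It is illustrative only: the real logic lives in http_client.zig, and both fetchOnce and the 500 ms base delay are assumptions:

const std = @import("std");

const max_attempts = 3;
const base_delay_ms: u64 = 500; // assumed base delay; http_client.zig holds the real value

/// Hypothetical single-attempt fetch standing in for the real HTTP request.
fn fetchOnce(allocator: std.mem.Allocator, url: []const u8) ![]u8 {
    _ = allocator;
    _ = url;
    return error.ConnectionRefused; // stub so the sketch compiles
}

/// Retry wrapper: on failure, wait 500 ms, 1 s, 2 s, ... then try again,
/// giving up after max_attempts and returning the last error.
fn fetchWithRetry(allocator: std.mem.Allocator, url: []const u8) ![]u8 {
    var attempt: u32 = 0;
    while (true) : (attempt += 1) {
        if (fetchOnce(allocator, url)) |body| {
            return body;
        } else |err| {
            if (attempt + 1 >= max_attempts) return err;
            const delay_ms = base_delay_ms << @intCast(attempt);
            std.time.sleep(delay_ms * std.time.ns_per_ms);
        }
    }
}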

Performance Features

  • Parallel Processing: Multi-threaded execution with configurable worker count
  • Memory Management: Uses arena allocators for temporary data, explicit cleanup
  • Caching: Reduces network load with 7-day cache expiration
  • Thread Safety: Mutex-protected operations for concurrent access (see the worker-pool sketch below)
  • Load Balancing: Even work distribution across worker threads
  • Resource Limits: Configurable limits for HTTP response sizes
  • Performance Monitoring: Built-in timing and throughput metrics
  • Image Processing: Concurrent image fetching and base64 encoding
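
As a picture of the thread-safe work distribution, here is a minimal worker-pool sketch (Queue and worker are hypothetical names, not the project's actual code):

const std = @import("std");

/// Workers pull item indices from a shared counter guarded by a mutex.
const Queue = struct {
    mutex: std.Thread.Mutex = .{},
    next: usize = 0,
    total: usize,

    fn take(self: *Queue) ?usize {
        self.mutex.lock();
        defer self.mutex.unlock();
        if (self.next >= self.total) return null;
        const i = self.next;
        self.next += 1;
        return i;
    }
};

fn worker(queue: *Queue) void {
    while (queue.take()) |i| {
        // Process item i here: fetch HTML, parse, encode images, ...
        std.debug.print("worker handling item {d}\n", .{i});
    }
}

pub fn main() !void {
    var queue = Queue{ .total = 16 }; // e.g. 16 RSS items
    var threads: [4]std.Thread = undefined; // 4 workers, matching the default
    for (&threads) |*t| t.* = try std.Thread.spawn(.{}, worker, .{&queue});
    for (threads) |t| t.join();
}

Pulling indices from a mutex-guarded counter keeps every worker busy regardless of how long individual items take, which is one simple way to achieve the load balancing described above.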

Performance Improvements

Parallel processing provides significant speedup for RSS feeds with many items:

  • 2x speedup with 2 workers on typical workloads
  • 3-4x speedup with 4 workers on I/O-bound operations
  • Automatic fallback to sequential processing for small workloads
  • Efficient scaling up to the number of available CPU cores

Docker Usage

Build and Run with Docker

# Build the Docker image
docker build -t meldeplattform-scraper .

# Run with output to stdout
docker run --rm meldeplattform-scraper

# Save output to file (with volume mount)
docker run --rm -v $(pwd)/output:/app/output meldeplattform-scraper --output /app/output/munich_reports.json

# Run with docker-compose
docker-compose up

# Pull from GitHub Container Registry
docker pull ghcr.io/rmoriz/meldeplattform-scraper:latest

Automated Builds

GitHub Actions automatically builds and publishes Docker images to the GitHub Container Registry on every push to the main branch.

Configuration

Key constants can be modified in the source files (illustrated below):

  • CACHE_EXPIRY_HOURS in cache.zig (default: 168 hours / 7 days)
  • RSS_URL in main.zig
  • HTTP timeout and retry settings in http_client.zig
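
For example (illustrative declarations; check the source files for the exact names and types):

// cache.zig
pub const CACHE_EXPIRY_HOURS: u64 = 168; // hours; 7 days

// main.zig
pub const RSS_URL = "https://meldeplattform-rad.muenchenunterwegs.de/bms/rss";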

CLI Options Reference

Option             Short       Description                           Example
--help             -h          Show usage information                ./scraper -h
--output <file>    -o <file>   Save JSON to file instead of stdout   ./scraper -o data.json

Dependencies

  • Zig standard library only
  • No external dependencies required
  • Uses built-in HTTP client, JSON parser, and crypto functions

License

This project is licensed under the Open Software License 3.0 (OSL-3.0). See the LICENSE file for details.
