Production-scale data infrastructure for U.S. legislative transparency
This GitHub organization maintains the backend data pipelines that power Windy Civi — an open-source initiative to make legislative data permanent, verifiable, and accessible.
We build reliable, reproducible pipelines that process legislative data from all 50 states and Congress, creating a blockchain-style archive that survives administration changes, website redesigns, and data loss.
Each state repository in this organization runs automated workflows that:
- 🔁 Scrape legislative data nightly using OpenStates scrapers
- 🧼 Sanitize output by removing ephemeral fields for deterministic versioning
- 🧠 Structure data in blockchain-style directories with full provenance tracking
- 🔗 Link bills, votes, and events across sessions and chambers
- 📄 Extract full text from PDFs, XMLs, and HTML (multi-format fallback)
- 📊 Monitor data quality with automated orphan detection
- 📂 Commit processed data to GitHub for permanent, auditable storage
Result: A complete, versioned legislative record for each state — resilient to source changes and always accessible.
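The sanitize step above can be illustrated with a minimal sketch. It assumes the ephemeral fields are known dictionary keys; the names in `EPHEMERAL_KEYS` are hypothetical placeholders, not the pipeline's actual list. The idea is that dropping run-specific values and serializing with sorted keys makes two scrapes of unchanged data produce identical bytes, so Git records no diff.

```python
import json
from typing import Any

# Hypothetical key names; the real set of ephemeral fields is not listed in this README.
EPHEMERAL_KEYS = {"scraped_at", "updated_at", "_id"}


def sanitize(record: Any) -> Any:
    """Recursively drop ephemeral keys from dicts, leaving other values intact."""
    if isinstance(record, dict):
        return {k: sanitize(v) for k, v in record.items() if k not in EPHEMERAL_KEYS}
    if isinstance(record, list):
        return [sanitize(item) for item in record]
    return record


def to_stable_json(record: Any) -> str:
    """Serialize with sorted keys so equal content always yields equal bytes."""
    return json.dumps(sanitize(record), sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Two records that differ only in key order or in an ephemeral field serialize to the same string, which is what makes the committed files deterministic.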
✅ Incremental Processing - Only processes new/changed bills (no wasted compute)
✅ Auto-Save Failsafe - Commits every 30 minutes so progress survives the 6-hour GitHub Actions job timeout
✅ Data Quality Monitoring - Tracks orphaned bills (votes without bill data) to catch scraper issues
✅ Concurrent Updates - Multiple jobs can safely update repos with git rebase conflict resolution
✅ Multi-Format Text Extraction - XML → HTML → PDF fallback with strikethrough detection
✅ Fault Tolerance - Individual bill failures don't block the entire pipeline
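As a rough illustration of the Concurrent Updates feature, the sketch below retries a push after rebasing onto the remote. It is an assumption about how such a loop might look, not the workflows' actual code: the function names, the `main` branch, and the retry count are hypothetical. The Git commands themselves (`git push`, `git pull --rebase`, `git rebase --abort`) are standard.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_git(args: Sequence[str]) -> int:
    """Run a git command in the current directory and return its exit code."""
    return subprocess.run(["git", *args], check=False).returncode


def push_with_rebase(run: Runner = run_git, branch: str = "main", max_attempts: int = 5) -> bool:
    """Push; if the push is rejected, rebase onto the remote and retry."""
    for _ in range(max_attempts):
        if run(["push", "origin", branch]) == 0:
            return True
        # Push was rejected (likely because another job pushed first): replay local commits.
        if run(["pull", "--rebase", "origin", branch]) != 0:
            # A real conflict needs attention; abort so the working tree stays clean.
            run(["rebase", "--abort"])
            return False
    return False
```

Passing the runner as a parameter keeps the retry logic testable without a real repository.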
- Orchestration: GitHub Actions (Docker + Python)
- Language: Python 3.12+ (pipenv, black, type hints)
- Text Extraction: pdfplumber, PyPDF2, BeautifulSoup4, lxml
- Data Format: JSON with ISO 8601 timestamps and semantic versioning
- Storage: Git (GitHub) for versioning + permanent URLs
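The XML → HTML → PDF fallback mentioned earlier could be wired up roughly as follows. This is a sketch under assumptions: the extractor callables are placeholders that would wrap the libraries listed above (for example lxml for XML, BeautifulSoup4 for HTML, pdfplumber for PDF), and `extract_with_fallback` is a hypothetical name. Strikethrough detection is not covered here.

```python
from pathlib import Path
from typing import Callable, Optional

Extractor = Callable[[Path], Optional[str]]


def extract_with_fallback(
    bill_dir: Path, extractors: list[tuple[str, Extractor]]
) -> tuple[Optional[str], Optional[str]]:
    """Try each (suffix, extractor) pair in order; return (text, suffix used)."""
    for suffix, extract in extractors:
        for path in sorted(bill_dir.glob(f"*{suffix}")):
            try:
                text = extract(path)
            except Exception:
                # A broken file should not stop the chain; fall through to the next format.
                continue
            if text and text.strip():
                return text, suffix
    return None, None
```

Callers would pass the extractors in preference order, for instance `[(".xml", xml_fn), (".html", html_fn), (".pdf", pdf_fn)]`, where the three functions are whatever wrappers the pipeline defines.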
| Jurisdiction | Status | Bills Tracked | Last Updated |
|---|---|---|---|
| 🇺🇸 Federal (USA) | ✅ Production | ~15,000+ | Daily |
| 🏛️ Illinois (IL) | ✅ Production | ~8,000+ | Daily |
| 🏛️ Tennessee (TN) | ✅ Production | ~2,000+ | Daily |
| 🏛️ Texas (TX) | ✅ Production | ~7,000+ | Daily |
| 🏛️ Wisconsin (WI) | ✅ Production | ~1,500+ | Daily |
| 🏛️ Wyoming (WY) | ✅ Production | ~500+ | Daily |
| ... | 🚧 Onboarding | - | - |
Template Repository: windy-civi-template-pipeline
Each state repository produces structured, version-controlled output:
    data_output/
    ├── data_processed/
    │   └── country:us/state:il/
    │       └── sessions/103/
    │           ├── bills/
    │           │   └── HB1234/
    │           │       ├── metadata.json   # Bill data + processing timestamps
    │           │       ├── files/          # Extracted text
    │           │       │   ├── HB1234_text.xml
    │           │       │   ├── HB1234_text.pdf
    │           │       │   └── HB1234_text_extracted.txt
    │           │       └── logs/           # Actions, votes, events
    │           └── events/                 # Committee hearings
    ├── orphaned_placeholders_tracking.json # Data quality monitoring
    └── data_not_processed/                 # Error logs by category
Key Features:
- `_processing` timestamps track when logs and text were last updated
- Incremental updates only touch changed bills
- Orphan tracking identifies bills with votes/events but no metadata
- Errors categorized by type (download failures, parsing errors, etc.)
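To make the orphan idea concrete, here is a minimal sketch that scans one session directory in the layout shown above and reports bills that have log files but no `metadata.json`. The function name is hypothetical, and the format of `orphaned_placeholders_tracking.json` is not described here, so the sketch only returns the bill IDs.

```python
from pathlib import Path


def find_orphans(session_dir: Path) -> list[str]:
    """Return IDs of bills that have log files but no metadata.json."""
    bills_dir = session_dir / "bills"
    if not bills_dir.is_dir():
        return []
    orphans = []
    for bill_dir in sorted(bills_dir.iterdir()):
        if not bill_dir.is_dir():
            continue
        logs_dir = bill_dir / "logs"
        # A bill counts as having activity if its logs/ directory holds any entry.
        has_logs = logs_dir.is_dir() and any(logs_dir.iterdir())
        if has_logs and not (bill_dir / "metadata.json").exists():
            orphans.append(bill_dir.name)
    return orphans
```

A persistent orphan across several runs would suggest the scraper is missing the bill itself, while one that disappears on the next run is likely just an ordering artifact.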
Pipelines run as modular GitHub Actions in two stages:
- Scrape and format
  - Scrape - Docker-based OpenStates scraper
  - Format - Process data, link events, monitor quality
- Text extraction
  - Runs separately to handle 4-6 hour extraction times
  - Auto-saves progress every 30 minutes
  - Resumes on restart if a timeout occurs
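The auto-save and resume behavior of the long-running extraction stage might look something like the loop below. This is a sketch, not the pipeline's implementation: `process`, `save`, and the `done` set are hypothetical stand-ins for the real extraction step, the commit step, and whatever progress record the pipeline persists between runs.

```python
import time
from typing import Callable, Iterable


def process_with_autosave(
    bill_ids: Iterable[str],
    process: Callable[[str], None],
    save: Callable[[], None],
    done: set[str],
    interval_seconds: float = 30 * 60,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Process bills, skipping finished ones and saving progress on a timer."""
    last_save = clock()
    for bill_id in bill_ids:
        if bill_id in done:
            continue  # already handled by an earlier, interrupted run
        process(bill_id)
        done.add(bill_id)
        if clock() - last_save >= interval_seconds:
            save()
            last_save = clock()
    save()  # final save so the last partial interval is not lost
```

Because completed bills are recorded in `done`, a run that is killed at the timeout loses at most one interval of work, and the next run picks up where it left off.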
Why this design?
- Decouples fast metadata updates from slow text extraction
- Prevents scraping from blocking on extraction timeouts
- Enables independent debugging of each stage
- Allows selective re-runs (e.g., re-extract without re-scraping)
Patterns Implemented:
- Incremental processing with timestamp-based change detection
- Two-level timestamps (bill-level + action-level)
- Fault-tolerant design (individual failures don't block pipeline)
- Data quality monitoring with orphan detection
- Auto-save for long-running processes
- Git-based conflict resolution for concurrent updates
Code Quality:
- Type hints throughout
- Black code formatting
- Modular handler/utility architecture
- Comprehensive error logging
- Unit + integration tests
Each pipeline includes:
- ✅ Test suite with synthetic and real data
- ✅ Incremental processing validation (100% skip rate on re-runs with unchanged input)
- ✅ Orphan detection with multi-run simulation
- ✅ Text extraction with multi-format fallback testing
Example: USA pipeline test with 205 bills - all placeholders cleaned, 0 orphans detected.
Comprehensive docs in the main repository:
- Main Repository - Full technical documentation
- Setup Guide - State pipeline setup
- Incremental Processing - How updates work
- Orphan Tracking - Data quality monitoring
We're building automated social media bots that turn legislative data into actionable content for organizations, advocacy groups, and elected officials.
Why bots? We built a civic engagement app first - and nobody used it. The lesson: don't make people come to your platform. Meet people where they already are (Twitter, BlueSky, etc.). Instead of building another civic tech app that sits unused, we're integrating legislative transparency into the feeds people already follow.
Organizations sign up and configure:
- Topics of interest (keywords, policy areas, bill sponsors)
- Social platforms (BlueSky, Twitter/X, etc.)
- Posting preferences (frequency, format, tone)
The system automatically:
- Monitors legislative activity across all states
- Matches new bills/actions to org-specific topics
- Generates shareable posts (summaries, links, impact notes)
- Publishes to configured social accounts
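The matching step could start as simple keyword matching, sketched below. This is only an illustration of the idea: the function name and the shape of the org configuration are hypothetical, and the planned system may well use AI-based classification instead.

```python
import re


def matching_orgs(bill_text: str, org_topics: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each org to the configured keywords found in the bill text."""
    matches: dict[str, list[str]] = {}
    for org, keywords in org_topics.items():
        hits = [
            kw
            for kw in keywords
            # Whole-word, case-insensitive match so "rent" does not hit "current".
            if re.search(rf"\b{re.escape(kw)}\b", bill_text, flags=re.IGNORECASE)
        ]
        if hits:
            matches[org] = hits
    return matches
```

Returning the matched keywords alongside each org makes it easy to show reviewers why a bill was selected before a post is approved.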
Advocacy Organizations:
- Black Lives Matter → Auto-post bills about criminal justice reform, police accountability, voting rights
- LGBTQIA+ Groups → Track legislation affecting LGBTQ+ rights, healthcare access, discrimination laws
- Environmental Orgs → Monitor climate policy, renewable energy bills, conservation funding
Elected Officials:
- State Senators → Auto-post when they sponsor or co-sponsor legislation
- Members of Congress → Share bills they vote on with constituent-friendly summaries
- Local Representatives → Keep constituents informed about state-level impacts
Researchers & Journalists:
- Policy Think Tanks → Track legislation in specific domains (education, healthcare, housing)
- News Outlets → Get alerts on high-impact bills before they become news
- Data Pipeline → Provides clean, structured, timestamped bill data
- AI Layer → Summarizes bills, extracts themes, identifies relevance
- Bot Engine → Generates platform-specific posts, manages posting schedules
- API → Allows orgs to configure topics, approve/reject posts, view analytics
Why this matters: Most people don't read legislative data, but they do follow organizations they trust on social media. This bridges that gap — making legislative transparency automatic, accessible, and actionable.
Phase 1: Infrastructure ✅
- ✅ Incremental processing (complete)
- ✅ Auto-save failsafe (complete)
- ✅ Orphan tracking (complete)
- 🚧 Expanding to all 50 states
- 🚧 Historical data backfill
Phase 2: Intelligence Layer 🚧
- 🚧 AI-powered bill summarization
- 🚧 Topic classification and keyword matching
- 🚧 Impact analysis (who's affected, how)
- 🚧 Automated post generation
Phase 3: Engagement Platform 📋
- 📋 Organization signup and configuration
- 📋 Social media API integrations
- 📋 Post approval workflows
- 📋 Analytics dashboard
Phase 4: Decentralization 🔮
- Integration with decentralized storage (IPFS)
- Blockchain-based provenance tracking
- Federated bot hosting
This project was built with AI assistance (ChatGPT, Cursor) as technical collaborators. AI helped with:
- Architecture exploration and design tradeoffs
- Workflow refactoring and optimization
- Documentation and error handling
- Debugging complex edge cases
The result: more maintainable code, better documentation, and faster iteration — while deepening my understanding of distributed systems, data pipelines, and production-grade automation.
Interested in civic tech? We welcome:
- 🏛️ State pipeline onboarding - Help add your state
- 🐛 Bug reports - Found an issue? Open a ticket
- 💡 Feature ideas - Suggest improvements
- 📖 Documentation - Help clarify setup or usage
Getting Started:
- Check out the main repository
- Review the setup guide
- Open an issue or PR
- Main Project: Windy Civi
- Website: windycivi.com
- Organization: Chicago-based civic tech initiative
Building open, durable civic infrastructure — one state at a time. 🏛️
Part of the Windy Civi ecosystem.