A GitHub Actions-powered pipeline that scrapes, cleans, versions, and extracts text from state legislative data sourced from Open States. This repository acts as the standardized template for all state-level pipelines in the Windy Civi ecosystem.
Each state pipeline provides a self-contained automation workflow to:
- 🧹 Scrape data for a single U.S. state from the OpenStates project
- 🧼 Sanitize the data by removing ephemeral fields (`_id`, `scraped_at`) for deterministic output (see the sketch below)
- 🧠 Format it into a blockchain-style, versioned structure with incremental processing
- 🔗 Link events to bills and sessions automatically
- 🩺 Monitor data quality by tracking orphaned bills
- 📄 Extract full text from bills, amendments, and supporting documents (PDFs, XMLs, HTMLs)
- 📝 Commit the formatted output and extracted text nightly (or manually) with auto-save
This approach keeps every state repository consistent, auditable, and easy to maintain.
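To make the sanitize step concrete, here is a minimal sketch of stripping the ephemeral fields named above before committing. The field names (`_id`, `scraped_at`) come from this README; the file name and helper are illustrative, not the toolkit's actual code:

```python
import json

EPHEMERAL_FIELDS = {"_id", "scraped_at"}  # fields named in this README

def sanitize(value):
    """Recursively drop ephemeral fields so repeated scrapes yield identical output."""
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items() if k not in EPHEMERAL_FIELDS}
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value

with open("bill.json") as f:  # hypothetical scraped record
    clean = sanitize(json.load(f))

# Stable key order keeps the serialized output deterministic across runs.
print(json.dumps(clean, indent=2, sort_keys=True))
```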
Key features:

- 🔄 Incremental Processing - Only processes new or updated bills (no duplicate work! See the sketch after this list)
- 💾 Auto-Save Failsafe - Commits progress every 30 minutes during text extraction
- 🩺 Data Quality Monitoring - Tracks orphaned bills (votes/events without bill data)
- 🔗 Bill-Event Linking - Automatically connects committee hearings and events to bills
- ⏱️ Timestamp Tracking - Two-level timestamps for logs and text extraction
- 🎯 Multi-Format Text Extraction - XML → HTML → PDF with fallbacks
- 🔀 Concurrent Job Support - Multiple runs can safely update the same repository
- 📋 Detailed Error Logging - Categorized errors for easy debugging
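To illustrate the incremental gate: the repository layout below includes `.windycivi/latest_timestamp_seen.txt`, which records the last processed timestamp. A sketch of how a skip check can work against it, assuming ISO-8601 UTC timestamps (illustrative, not the toolkit's exact logic):

```python
from pathlib import Path

MARKER = Path(".windycivi/latest_timestamp_seen.txt")  # written after each run

def needs_processing(bill_updated_at: str) -> bool:
    """Process a bill only if it changed after the last recorded timestamp."""
    if not MARKER.exists():
        return True  # first run: process everything
    last_seen = MARKER.read_text().strip()
    # ISO-8601 UTC timestamps sort correctly as plain strings.
    return bill_updated_at > last_seen

if needs_processing("2025-01-15T14:30:00Z"):
    print("bill is new or updated; formatting it")
```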
To set up a new state pipeline:

- Click the green "Use this template" button on this repository page to create a new repository from this template.
- Name your new repository using the convention `xx-data-pipeline` (e.g., `il-data-pipeline`, `tx-data-pipeline`).
- Update the state abbreviation in both workflow files:
  In `.github/workflows/scrape-and-format-data.yml`:

  ```yaml
  env:
    STATE_CODE: nm # CHANGE THIS to your state abbreviation

  jobs:
    scrape:
      steps:
        - name: Scrape data
          uses: windy-civi/toolkit/actions/scrape@main
          with:
            state: ${{ env.STATE_CODE }}
    format:
      steps:
        - name: Format data
          uses: windy-civi/toolkit/actions/format@main
          with:
            state: ${{ env.STATE_CODE }}
  ```
  In `.github/workflows/extract-text.yml`:

  ```yaml
  - name: Extract text
    uses: windy-civi/toolkit/actions/extract@main
    with:
      state: nm # CHANGE THIS to your state abbreviation
  ```

  Make sure the state abbreviation matches the folder name used in Open States scrapers.
- Enable GitHub Actions in your repo (if not already enabled).
- (Optional) Enable nightly runs by ensuring the `schedule` blocks are uncommented in both workflow files:

  ```yaml
  on:
    workflow_dispatch:
    schedule:
      - cron: "0 1 * * *" # For scrape-and-format-data.yml
      # or
      - cron: "0 3 * * *" # For extract-text.yml (runs later to avoid overlap)
  ```
The pipeline runs in two stages:
**Stage 1: Scrape & Format** consists of two separate jobs that run sequentially:
- Scrape Job - Downloads legislative data using OpenStates scrapers
- Format Job - Processes scraped data, links events, and monitors quality
**Stage 2: Text Extraction** is an independent workflow that extracts full bill text from documents.
This separation allows:
- ✅ Faster metadata updates
- ✅ Independent monitoring and debugging
- ✅ Text extraction can time out and restart without affecting scraping
- ✅ Better resource management (text extraction can take hours)
```
xx-data-pipeline/
├── .github/workflows/
│   ├── scrape-and-format-data.yml     # Metadata scraping + formatting
│   └── extract-text.yml               # Text extraction (independent)
├── country:us/
│   └── state:xx/                      # state:usa for federal, state:il for Illinois, etc.
│       └── sessions/
│           └── {session_id}/
│               ├── bills/
│               │   └── {bill_id}/
│               │       ├── metadata.json        # Bill data + _processing timestamps
│               │       ├── files/               # Extracted text & documents
│               │       │   ├── *.pdf            # Original PDFs
│               │       │   ├── *.xml            # Original XMLs
│               │       │   └── *_extracted.txt  # Extracted text
│               │       └── logs/                # Action/event/vote logs
│               └── events/                      # Committee hearings
│                   └── {timestamp}_hearing.json
├── .windycivi/                        # Pipeline metadata (committed)
│   ├── errors/                        # Processing errors
│   │   ├── text_extraction_errors/    # Text extraction failures
│   │   │   ├── download_failures/     # Failed downloads
│   │   │   ├── parsing_errors/        # Failed text parsing
│   │   │   └── missing_files/         # Missing source files
│   │   ├── missing_session/           # Bills without session info
│   │   ├── event_archive/             # Archived event data
│   │   └── orphaned_placeholders_tracking.json  # Data quality monitoring
│   ├── bill_session_mapping.json      # Bill-to-session mappings (flattened)
│   ├── sessions.json                  # Session metadata (flattened)
│   └── latest_timestamp_seen.txt      # Last processed timestamp
├── Pipfile, Pipfile.lock
└── README.md
```
Formatted metadata is saved to `country:us/state:xx/sessions/`, organized by session and bill.
Each bill directory contains:
- `metadata.json` → structured information about the bill with `_processing` timestamps
- `logs/` → action, event, and vote logs
- `files/` → original documents and extracted text
Example `metadata.json` structure:
```json
{
  "identifier": "HB 1234",
  "title": "Example Bill",
  "_processing": {
    "logs_latest_update": "2025-01-15T14:30:00Z",
    "text_extraction_latest_update": "2025-01-16T08:00:00Z"
  },
  "actions": [
    {
      "description": "Introduced in House",
      "date": "2025-01-01",
      "_processing": {
        "log_file_created": "2025-01-01T12:00:00Z"
      }
    }
  ]
}
```
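The two-level timestamps above make staleness checks cheap. A hypothetical check using only the `_processing` fields shown in the example (the comparison rule is an assumption, not the toolkit's documented behavior):

```python
import json
from pathlib import Path

meta = json.loads(Path("metadata.json").read_text())
proc = meta.get("_processing", {})

logs_ts = proc.get("logs_latest_update")
text_ts = proc.get("text_extraction_latest_update")

# If the bill's logs changed after the last text extraction, re-extract.
if text_ts is None or (logs_ts is not None and logs_ts > text_ts):
    print(f"{meta['identifier']}: text extraction is stale")
```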
When text extraction is enabled, each bill directory also includes:

- `files/` → original documents and extracted text
  - `*.pdf` → original PDF documents
  - `*.xml` → original XML bill text
  - `*.html` → original HTML documents
  - `*_extracted.txt` → plain text extracted from documents
Failed items are logged separately:
- `.windycivi/errors/text_extraction_errors/download_failures/` → documents that couldn't be downloaded
- `.windycivi/errors/text_extraction_errors/parsing_errors/` → documents that couldn't be parsed
- `.windycivi/errors/text_extraction_errors/missing_files/` → bills missing source files
- `.windycivi/errors/missing_session/` → bills without session information
The pipeline automatically tracks orphaned bills: bills that have vote events or hearings but no actual bill data. Check `.windycivi/errors/orphaned_placeholders_tracking.json` periodically to identify data quality issues:
```json
{
  "HB999": {
    "first_seen": "2025-01-21T12:00:00Z",
    "last_seen": "2025-01-23T14:30:00Z",
    "occurrence_count": 3,
    "session": "103",
    "vote_count": 2,
    "event_count": 0,
    "path": "country:us/state:il/sessions/103/bills/HB999"
  }
}
```

What to look for:
- Bills with a high `occurrence_count` (3+) are chronic orphans and likely indicate data quality issues (see the sketch below)
- Check for typos in bill identifiers or scraper configuration
- Orphans resolve automatically when the bill data arrives! 🎉
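For the occurrence-count check, a small illustrative script over the tracking file (the path and the threshold of 3 come from this README; the script itself is not part of the toolkit):

```python
import json
from pathlib import Path

TRACKING = Path(".windycivi/errors/orphaned_placeholders_tracking.json")
tracking = json.loads(TRACKING.read_text())

for bill_id, info in tracking.items():
    if info["occurrence_count"] >= 3:  # chronic-orphan threshold from above
        print(
            f"{bill_id}: seen {info['occurrence_count']}x in session "
            f"{info['session']} since {info['first_seen']}"
        )
```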
📖 See the orphan tracking documentation for more details.
Each run includes detailed logs to track progress and capture failures:
For scraping and formatting:

- Logs are saved per bill under `logs/`
- Processing summary shows total bills, events, and votes processed
- Session mapping tracks bill-to-session relationships
- Orphan tracking shows new, existing, and resolved orphans
For text extraction:

- Download attempts with success/failure status
- Extraction method used (XML, HTML, PDF)
- Error details saved to `text_extraction_errors/`
- Auto-save commits every 30 minutes prevent data loss (sketched after this list)
- Summary reports include:
  - Total documents processed
  - Successful extractions by type
  - Skipped (already extracted) documents
  - Failed downloads/extractions with reasons
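As a rough sketch of the auto-save failsafe described above — the 30-minute interval and the committed paths come from this README, while the commit command and message are assumptions, not the toolkit's exact behavior:

```python
import subprocess
import time

SAVE_INTERVAL = 30 * 60  # seconds: the 30-minute cadence described above
_last_save = time.monotonic()

def maybe_autosave():
    """Commit partial progress so a timeout or crash loses at most ~30 minutes of work."""
    global _last_save
    if time.monotonic() - _last_save < SAVE_INTERVAL:
        return
    subprocess.run(["git", "add", "country:us/", ".windycivi/"], check=True)
    # check=False: "nothing to commit" is not an error for a checkpoint.
    subprocess.run(
        ["git", "commit", "-m", "auto-save: text extraction progress"],
        check=False,
    )
    _last_save = time.monotonic()
```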
Pipelines are fault-tolerant: if a bill fails, the workflow continues for all others.
The text extraction workflow supports:
| Type | Format | Extraction Method | Notes |
|---|---|---|---|
| Bills | XML | Direct XML parsing | Primary bill text |
| Bills | PDF | pdfplumber + PyPDF2 | With strikethrough detection |
| Bills | HTML | BeautifulSoup | Fallback for HTML-only sources |
| Amendments | PDF | pdfplumber + PyPDF2 | State amendments only |
| Documents | PDF/HTML | Auto-detect | CBO reports, committee reports |
Note: Federal congress.gov HTML amendments are currently skipped due to blocking issues; XML bill versions from govinfo.gov work reliably.
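The XML → HTML → PDF preference in the table amounts to a dispatch-by-format chain. A simplified sketch using the libraries the table names (pdfplumber, BeautifulSoup); strikethrough detection and the PyPDF2 fallback are omitted, and this is not the toolkit's exact code:

```python
import xml.etree.ElementTree as ET

import pdfplumber
from bs4 import BeautifulSoup

def extract_text(path: str) -> str:
    """Pick an extraction method by file type, richest format first."""
    if path.endswith(".xml"):
        # Primary bill text: join all text nodes in the XML document.
        return " ".join(ET.parse(path).getroot().itertext())
    if path.endswith(".html"):
        with open(path, encoding="utf-8") as f:
            return BeautifulSoup(f.read(), "html.parser").get_text(" ", strip=True)
    if path.endswith(".pdf"):
        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for image-only pages.
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    raise ValueError(f"unsupported format: {path}")
```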
**Scrape action:**

```yaml
uses: windy-civi/toolkit/actions/scrape@main
with:
  state: nm # State abbreviation (required)
  github-token: ${{ secrets.GITHUB_TOKEN }}
  use-scrape-cache: "false" # Set "true" to skip scraping and use cached data
```

**Format action:**

```yaml
uses: windy-civi/toolkit/actions/format@main
with:
  state: nm # State abbreviation (required)
  github-token: ${{ secrets.GITHUB_TOKEN }}
```

**Extract action:**

```yaml
uses: windy-civi/toolkit/actions/extract@main
with:
  state: nm # State abbreviation (required)
  github-token: ${{ secrets.GITHUB_TOKEN }}
```

By default, raw scraped data (`_data/`) is not stored, to keep the repository lightweight.
To enable storing raw data, uncomment the copy and commit steps in your workflow file:
```yaml
- name: Copy Scraped Data to Repo
  run: |
    mkdir -p "$GITHUB_WORKSPACE/_data/$STATE"
    cp -r "${RUNNER_TEMP}/_working/_data/$STATE"/* "$GITHUB_WORKSPACE/_data/$STATE/"
```

And include `_data` in the commit:
```bash
git add _data country:us/ .windycivi/
```

To disable again (the default), comment out the copy step and exclude `_data` from the commit command:
```bash
git add country:us/ .windycivi/
```

Once scheduled runs are enabled, the workflows run automatically:
- Scrape & Format: 1am UTC daily
- Text Extraction: 3am UTC daily (runs independently)
To trigger a run manually:

- Go to the Actions tab in GitHub
- Select the workflow (Scrape & Format or Extract Text)
- Click Run workflow
- Choose the branch and click Run
To run the pipeline locally:

```bash
# Clone the repository
git clone https://github.com/YOUR-ORG/xx-data-pipeline
cd xx-data-pipeline

# Install dependencies
pipenv install

# Run scraping and formatting
pipenv run python scrape_and_format/main.py \
  --state nm \
  --openstates-data-folder /path/to/scraped/data \
  --git-repo-folder /path/to/output

# Run text extraction (with incremental flag)
pipenv run python text_extraction/main.py \
  --state nm \
  --data-folder /path/to/output \
  --output-folder /path/to/output \
  --incremental
```

See the `known_problems/` directory in the main repository for:
- State-specific scraper issues
- Formatter validation issues
- Text extraction limitations
- Status of all 56 jurisdictions
- GitHub Actions tab shows all runs
- Green checkmark = success
- Red X = failure (click for logs)
- Review `.windycivi/errors/orphaned_placeholders_tracking.json` for data issues
- Look for chronic orphans (`occurrence_count` >= 3)
- Check `.windycivi/errors/` for formatting/extraction errors (see the sketch below for a quick tally)
- Monitor auto-save commits during text extraction runs
**Scraping fails:**
- Check if OpenStates scraper for your state is working
- Verify state abbreviation matches OpenStates format
- Check for new legislative sessions not yet configured
**Text extraction fails or times out:**

- Check `.windycivi/errors/text_extraction_errors/` for details
- Look for auto-save commits (the pipeline saves progress every 30 minutes)
- Re-run the workflow - it will resume from where it left off (incremental)
- Review error logs for specific bills
**Orphaned bills appear:**

- Check `orphaned_placeholders_tracking.json` for details
- Verify bill identifiers match between scraper and vote/event data
- Bills may auto-resolve on next scrape if it's a timing issue
**Push conflicts:**

- The pipeline handles conflicts automatically with `git pull --rebase`
- If manual resolution is needed, check the logs for the specific conflicts
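The conflict handling in that last bullet follows a pull-rebase-retry pattern. A sketch of the general idea (illustrative; the toolkit's actual retry logic may differ):

```python
import subprocess

def push_with_rebase(retries: int = 3) -> None:
    """Rebase onto the remote before each push attempt so concurrent jobs don't collide."""
    for _ in range(retries):
        subprocess.run(["git", "pull", "--rebase"], check=True)
        if subprocess.run(["git", "push"]).returncode == 0:
            return
    raise RuntimeError("push failed after rebase retries")
```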
This template is part of the Windy Civi project. If you're onboarding a new state or improving the automation, feel free to open an issue or PR.
Main Repository: https://github.com/windy-civi/toolkit
For discussions, join our community on Slack or GitHub Discussions.
- ✅ Verify both workflows are enabled
- ✅ Test with a manual trigger first (start with Scrape & Format)
- ✅ Check output in `country:us/state:xx/sessions/`
- ✅ Review `.windycivi/errors/orphaned_placeholders_tracking.json` for data quality
- ✅ Check for any errors in `.windycivi/errors/`
- ✅ Test the text extraction workflow independently
- ✅ Enable scheduled runs once testing is successful
- ✅ Monitor the first few automated runs for issues
- Incremental Processing Guide - How incremental updates work
- Orphan Tracking Guide - Understanding data quality monitoring
- Main Repository README - Full technical documentation
Part of the Windy Civi ecosystem, building a transparent, verifiable civic data archive for all 50 states.