An efficient and deep URL scraper for NHL.com and its properties. Discovers and extracts all URLs from the NHL website ecosystem and saves them to text files.
Some of these features take a long time to run because rate limits must be respected to avoid being blocked. A built-in checkpoint system lets you resume scraping from where you left off, and discovered URLs are written to text files as the scraper finds them, so progress survives a crash or a manual stop.
This was built using Claude. Feel free to expand it as needed; it was a quick project to collect all the URLs from the NHL website and its properties.
- Rate Limited: 1.5-2 second delays between requests to avoid being blocked
- Retry Logic: Handles rate limiting with exponential backoff (see the sketch after this list)
- Persistent Output: Saves URLs immediately to prevent data loss on crashes
- NHL Focused: Scrapes URLs from nhl.com, nhle.com, and related domains
- Timestamped Output: Results saved to runs/{timestamp}_urls.txt
- Checkpoint Support: Resume scraping from where you left off
- Graceful Shutdown: Ctrl+C saves checkpoint before exiting
- JavaScript Bundle Parsing: Extract URLs from minified JavaScript bundles
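The retry logic mentioned above boils down to a delay-plus-backoff loop. A minimal sketch, assuming the `requests` library; `polite_get`, the retry count, and the exact delays are illustrative, not the scraper's actual values:

```python
import random
import time

import requests

def polite_get(url: str, max_retries: int = 5) -> requests.Response:
    """Fetch a URL with a randomized politeness delay and exponential backoff on 429s."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(1.5, 2.0))  # 1.5-2s between requests, per the rate limit above
        resp = requests.get(url, timeout=30)
        if resp.status_code == 429:           # rate limited: wait 1s, 2s, 4s, ... and retry
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```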
# Install and run with uv
uv sync
uv run python main.py scrape

# Or with pip
pip install -e .
python main.py scrape
Scrapes all NHL.com URLs and saves them to text files.
# Fresh start
python main.py scrape
# Resume from checkpoint
python main.py scrape --checkpoint runs/20240101_120000_checkpoint.json
# Custom checkpoint interval (save every 50 URLs)
python main.py scrape --checkpoint-interval 50
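Conceptually, a checkpoint just serializes the crawl state so a later run can reload it. A minimal sketch, assuming a JSON file holding a visited set and a pending queue; the actual format of `runs/*_checkpoint.json` may contain more fields:

```python
import json
from pathlib import Path

def save_checkpoint(path: Path, visited: set[str], queue: list[str]) -> None:
    """Persist crawl state so a later run can pick up where this one stopped."""
    path.write_text(json.dumps({"visited": sorted(visited), "queue": queue}))

def load_checkpoint(path: Path) -> tuple[set[str], list[str]]:
    """Restore the visited set and pending queue from a checkpoint file."""
    state = json.loads(path.read_text())
    return set(state["visited"]), state["queue"]
```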
Extracts URLs from JavaScript bundles by parsing the code for URL patterns.
# Parse a JavaScript bundle
python main.py parse-js https://records.nhl.com/static/js/client.bundle.js
# Parse with custom output directory
python main.py parse-js https://records.nhl.com/static/js/client.bundle.js --output-dir custom_output
JavaScript Parser features:
- Downloads and parses JavaScript bundles from any URL
- Extracts URLs using multiple regex patterns (see the sketch after this list)
- Handles various URL formats: absolute, relative, API endpoints, fetch calls
- Normalizes and deduplicates URLs
- Saves results to timestamped text files
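The extraction step is essentially regex matching over the bundle text followed by normalization and deduplication. A simplified sketch; `URL_PATTERNS` stands in for the parser's larger pattern set, and the default base URL is an assumption:

```python
import re
from urllib.parse import urljoin

# Two illustrative patterns; the real parser applies several more.
URL_PATTERNS = [
    re.compile(r'https?://[^\s"\'<>\\)]+'),          # absolute URLs
    re.compile(r'["\'](/(?:api|v1)/[^"\']+)["\']'),  # quoted API-style paths
]

def extract_urls(bundle_text: str, base: str = "https://www.nhl.com") -> set[str]:
    """Pull URL-like strings out of minified JavaScript and deduplicate them."""
    found: set[str] = set()
    for pattern in URL_PATTERNS:
        for match in pattern.findall(bundle_text):
            found.add(urljoin(base, match.rstrip('",\'')))  # resolve relative paths; set dedupes
    return found
```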
Example JavaScript bundles to parse:
- https://records.nhl.com/static/js/client.bundle.js
- https://www.nhl.com/static/js/main.bundle.js
- Any bundled JavaScript file containing URLs
Specialized scraper that only saves URLs containing "api" or "nhle" anywhere in the URL.
# Fresh API scraping
python api_scraper.py
# Resume from checkpoint
python api_scraper.py --checkpoint runs/20240101_120000_api_checkpoint.json
API Scraper targets:
- records.nhl.com/site/api/players
- api-web.nhle.com/v1/roster
- www.nhl.com/api/v1/teams
- Any URL containing "api" or "nhle" (the matching rule is sketched after this list)
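The filter itself is a plain substring check over the whole URL; a minimal sketch (`is_api_url` is an illustrative name, not necessarily the function in `api_scraper.py`):

```python
def is_api_url(url: str) -> bool:
    """Keep only URLs containing "api" or "nhle" anywhere in the string."""
    lowered = url.lower()
    return "api" in lowered or "nhle" in lowered
```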
Both scrapers support automatic checkpoints:
- Auto-save: Every 100 URLs processed (configurable)
- Graceful shutdown: Ctrl+C saves a checkpoint before exiting (see the signal-handler sketch after this list)
- Resume capability: Load previous state and continue scraping
- Crash recovery: Never lose progress due to unexpected interruptions
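Catching Ctrl+C comes down to a SIGINT handler that flushes state before exiting. A sketch, assuming a `save_checkpoint` callback is available; the scrapers' actual handler may differ:

```python
import signal
import sys

def install_sigint_handler(save_checkpoint) -> None:
    """On Ctrl+C, write a final checkpoint, then exit cleanly."""
    def handler(signum, frame):
        save_checkpoint()
        sys.exit(0)
    signal.signal(signal.SIGINT, handler)
```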
- Main scraper: runs/YYYYMMDD_HHMMSS_urls.txt
- API scraper: runs/YYYYMMDD_HHMMSS_api_urls.txt
- Checkpoints: runs/YYYYMMDD_HHMMSS_checkpoint.json
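The timestamp prefix is strftime's %Y%m%d_%H%M%S. A sketch of how such paths can be built (`run_file` is an illustrative helper, not part of the codebase):

```python
from datetime import datetime
from pathlib import Path

def run_file(suffix: str) -> Path:
    """Build runs/YYYYMMDD_HHMMSS_<suffix>, e.g. runs/20240101_120000_urls.txt."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path("runs") / f"{stamp}_{suffix}"
    path.parent.mkdir(exist_ok=True)  # create runs/ on first use
    return path
```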
Both scrapers run until every discoverable NHL-related URL has been found and saved.