NHL Surf

An efficient, deep-crawling URL scraper for NHL.com and its properties. It discovers and extracts all URLs from the NHL website ecosystem and saves them to text files.

Some of these scrapers take a long time to run, because NHL.com's rate limits must be respected. A built-in checkpoint system lets you resume scraping from where you left off, and the scraper saves URLs to text files as they are discovered, so you won't lose progress if it crashes or you need to stop it.
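
As a rough illustration of that crash-safe output (a minimal sketch, assuming URLs arrive one at a time; append_url and the path are made-up names, not the project's actual code):

import os

def append_url(url, path="runs/20240101_120000_urls.txt"):
    os.makedirs("runs", exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")    # write each URL as soon as it is discovered
        f.flush()              # flush to disk so a crash loses nothing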

This was built using Claude. Feel free to expand it as needed; it was a quick project to collect all the URLs from the NHL website and its properties.

Features

  • Rate Limited: 1.5-2 second delays between requests to avoid being blocked
  • Retry Logic: Handles rate limiting with exponential backoff (see the sketch after this list)
  • Persistent Output: Saves URLs immediately to prevent data loss on crashes
  • NHL Focused: Scrapes URLs from nhl.com, nhle.com, and related domains
  • Timestamped Output: Results saved to runs/{timestamp}_urls.txt
  • Checkpoint Support: Resume scraping from where you left off
  • Graceful Shutdown: Ctrl+C saves checkpoint before exiting
  • JavaScript Bundle Parsing: Extract URLs from minified JavaScript bundles
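
For illustration, a minimal sketch of the rate limiting and backoff described above, using the requests library (fetch_with_backoff and the exact delay values are assumptions, not the project's code):

import random
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    delay = random.uniform(1.5, 2.0)       # base delay between requests
    for _ in range(max_retries):
        time.sleep(delay)                  # rate limit: wait before every request
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:        # 429 = Too Many Requests
            return resp
        delay *= 2                         # exponential backoff before retrying
    raise RuntimeError(f"gave up on {url} after {max_retries} retries")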

Installation & Usage

With uv (recommended)

uv sync
uv run python main.py scrape

With pip

pip install -e .
python main.py scrape

Scrapers

1. NHL Website Scraper

Scrapes all NHL.com URLs and saves them to text files.

# Fresh start
python main.py scrape

# Resume from checkpoint
python main.py scrape --checkpoint runs/20240101_120000_checkpoint.json

# Custom checkpoint interval (save every 50 URLs)
python main.py scrape --checkpoint-interval 50

2. JavaScript Bundle Parser

Extracts URLs from JavaScript bundles by parsing the code for URL patterns.

# Parse a JavaScript bundle
python main.py parse-js https://records.nhl.com/static/js/client.bundle.js

# Parse with custom output directory
python main.py parse-js https://records.nhl.com/static/js/client.bundle.js --output-dir custom_output

JavaScript Parser features:

  • Downloads and parses JavaScript bundles from any URL
  • Extracts URLs using multiple regex patterns (see the sketch after this list)
  • Handles various URL formats: absolute, relative, API endpoints, fetch calls
  • Normalizes and deduplicates URLs
  • Saves results to timestamped text files
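
A rough sketch of regex-based URL extraction from a downloaded bundle (the patterns, the extract_urls name, and the normalization step are illustrative assumptions; the actual parser may differ):

import re

import requests

def extract_urls(bundle_url):
    js = requests.get(bundle_url, timeout=30).text
    patterns = [
        r'https?://[^\s"\'`<>)]+',         # absolute URLs
        r'["\'`](/api/[^"\'`\s]+)["\'`]',  # quoted relative API paths
    ]
    found = set()                          # a set deduplicates automatically
    for pattern in patterns:
        for match in re.findall(pattern, js):
            found.add(match.rstrip(".,;")) # strip trailing punctuation
    return sorted(found)

for url in extract_urls("https://records.nhl.com/static/js/client.bundle.js"):
    print(url)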

Example JavaScript bundles to parse:

  • https://records.nhl.com/static/js/client.bundle.js
  • https://www.nhl.com/static/js/main.bundle.js
  • Any bundled JavaScript file containing URLs

3. API Scraper (Standalone)

Specialized scraper that only saves URLs containing "api" or "nhle" anywhere in the URL.

# Fresh API scraping
python api_scraper.py

# Resume from checkpoint
python api_scraper.py --checkpoint runs/20240101_120000_api_checkpoint.json

API Scraper targets:

  • records.nhl.com/site/api/players
  • api-web.nhle.com/v1/roster
  • www.nhl.com/api/v1/teams
  • Any URL containing "api" or "nhle"
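
A minimal sketch of the "api"/"nhle" substring filter described above (is_api_url is an illustrative name, not the scraper's actual function):

def is_api_url(url):
    lowered = url.lower()
    return "api" in lowered or "nhle" in lowered

assert is_api_url("https://api-web.nhle.com/v1/roster")
assert not is_api_url("https://www.nhl.com/news")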

Checkpoint System

Both scrapers support automatic checkpoints:

  • Auto-save: Every 100 URLs processed (configurable)
  • Graceful shutdown: Ctrl+C saves checkpoint before exiting (see the sketch after this list)
  • Resume capability: Load previous state and continue scraping
  • Crash recovery: Never lose progress due to unexpected interruptions
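
A rough sketch of how the auto-save and graceful shutdown can fit together (the JSON layout and all names here are assumptions, not the project's actual checkpoint format):

import json
import signal
import sys

state = {"visited": [], "queue": ["https://www.nhl.com"]}

def save_checkpoint(path="runs/20240101_120000_checkpoint.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)

def handle_sigint(signum, frame):
    save_checkpoint()                      # Ctrl+C saves before exiting
    sys.exit(0)

signal.signal(signal.SIGINT, handle_sigint)

processed = 0
while state["queue"]:
    state["visited"].append(state["queue"].pop())
    processed += 1
    if processed % 100 == 0:               # auto-save every 100 URLs
        save_checkpoint()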

Output Files

  • Main scraper: runs/YYYYMMDD_HHMMSS_urls.txt
  • API scraper: runs/YYYYMMDD_HHMMSS_api_urls.txt
  • Checkpoints: runs/YYYYMMDD_HHMMSS_checkpoint.json
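
For reference, a sketch of how these timestamped paths can be generated (build_output_path is an illustrative helper, not the project's code):

from datetime import datetime
from pathlib import Path

def build_output_path(suffix="urls.txt"):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20240101_120000
    Path("runs").mkdir(exist_ok=True)
    return Path("runs") / f"{stamp}_{suffix}"

print(build_output_path())                # runs/20240101_120000_urls.txt
print(build_output_path("api_urls.txt"))  # runs/20240101_120000_api_urls.txt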

The scrapers will continue until all discoverable NHL-related URLs are found and saved.
