NHL Surf

An efficient, deep-crawling URL scraper for NHL.com and its properties. It discovers and extracts all URLs from the NHL website ecosystem and saves them to text files.

Some of these scrapers take a long time to run, because NHL.com's rate limits must be respected. A built-in checkpoint system lets you resume scraping from where you left off, and the scraper saves URLs to text files as they are discovered, so you won't lose progress if it crashes or you need to stop it.
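
As a rough illustration of that crash-safe output (a minimal sketch, assuming URLs arrive one at a time; append_url and the path are made-up names, not the project's actual code):

import os

def append_url(url, path="runs/20240101_120000_urls.txt"):
    os.makedirs("runs", exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")    # write each URL as soon as it is discovered
        f.flush()              # flush to disk so a crash loses nothing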

This was built using Claude. Feel free to expand it as needed; it was a quick project to collect all the URLs from the NHL website and its properties.

Features

  • Rate Limited: 1.5-2 second delays between requests to avoid being blocked
  • Retry Logic: Handles rate limiting with exponential backoff (see the sketch after this list)
  • Persistent Output: Saves URLs immediately to prevent data loss on crashes
  • NHL Focused: Scrapes URLs from nhl.com, nhle.com, and related domains
  • Timestamped Output: Results saved to runs/{timestamp}_urls.txt
  • Checkpoint Support: Resume scraping from where you left off
  • Graceful Shutdown: Ctrl+C saves checkpoint before exiting
  • JavaScript Bundle Parsing: Extract URLs from minified JavaScript bundles
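
For illustration, a minimal sketch of the rate limiting and backoff described above, using the requests library (fetch_with_backoff and the exact delay values are assumptions, not the project's code):

import random
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    delay = random.uniform(1.5, 2.0)       # base delay between requests
    for _ in range(max_retries):
        time.sleep(delay)                  # rate limit: wait before every request
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:        # 429 = Too Many Requests
            return resp
        delay *= 2                         # exponential backoff before retrying
    raise RuntimeError(f"gave up on {url} after {max_retries} retries")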

Installation & Usage

With uv (recommended)

uv sync
uv run python main.py scrape

With pip

pip install -e .
python main.py scrape

Scrapers

1. NHL Website Scraper

Scrapes all NHL.com URLs and saves them to text files.

# Fresh start
python main.py scrape

# Resume from checkpoint
python main.py scrape --checkpoint runs/20240101_120000_checkpoint.json

# Custom checkpoint interval (save every 50 URLs)
python main.py scrape --checkpoint-interval 50

2. JavaScript Bundle Parser

Extracts URLs from JavaScript bundles by parsing the code for URL patterns.

# Parse a JavaScript bundle
python main.py parse-js https://records.nhl.com/static/js/client.bundle.js

# Parse with custom output directory
python main.py parse-js https://records.nhl.com/static/js/client.bundle.js --output-dir custom_output

JavaScript Parser features:

  • Downloads and parses JavaScript bundles from any URL
  • Extracts URLs using multiple regex patterns (see the sketch after this list)
  • Handles various URL formats: absolute, relative, API endpoints, fetch calls
  • Normalizes and deduplicates URLs
  • Saves results to timestamped text files
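
A rough sketch of regex-based URL extraction from a downloaded bundle (the patterns, the extract_urls name, and the normalization step are illustrative assumptions; the actual parser may differ):

import re

import requests

def extract_urls(bundle_url):
    js = requests.get(bundle_url, timeout=30).text
    patterns = [
        r'https?://[^\s"\'`<>)]+',         # absolute URLs
        r'["\'`](/api/[^"\'`\s]+)["\'`]',  # quoted relative API paths
    ]
    found = set()                          # a set deduplicates automatically
    for pattern in patterns:
        for match in re.findall(pattern, js):
            found.add(match.rstrip(".,;")) # strip trailing punctuation
    return sorted(found)

for url in extract_urls("https://records.nhl.com/static/js/client.bundle.js"):
    print(url)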

Example JavaScript bundles to parse:

  • https://records.nhl.com/static/js/client.bundle.js
  • https://www.nhl.com/static/js/main.bundle.js
  • Any bundled JavaScript file containing URLs

3. API Scraper (Standalone)

Specialized scraper that only saves URLs containing "api" or "nhle" anywhere in the URL.

# Fresh API scraping
python api_scraper.py

# Resume from checkpoint
python api_scraper.py --checkpoint runs/20240101_120000_api_checkpoint.json

API Scraper targets:

  • records.nhl.com/site/api/players
  • api-web.nhle.com/v1/roster
  • www.nhl.com/api/v1/teams
  • Any URL containing "api" or "nhle"
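
A minimal sketch of the "api"/"nhle" substring filter described above (is_api_url is an illustrative name, not the scraper's actual function):

def is_api_url(url):
    lowered = url.lower()
    return "api" in lowered or "nhle" in lowered

assert is_api_url("https://api-web.nhle.com/v1/roster")
assert not is_api_url("https://www.nhl.com/news")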

Checkpoint System

Both scrapers support automatic checkpoints:

  • Auto-save: Every 100 URLs processed (configurable)
  • Graceful shutdown: Ctrl+C saves checkpoint before exiting (see the sketch after this list)
  • Resume capability: Load previous state and continue scraping
  • Crash recovery: Never lose progress due to unexpected interruptions
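
A rough sketch of how the auto-save and graceful shutdown can fit together (the JSON layout and all names here are assumptions, not the project's actual checkpoint format):

import json
import signal
import sys

state = {"visited": [], "queue": ["https://www.nhl.com"]}

def save_checkpoint(path="runs/20240101_120000_checkpoint.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)

def handle_sigint(signum, frame):
    save_checkpoint()                      # Ctrl+C saves before exiting
    sys.exit(0)

signal.signal(signal.SIGINT, handle_sigint)

processed = 0
while state["queue"]:
    state["visited"].append(state["queue"].pop())
    processed += 1
    if processed % 100 == 0:               # auto-save every 100 URLs
        save_checkpoint()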

Output Files

  • Main scraper: runs/YYYYMMDD_HHMMSS_urls.txt
  • API scraper: runs/YYYYMMDD_HHMMSS_api_urls.txt
  • Checkpoints: runs/YYYYMMDD_HHMMSS_checkpoint.json
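
For reference, a sketch of how these timestamped paths can be generated (build_output_path is an illustrative helper, not the project's code):

from datetime import datetime
from pathlib import Path

def build_output_path(suffix="urls.txt"):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20240101_120000
    Path("runs").mkdir(exist_ok=True)
    return Path("runs") / f"{stamp}_{suffix}"

print(build_output_path())                # runs/20240101_120000_urls.txt
print(build_output_path("api_urls.txt"))  # runs/20240101_120000_api_urls.txt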

The scrapers will continue until all discoverable NHL-related URLs are found and saved.
