A Python tool that checks websites for broken links and catalogs internal assets.
- Crawls a website starting from a root URL, respecting URL hierarchy boundaries (it won't crawl "up" above the starting URL)
- Detects broken internal links
- Catalogs references to non-HTML assets (images, text files, etc.)
- Only visits each page once
- Checks external links but does not crawl them
- Provides detailed logging
- Allows specifying paths to exclude from internal asset reporting
- Supports checking but not crawling specific website sections
pip install rms-link-checker
Or from source:
git clone https://github.com/SETI/rms-link-checker.git
cd rms-link-checker
pip install -e .
You can also install using pipx, which allows you to install the software and its dependencies in isolation without needing to set up a virtual environment:
pipx install rms-link-checker
link_checker https://example.com
- `--verbose` or `-v`: Increase verbosity (can be used multiple times)
- `--output` or `-o`: Specify output file for results (default: stdout)
- `--log-file`: Write log messages to a file (in addition to console output)
- `--log-level`: Set the minimum level for messages in the log file (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `--timeout`: Timeout in seconds for HTTP requests (default: 10.0)
- `--max-requests`: Maximum number of requests to make (default: unlimited)
- `--max-depth`: Maximum depth to crawl (default: unlimited)
- `--max-threads`: Maximum number of concurrent threads for requests (default: 10)
- `--ignore-asset-paths-file`: Specify a file containing paths to ignore when reporting internal assets (one per line)
- `--ignore-internal-paths-file`: Specify a file containing paths to check once but not crawl (one per line)
- `--ignore-external-links-file`: Specify a file containing external links to ignore in reporting (one per line)
Simple check:
link_checker https://example.com
Check a specific section of a website (won't crawl to parent directories):
link_checker https://example.com/section/subsection
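To make the boundary behavior concrete, here is a minimal sketch of the kind of check involved (a rough illustration only; the function name, normalization details, and exact boundary rules are assumptions, not the tool's actual code):

```python
from urllib.parse import urlsplit

def within_boundary(candidate: str, root: str) -> bool:
    """Rough check: is candidate at or below the root URL in the path hierarchy?"""
    c, r = urlsplit(candidate), urlsplit(root)
    if (c.scheme, c.netloc) != (r.scheme, r.netloc):
        return False  # different host: an external link, checked but never crawled
    boundary = r.path if r.path.endswith("/") else r.path + "/"
    return (c.path + "/").startswith(boundary)

# With root https://example.com/section/subsection:
#   https://example.com/section/subsection/page.html -> True  (crawled)
#   https://example.com/section/other.html           -> False (outside the hierarchy)
```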
Ignore specific asset paths:
# Create a file with paths to ignore
echo "/images" > ignore_assets.txt
echo "css" >> ignore_assets.txt # Leading slash is optional
echo "scripts" >> ignore_assets.txt
link_checker https://example.com --ignore-asset-paths-file ignore_assets.txt
Check but don't crawl specific sections:
# Create a file with paths to check but not crawl
echo "docs" > ignore_crawl.txt # Leading slash is optional
echo "/blog" >> ignore_crawl.txt
link_checker https://example.com --ignore-internal-paths-file ignore_crawl.txt
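Conceptually, each line in these files is treated as a path prefix, with the leading slash optional. A minimal sketch of that kind of normalization (names and details here are assumptions, not the tool's implementation):

```python
def load_ignore_paths(filename: str) -> list[str]:
    """Read one path per line, normalizing so that a leading slash is optional."""
    prefixes = []
    with open(filename) as f:
        for line in f:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            if not path.startswith("/"):
                path = "/" + path  # "css" and "/css" behave the same
            prefixes.append(path)
    return prefixes

def is_ignored(url_path: str, prefixes: list[str]) -> bool:
    """True if the URL path falls under any ignored prefix."""
    return any(url_path.startswith(p) for p in prefixes)
```

Under this reading, the `/blog` entry in `ignore_crawl.txt` would match a page like `/blog/2024/post.html`, which is then checked once but not crawled further.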
Verbose output with detailed logging:
link_checker https://example.com -vv
Verbose output with logs written to a file:
link_checker https://example.com -vv --log-file=link_checker.log
Verbose output with logs written to a file, but only warnings and errors:
link_checker https://example.com -vv --log-file=link_checker.log --log-level=WARNING
Limit crawl depth and set a longer timeout:
link_checker https://example.com --max-depth=3 --timeout=30.0
Limit the number of requests to avoid overwhelming the server:
link_checker https://example.com --max-requests=50
Control the number of concurrent threads for faster checking on a powerful system:
link_checker https://example.com --max-threads=20
Or reduce threads to be more gentle on the server:
link_checker https://example.com --max-threads=4
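The `--max-threads` option caps how many HTTP requests are in flight at once. As a generic illustration of that kind of bounded concurrency (not the tool's internals; the URLs and helper names are made up), a fixed-size thread pool does the limiting:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def check(url: str) -> int:
    """Fetch a URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=10.0) as resp:
        return resp.status

urls = ["https://example.com/", "https://example.com/about"]
with ThreadPoolExecutor(max_workers=4) as pool:  # analogous to --max-threads=4
    future_to_url = {pool.submit(check, u): u for u in urls}
    for fut in as_completed(future_to_url):
        url = future_to_url[fut]
        try:
            print(url, fut.result())
        except Exception as exc:  # e.g. HTTPError or URLError for a broken link
            print(url, "broken:", exc)
```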
The report includes:
- Configuration summary (root URL, hierarchy boundary, and ignored paths)
- Broken links found (grouped by page)
- Internal assets (grouped by type)
- Summary with counts (visited pages, broken links, assets)
- Stats on ignored assets, limited-crawl sections, and URLs outside hierarchy
Information on contributing to this package can be found in the Contributing Guide.
This code is licensed under the Apache License v2.0.