A Python tool that checks websites for broken links and catalogs internal assets.
- Crawls a website starting from a root URL, respecting URL hierarchy boundaries (it won't crawl "up" above the starting URL)
- Detects broken internal links
- Catalogs references to non-HTML assets (images, text files, etc.)
- Only visits each page once
- Checks external links but does not crawl them
- Provides detailed logging
- Allows specifying paths to exclude from internal asset reporting
- Supports checking but not crawling specific website sections
pip install rms-link-checker
Or from source:
git clone https://github.com/SETI/rms-link-checker.git
cd rms-link-checker
pip install -e .
You can also install using pipx, which allows you to install the software and its dependencies in isolation without needing to set up a virtual environment:
pipx install rms-link-checker
link_checker https://example.com
- `--verbose` or `-v`: Increase verbosity (can be used multiple times)
- `--output` or `-o`: Specify output file for results (default: stdout)
- `--log-file`: Write log messages to a file (in addition to console output)
- `--log-level`: Set the minimum level for messages in the log file (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `--timeout`: Timeout in seconds for HTTP requests (default: 10.0)
- `--max-requests`: Maximum number of requests to make (default: unlimited)
- `--max-depth`: Maximum depth to crawl (default: unlimited)
- `--max-threads`: Maximum number of concurrent threads for requests (default: 10)
- `--ignore-asset-paths-file`: Specify a file containing paths to ignore when reporting internal assets (one per line)
- `--ignore-internal-paths-file`: Specify a file containing paths to check once but not crawl (one per line)
- `--ignore-external-links-file`: Specify a file containing external links to ignore in reporting (one per line)
Simple check:
link_checker https://example.com
Check a specific section of a website (won't crawl to parent directories):
link_checker https://example.com/section/subsection
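To make the boundary behavior concrete, here is a minimal sketch of the kind of check involved (a rough illustration only; the function name, normalization details, and exact boundary rules are assumptions, not the tool's actual code):

```python
from urllib.parse import urlsplit

def within_boundary(candidate: str, root: str) -> bool:
    """Rough check: is candidate at or below the root URL in the path hierarchy?"""
    c, r = urlsplit(candidate), urlsplit(root)
    if (c.scheme, c.netloc) != (r.scheme, r.netloc):
        return False  # different host: an external link, checked but never crawled
    boundary = r.path if r.path.endswith("/") else r.path + "/"
    return (c.path + "/").startswith(boundary)

# With root https://example.com/section/subsection:
#   https://example.com/section/subsection/page.html -> True  (crawled)
#   https://example.com/section/other.html           -> False (outside the hierarchy)
```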
Ignore specific asset paths:
# Create a file with paths to ignore
echo "/images" > ignore_assets.txt
echo "css" >> ignore_assets.txt # Leading slash is optional
echo "scripts" >> ignore_assets.txt
link_checker https://example.com --ignore-asset-paths-file ignore_assets.txt
Check but don't crawl specific sections:
# Create a file with paths to check but not crawl
echo "docs" > ignore_crawl.txt # Leading slash is optional
echo "/blog" >> ignore_crawl.txt
link_checker https://example.com --ignore-internal-paths-file ignore_crawl.txt
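Conceptually, each line in these files is treated as a path prefix, with the leading slash optional. A minimal sketch of that kind of normalization (names and details here are assumptions, not the tool's implementation):

```python
def load_ignore_paths(filename: str) -> list[str]:
    """Read one path per line, normalizing so that a leading slash is optional."""
    prefixes = []
    with open(filename) as f:
        for line in f:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            if not path.startswith("/"):
                path = "/" + path  # "css" and "/css" behave the same
            prefixes.append(path)
    return prefixes

def is_ignored(url_path: str, prefixes: list[str]) -> bool:
    """True if the URL path falls under any ignored prefix."""
    return any(url_path.startswith(p) for p in prefixes)
```

Under this reading, the `/blog` entry in `ignore_crawl.txt` would match a page like `/blog/2024/post.html`, which is then checked once but not crawled further.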
Verbose output with detailed logging:
link_checker https://example.com -vv
Verbose output with logs written to a file:
link_checker https://example.com -vv --log-file=link_checker.log
Verbose output with logs written to a file, but only warnings and errors:
link_checker https://example.com -vv --log-file=link_checker.log --log-level=WARNING
Limit crawl depth and set a longer timeout:
link_checker https://example.com --max-depth=3 --timeout=30.0
Limit the number of requests to avoid overwhelming the server:
link_checker https://example.com --max-requests=50
Control the number of concurrent threads for faster checking on a powerful system:
link_checker https://example.com --max-threads=20
Or reduce threads to be more gentle on the server:
link_checker https://example.com --max-threads=4
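The `--max-threads` option caps how many HTTP requests are in flight at once. As a generic illustration of that kind of bounded concurrency (not the tool's internals; the URLs and helper names are made up), a fixed-size thread pool does the limiting:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def check(url: str) -> int:
    """Fetch a URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=10.0) as resp:
        return resp.status

urls = ["https://example.com/", "https://example.com/about"]
with ThreadPoolExecutor(max_workers=4) as pool:  # analogous to --max-threads=4
    future_to_url = {pool.submit(check, u): u for u in urls}
    for fut in as_completed(future_to_url):
        url = future_to_url[fut]
        try:
            print(url, fut.result())
        except Exception as exc:  # e.g. HTTPError or URLError for a broken link
            print(url, "broken:", exc)
```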
The report includes:
- Configuration summary (root URL, hierarchy boundary, and ignored paths)
- Broken links found (grouped by page)
- Internal assets (grouped by type)
- Summary with counts (visited pages, broken links, assets)
- Stats on ignored assets, limited-crawl sections, and URLs outside hierarchy
Information on contributing to this package can be found in the Contributing Guide.
This code is licensed under the Apache License v2.0.