ci: Add script &CI to check dead links #45
base: main
Changes from all commits
90b83b3
3225cf3
9d2cf21
dd4139c
964260d
a035d2a
8539dde
8f904d5
a6fbaed
4c833f6
ec114ff
5e3a74d
6228070
822a02d
bd3aa05
8c44ee4
0eb8fac
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
# Broken Links Checking Workflows | ||
|
||
This directory contains GitHub Actions workflows for checking broken links in the Eclipse EDC documentation website. | ||
|
||
## Workflows | ||
|
||
### 1. `check-broken-links.yaml` | ||
|
||
This workflow runs on pull requests to check for broken external links in the documentation. It uses the [lychee-action](https://github.com/lycheeverse/lychee-action) to find and report broken links. | ||
|
||
Key features: | ||
- Runs on PR open, synchronize, and reopen events | ||
- Checks only external links (HTTP/HTTPS) | ||
- Caches results to reduce API requests | ||
- Posts a comment on the PR if broken links are found | ||
- Limits concurrency, retries with a wait between attempts, and authenticates with the GitHub token to reduce GitHub rate-limit (429) errors | ||
|
||
### 2. `check-broken-links-schedule.yaml` | ||
|
||
This workflow runs on a schedule (weekly) to detect broken links and creates GitHub issues when problems are found. | ||
|
||
Key features: | ||
- Runs every Sunday at 00:00 UTC | ||
- Can also be triggered manually | ||
- Creates a GitHub issue with a detailed report when broken links are found | ||
- Uses caching to reduce API requests | ||
|
||
## Configuration | ||
|
||
The link checking is configured with: | ||
|
||
1. `.lycheeignore` file at the repository root - contains patterns of URLs to exclude from checking | ||
2. Command-line arguments in the workflows, which specify: | ||
- Cache settings | ||
- Path exclusions (e.g., `.git`, `node_modules`) | ||
- Only checking HTTP/HTTPS links | ||
- Retries and timeout settings | ||
- Concurrency limits to avoid rate limiting | ||
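To reproduce the check locally, the arguments listed above can be passed to the `lychee` CLI directly. The following is a minimal sketch, assuming `lychee` is installed and on `PATH`. The `lychee_cmd` function and the `RUN_LYCHEE` switch are hypothetical names introduced here, and the `--verbose` and `--github-token` arguments used by the workflows are left out for brevity.

```shell
#!/bin/sh
set -eu

# Build the lychee command line with the same options the workflows use.
# By default the command is only printed; set RUN_LYCHEE=1 to execute it.
lychee_cmd() {
  set -- lychee \
    --cache --max-cache-age 72h \
    --no-progress \
    --exclude-path .git --exclude-path node_modules \
    --exclude-path themes --exclude-path static/lib \
    --scheme https --scheme http \
    --max-retries 6 --retry-wait-time 10 \
    --timeout 45 --max-concurrency 4 \
    './**/*.md' './**/*.html'

  if [ "${RUN_LYCHEE:-0}" = "1" ]; then
    "$@"          # requires lychee to be installed
  else
    echo "$*"     # dry run: show what would be executed
  fi
}

lychee_cmd
```

Running the script from the repository root with `RUN_LYCHEE=1` should pick up `.lycheeignore` there, since that is where the workflows expect it.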
|
||
## Troubleshooting | ||
|
||
### Common Issues | ||
|
||
1. **GitHub API Rate Limiting (429 errors)** | ||
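- The workflow reduces API requests by caching results for up to 72 hours, limiting concurrency to 4, retrying up to 6 times with a wait between attempts, and passing the GitHub token to lychee | ||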
- If you still encounter this, consider adding specific GitHub patterns to `.lycheeignore` | ||
|
||
2. **False Positives for Hugo Site Links** | ||
- Relative links that work after Hugo builds the site might be reported as broken | ||
- These are ignored using patterns in `.lycheeignore` | ||
|
||
3. **High Link Count** | ||
- For better performance, the workflow focuses only on external (HTTP/HTTPS) links |
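When rate limiting or a high link count is the problem, it helps to know which hosts are linked most often before adding patterns to `.lycheeignore`. The sketch below counts external links per host in Markdown files. The `count_link_hosts` function is a hypothetical helper, not part of this PR, and its URL regex is deliberately simple, so treat the counts as approximate.

```shell
#!/bin/sh

# Print "count host" pairs for every http(s) link found in *.md files
# under the given directory (default: current directory), busiest host first.
count_link_hosts() {
  dir="${1:-.}"
  find "$dir" -type f -name '*.md' \
      ! -path '*/.git/*' ! -path '*/node_modules/*' \
      -exec grep -ohE 'https?://[A-Za-z0-9.-]+' {} + \
    | sed -E 's#^https?://##; s#\.+$##' \
    | sort | uniq -c | sort -rn
}
```

For example, `count_link_hosts content | head -n 10` would list the ten most frequently linked hosts under a `content` directory.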
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
name: Scheduled Broken Links Check | ||
|
||
on: | ||
# Run on schedule | ||
Review comment: Please remove obvious comments. |
||
schedule: | ||
- cron: "0 0 * * 0" # Runs at 00:00 UTC every Sunday | ||
# Manual trigger | ||
workflow_dispatch: | ||
|
||
jobs: | ||
check-links: | ||
runs-on: ubuntu-latest | ||
permissions: | ||
contents: read | ||
issues: write # Permission needed to create issues | ||
|
||
steps: | ||
- name: Checkout repository | ||
uses: actions/checkout@v4 | ||
with: | ||
fetch-depth: 0 | ||
|
||
# Optimize caching strategy | ||
- name: Restore lychee cache | ||
uses: actions/cache@v4 | ||
with: | ||
path: .lycheecache | ||
key: cache-lychee-scheduled-${{ github.sha }} | ||
restore-keys: | | ||
cache-lychee-scheduled- | ||
cache-lychee- | ||
|
||
# Check external links (HTTP/HTTPS URLs only) | ||
- name: Check external links | ||
id: lychee-external | ||
uses: lycheeverse/lychee-action@v2 | ||
env: | ||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
with: | ||
args: >- | ||
--cache | ||
--max-cache-age 72h | ||
--verbose | ||
--no-progress | ||
--exclude-path ".git" | ||
--exclude-path "node_modules" | ||
--exclude-path "themes" | ||
--exclude-path "static/lib" | ||
--scheme "https" | ||
--scheme "http" | ||
--max-retries 6 | ||
--retry-wait-time 10 | ||
--timeout 45 | ||
--max-concurrency 4 | ||
--github-token "${{ github.token }}" | ||
'./**/*.md' | ||
'./**/*.html' | ||
fail: false | ||
format: markdown | ||
output: ./lychee-external-report.md | ||
|
||
- name: Check report content | ||
id: check-report | ||
run: | | ||
if [ -f ./lychee-external-report.md ] && [ -s ./lychee-external-report.md ] && grep -q "Broken links found" ./lychee-external-report.md; then | ||
echo "broken_links=true" >> "$GITHUB_OUTPUT" | ||
else | ||
echo "broken_links=false" >> "$GITHUB_OUTPUT" | ||
fi | ||
|
||
- name: Create issue | ||
if: steps.lychee-external.outputs.exit_code != 0 && steps.check-report.outputs.broken_links == 'true' | ||
uses: peter-evans/create-issue-from-file@v5 | ||
with: | ||
title: 🔍 External Broken Links Report | ||
content-filepath: ./lychee-external-report.md | ||
labels: bug, documentation |
Review comment: Why do we need to keep two different workflows? They do pretty much the same thing; let's refactor them. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
name: Check Broken Links | ||
|
||
on: | ||
pull_request: | ||
types: [opened, synchronize, reopened] | ||
# Optional: Add scheduled checks | ||
# schedule: | ||
# - cron: "0 0 * * 0" # Runs once every Sunday | ||
|
||
jobs: | ||
check-links: | ||
runs-on: ubuntu-latest | ||
permissions: | ||
contents: read | ||
pull-requests: write | ||
|
||
steps: | ||
- name: Checkout repository | ||
uses: actions/checkout@v4 | ||
with: | ||
fetch-depth: 0 | ||
|
||
# Optimize caching strategy to reduce API requests | ||
- name: Restore lychee cache | ||
uses: actions/cache@v4 | ||
with: | ||
path: .lycheecache | ||
key: cache-lychee-${{ github.sha }} | ||
restore-keys: | | ||
cache-lychee-${{ github.event.pull_request.base.sha }} | ||
cache-lychee- | ||
|
||
# Check external links only | ||
- name: Check external links | ||
id: lychee-external | ||
uses: lycheeverse/lychee-action@v2 | ||
env: | ||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
with: | ||
args: >- | ||
--cache | ||
--max-cache-age 72h | ||
--verbose | ||
--no-progress | ||
--exclude-path ".git" | ||
--exclude-path "node_modules" | ||
--exclude-path "themes" | ||
--exclude-path "static/lib" | ||
--scheme "https" | ||
--scheme "http" | ||
--max-retries 6 | ||
--retry-wait-time 10 | ||
--timeout 45 | ||
--max-concurrency 4 | ||
--github-token "${{ github.token }}" | ||
'./**/*.md' | ||
'./**/*.html' | ||
fail: true | ||
format: markdown | ||
output: ./lychee-external-report.md | ||
|
||
# Add check results as PR comment | ||
- name: Create PR comment | ||
uses: peter-evans/create-or-update-comment@v3 | ||
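# always() is required: with `fail: true` the lychee step fails on broken links, and this step would otherwise be skipped | ||
if: always() && github.event_name == 'pull_request' && steps.lychee-external.outputs.exit_code != 0 | ||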
with: | ||
issue-number: ${{ github.event.pull_request.number }} | ||
body-file: ./lychee-external-report.md |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# URL patterns to exclude from checking, one regex pattern per line | ||
# These links will be ignored by lychee | ||
|
||
# Example domains | ||
^https?://example\.com | ||
^https?://example\.org | ||
|
||
# Common temporary URLs or local development URLs | ||
^https?://localhost | ||
^https?://127\.0\.0\.1 | ||
^https?://0\.0\.0\.0 | ||
|
||
# Social media links (often have anti-scraping measures that may cause checks to fail) | ||
^https?://(www\.)?linkedin\.com | ||
^https?://(www\.)?twitter\.com | ||
^https?://(www\.)?facebook\.com | ||
^https?://(www\.)?t\.co | ||
|
||
# Files that may have restricted access | ||
\.pdf$ | ||
|
||
# Local file paths that exist in production but not in CI environment | ||
file:///home/runner/work/eclipse-edc.github.io/eclipse-edc.github.io/content/en/images/edc.schematic.svg | ||
# Exclude all local SVG files as they may be processed during build | ||
file://.*\.svg$ | ||
# Exclude content directory files which may be generated during build | ||
file://.*?/content/.* | ||
|
||
# GitHub specific patterns to reduce API rate limiting | ||
# These patterns are specifically for repositories that frequently cause 429 errors | ||
^https?://github\.com/git/git/blob/ | ||
^https?://raw\.githubusercontent\.com/git/ | ||
^https?://api\.github\.com/ | ||
|
||
# Hugo site specific patterns | ||
# Relative path references - resolved after Hugo build | ||
^/en/ | ||
^/images/ | ||
^/#.* | ||
|
||
# Patterns to handle Hugo path inconsistencies | ||
# Internal file references that are correctly resolved after Hugo build | ||
^\.\.\/ | ||
^\.\/ | ||
^(/[^/]+)+/$ | ||
^(content|static|assets)/.* | ||
|
||
# Email addresses | ||
^mailto:.* | ||
|
||
# Common special protocol links | ||
^slack://.* | ||
^vscode://.* | ||
^ssh://.* | ||
^git://.* | ||
|
||
# Project specific URL patterns to exclude | ||
^https?://eclipse-edc.*\.local/ | ||
^https?://connector\.[^/]+/ | ||
^https?://api\.[^/]+/ | ||
|
||
# Add project-specific URL patterns to exclude here |
Review comment: The `.lycheeignore` file should document itself; this documentation is totally redundant. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Lychee Ignore Patterns | ||
|
||
This file (`.lycheeignore`) contains regex patterns for URLs that should be ignored by the [lychee](https://github.com/lycheeverse/lychee) link checker when running in the GitHub Actions workflows. | ||
|
||
## Purpose | ||
|
||
The ignore patterns serve several purposes: | ||
|
||
1. **Reduce false positives** - Especially for Hugo-generated sites where relative links that work in production might look broken during checks | ||
2. **Improve performance** - By excluding links that are known to be valid or not important to check | ||
3. **Avoid rate limiting** - By excluding frequent API endpoints that might trigger rate limits | ||
|
||
## Pattern Categories | ||
|
||
The patterns in `.lycheeignore` are organized into these categories: | ||
|
||
1. **Example domains** - Standard placeholder URLs | ||
2. **Development URLs** - Local development links (localhost, etc.) | ||
3. **Social media links** - Often have anti-scraping measures | ||
4. **Restricted files** - Files that may require authentication | ||
5. **Local file paths** - Paths that exist in production but not in CI | ||
6. **GitHub patterns** - To reduce API rate limiting issues | ||
7. **Hugo-specific patterns** - To handle Hugo path differences | ||
8. **Protocol-specific links** - Special protocols like mailto, slack, etc. | ||
9. **Project-specific patterns** - Custom patterns for this project | ||
|
||
## Adding New Patterns | ||
|
||
When adding new patterns: | ||
|
||
1. Be specific to avoid excluding important links | ||
2. Test patterns with the `lychee` CLI tool when possible | ||
3. Add a comment explaining non-obvious patterns | ||
4. Organize patterns in the appropriate category | ||
|
||
## Common Pattern Syntax | ||
|
||
- `^` - Start of the URL | ||
- `$` - End of the URL | ||
- `\.` - Literal dot (escaped) | ||
- `.*` - Any character, any number of times | ||
- `[^/]+` - One or more characters that are not a slash | ||
- `(pattern1|pattern2)` - Either pattern1 or pattern2 | ||
|
||
## Examples | ||
|
||
- `^https?://example\.com` - Ignores all HTTP and HTTPS URLs to example.com | ||
- `\.pdf$` - Ignores all PDF file links | ||
- `^mailto:.*` - Ignores all mailto links |
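Before committing a new pattern, it can be checked against a few sample URLs. The sketch below uses `grep -E` as a stand-in for lychee's regex engine, which is an assumption: the two engines agree on the simple constructs listed above but may differ on others. The `is_ignored` function is a hypothetical helper, not part of lychee.

```shell
#!/bin/sh

# is_ignored URL PATTERN_FILE
# Succeeds if URL matches any pattern in PATTERN_FILE.
# Blank lines and lines starting with '#' are skipped.
is_ignored() {
  url=$1
  file=$2
  while IFS= read -r pattern; do
    case $pattern in
      ''|'#'*) continue ;;
    esac
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      return 0
    fi
  done < "$file"
  return 1
}
```

For example, `is_ignored 'https://example.com/page' .lycheeignore && echo ignored` reports whether a URL would be skipped under this approximation.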
Review comment: Documentation should stay in the `docs` folder. In addition, I don't think such detail in documenting workflows will pay off: they change often, and we cannot expect people to update the documentation every time. Personally, I would discard it completely; better to add just some meaningful comments in the workflows directly.