ci: Add script &CI to check dead links #45

Open: wants to merge 17 commits into base: main

53 changes: 53 additions & 0 deletions .github/workflows/README.md
@@ -0,0 +1,53 @@
# Broken Links Checking Workflows
Review comment (Member):

Documentation should stay in the docs folder.

In addition, I don't think documenting workflows in such detail will pay off: they change often, and we cannot expect people to update the documentation every time. Personally, I would discard it completely; it would be better to add just some meaningful comments directly in the workflows.


This directory contains GitHub Actions workflows for checking broken links in the Eclipse EDC documentation website.

## Workflows

### 1. `check-broken-links.yaml`

This workflow runs on pull requests to check for broken external links in the documentation. It uses the [lychee-action](https://github.com/lycheeverse/lychee-action) to find and report broken links.

Key features:
- Runs on PR open, synchronize, and reopen events
- Checks only external links (HTTP/HTTPS)
- Caches results to reduce API requests
- Posts a comment on the PR if broken links are found
- Optimized settings to reduce GitHub API rate limit issues (429 errors)

### 2. `check-broken-links-schedule.yaml`

This workflow runs on a schedule (weekly) to detect broken links and creates GitHub issues when problems are found.

Key features:
- Runs every Sunday at 00:00 UTC
- Can also be triggered manually
- Creates a GitHub issue with detailed report when broken links are found
- Uses caching to reduce API requests

## Configuration

The link checking is configured with:

1. `.lycheeignore` file at the repository root - contains patterns of URLs to exclude from checking
2. Command-line arguments in the workflows, which specify:
   - Cache settings
   - Path exclusions (e.g., `.git`, `node_modules`)
   - Only checking HTTP/HTTPS links
   - Retries and timeout settings
   - Concurrency limits to avoid rate limiting

## Troubleshooting

### Common Issues

1. **GitHub API Rate Limiting (429 errors)**
   - The workflows reduce request volume by caching results (`--cache`, `--max-cache-age 72h`), limiting concurrency (`--max-concurrency 4`), and retrying with a wait (`--max-retries 6`, `--retry-wait-time 10`)
   - If you still encounter 429 errors, consider adding the affected GitHub URL patterns to `.lycheeignore`

2. **False Positives for Hugo Site Links**
   - Relative links that work after Hugo builds the site might be reported as broken
   - These are ignored using patterns in `.lycheeignore`

3. **High Link Count**
   - For better performance, the workflows check only external (HTTP/HTTPS) links
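
The command-line arguments listed under Configuration above can be reproduced outside CI. The following is a minimal sketch, assuming the `lychee` CLI is installed locally; the helper name `with_lychee_args` is hypothetical and not part of this PR. The arguments are copied from the workflows' `args:` blocks, except `--github-token`, which is omitted here.

```shell
# Sketch: append the workflows' lychee arguments to any command.
# Assumption: `with_lychee_args` is an illustrative helper, not part of the PR.
with_lychee_args() {
  "$@" \
    --cache \
    --max-cache-age 72h \
    --verbose \
    --no-progress \
    --exclude-path .git \
    --exclude-path node_modules \
    --exclude-path themes \
    --exclude-path static/lib \
    --scheme https \
    --scheme http \
    --max-retries 6 \
    --retry-wait-time 10 \
    --timeout 45 \
    --max-concurrency 4 \
    './**/*.md' \
    './**/*.html'
}

# Dry run: print one argument per line without contacting any server.
with_lychee_args printf '%s\n'

# Real run (requires lychee on PATH and network access):
# with_lychee_args lychee
```

Passing the command as a parameter keeps the argument list in one place, so the dry run and the real run cannot drift apart.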
77 changes: 77 additions & 0 deletions .github/workflows/check-broken-links-schedule.yaml
@@ -0,0 +1,77 @@
name: Scheduled Broken Links Check

on:
  # Run on schedule

Review comment (Member): Please remove obvious comments.

  schedule:
    - cron: "0 0 * * 0" # Runs at 00:00 UTC every Sunday
  # Manual trigger
  workflow_dispatch:

jobs:
  check-links:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      issues: write # Permission needed to create issues

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # Optimize caching strategy
      - name: Restore lychee cache
        uses: actions/cache@v4
        with:
          path: .lycheecache
          key: cache-lychee-scheduled-${{ github.sha }}
          restore-keys: |
            cache-lychee-scheduled-
            cache-lychee-

      # Check external links (HTTP/HTTPS URLs only)
      - name: Check external links
        id: lychee-external
        uses: lycheeverse/lychee-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          args: >-
            --cache
            --max-cache-age 72h
            --verbose
            --no-progress
            --exclude-path ".git"
            --exclude-path "node_modules"
            --exclude-path "themes"
            --exclude-path "static/lib"
            --scheme "https"
            --scheme "http"
            --max-retries 6
            --retry-wait-time 10
            --timeout 45
            --max-concurrency 4
            --github-token "${{ github.token }}"
            './**/*.md'
            './**/*.html'
          fail: false
          format: markdown
          output: ./lychee-external-report.md

      - name: Check report content
        id: check-report
        run: |
          if [ -f ./lychee-external-report.md ] && [ -s ./lychee-external-report.md ] && grep -q "Broken links found" ./lychee-external-report.md; then
            echo "broken_links=true" >> $GITHUB_OUTPUT
          else
            echo "broken_links=false" >> $GITHUB_OUTPUT
          fi

      - name: Create issue
        if: steps.lychee-external.outputs.exit_code != 0 && steps.check-report.outputs.broken_links == 'true'
        uses: peter-evans/create-issue-from-file@v5
        with:
          title: 🔍 External Broken Links Report
          content-filepath: ./lychee-external-report.md
          labels: bug, documentation
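
The "Check report content" step in the scheduled workflow above sets `broken_links` only when the report exists, is non-empty, and contains a marker string. The sketch below restates that logic as standalone functions so it can be tried locally. The function names are hypothetical, and the marker text "Broken links found" is taken from the workflow as written; whether lychee's Markdown report actually contains that string is not verified here.

```shell
# Sketch of the report check from the scheduled workflow.
# Assumption: function names are illustrative; the marker string is copied
# from the workflow and not confirmed against real lychee output.

report_has_broken_links() {
  # $1: path to the Markdown report written by the link checker.
  [ -f "$1" ] && [ -s "$1" ] && grep -q "Broken links found" "$1"
}

broken_links_output() {
  # Prints the key=value line that the workflow appends to $GITHUB_OUTPUT.
  if report_has_broken_links "$1"; then
    echo "broken_links=true"
  else
    echo "broken_links=false"
  fi
}

# Example call with the report path used in the workflow.
broken_links_output ./lychee-external-report.md
```

A missing or empty report yields `broken_links=false`, so the "Create issue" step is skipped in both cases.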
68 changes: 68 additions & 0 deletions .github/workflows/check-broken-links.yaml
Review comment (Member):

Why do we need to keep two different workflows? They do pretty much the same thing; let's refactor them.

@@ -0,0 +1,68 @@
name: Check Broken Links

on:
  pull_request:
    types: [opened, synchronize, reopened]
  # Optional: Add scheduled checks
  # schedule:
  #   - cron: "0 0 * * 0" # Runs once every Sunday

jobs:
  check-links:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # Optimize caching strategy to reduce API requests
      - name: Restore lychee cache
        uses: actions/cache@v4
        with:
          path: .lycheecache
          key: cache-lychee-${{ github.sha }}
          restore-keys: |
            cache-lychee-${{ github.event.pull_request.base.sha }}
            cache-lychee-

      # Check external links only
      - name: Check external links
        id: lychee-external
        uses: lycheeverse/lychee-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          args: >-
            --cache
            --max-cache-age 72h
            --verbose
            --no-progress
            --exclude-path ".git"
            --exclude-path "node_modules"
            --exclude-path "themes"
            --exclude-path "static/lib"
            --scheme "https"
            --scheme "http"
            --max-retries 6
            --retry-wait-time 10
            --timeout 45
            --max-concurrency 4
            --github-token "${{ github.token }}"
            './**/*.md'
            './**/*.html'
          fail: true
          format: markdown
          output: ./lychee-external-report.md

      # Add check results as PR comment
      - name: Create PR comment
        uses: peter-evans/create-or-update-comment@v3
        if: github.event_name == 'pull_request' && steps.lychee-external.outputs.exit_code != 0
        with:
          issue-number: ${{ github.event.pull_request.number }}
          body-file: ./lychee-external-report.md
62 changes: 62 additions & 0 deletions .lycheeignore
@@ -0,0 +1,62 @@
# URL patterns to exclude from checking, one regex pattern per line
# These links will be ignored by lychee

# Example domains
^https?://example\.com
^https?://example\.org

# Common temporary URLs or local development URLs
^https?://localhost
^https?://127\.0\.0\.1
^https?://0\.0\.0\.0

# Social media links (often have anti-scraping measures that may cause checks to fail)
^https?://(www\.)?linkedin\.com
^https?://(www\.)?twitter\.com
^https?://(www\.)?facebook\.com
^https?://(www\.)?t\.co

# Files that may have restricted access
\.pdf$

# Local file paths that exist in production but not in CI environment
file:///home/runner/work/eclipse-edc.github.io/eclipse-edc.github.io/content/en/images/edc.schematic.svg
# Exclude all local SVG files as they may be processed during build
file://.*\.svg$
# Exclude content directory files which may be generated during build
file://.*?/content/.*

# GitHub specific patterns to reduce API rate limiting
# These patterns are specifically for repositories that frequently cause 429 errors
^https?://github\.com/git/git/blob/
^https?://raw\.githubusercontent\.com/git/
^https?://api\.github\.com/

# Hugo site specific patterns
# Relative path references - resolved after Hugo build
^/en/
^/images/
^/#.*

# Patterns to handle Hugo path inconsistencies
# Internal file references that are correctly resolved after Hugo build
^\.\.\/
^\.\/
^(/[^/]+)+/$
^(content|static|assets)/.*

# Email addresses
^mailto:.*

# Common special protocol links
^slack://.*
^vscode://.*
^ssh://.*
^git://.*

# Project specific URL patterns to exclude
^https?://eclipse-edc.*\.local/
^https?://connector\.[^/]+/
^https?://api\.[^/]+/

# Add project-specific URL patterns to exclude here
49 changes: 49 additions & 0 deletions .lycheeignore.md
Review comment (Member):

The `.lycheeignore` file should document itself; this documentation is totally redundant.

@@ -0,0 +1,49 @@
# Lychee Ignore Patterns

This file (`.lycheeignore`) contains regex patterns for URLs that should be ignored by the [lychee](https://github.com/lycheeverse/lychee) link checker when running in the GitHub Actions workflows.

## Purpose

The ignore patterns serve several purposes:

1. **Reduce false positives** - Especially for Hugo-generated sites where relative links that work in production might look broken during checks
2. **Improve performance** - By excluding links that are known to be valid or not important to check
3. **Avoid rate limiting** - By excluding frequently requested API endpoints that might trigger rate limits

## Pattern Categories

The patterns in `.lycheeignore` are organized into these categories:

1. **Example domains** - Standard placeholder URLs
2. **Development URLs** - Local development links (localhost, etc.)
3. **Social media links** - Often have anti-scraping measures
4. **Restricted files** - Files that may require authentication
5. **Local file paths** - Paths that exist in production but not in CI
6. **GitHub patterns** - To reduce API rate limiting issues
7. **Hugo-specific patterns** - To handle Hugo path differences
8. **Protocol-specific links** - Special protocols like mailto, slack, etc.
9. **Project-specific patterns** - Custom patterns for this project

## Adding New Patterns

When adding new patterns:

1. Be specific to avoid excluding important links
2. Test patterns with the `lychee` CLI tool when possible
3. Add a comment explaining non-obvious patterns
4. Organize patterns in the appropriate category

## Common Pattern Syntax

- `^` - Start of the URL
- `$` - End of the URL
- `\.` - Literal dot (escaped)
- `.*` - Any character, any number of times
- `[^/]+` - One or more characters that are not a slash
- `(pattern1|pattern2)` - Either pattern1 or pattern2

## Examples

- `^https?://example\.com` - Ignores all HTTP and HTTPS URLs to example.com
- `\.pdf$` - Ignores all PDF file links
- `^mailto:.*` - Ignores all mailto links
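
The example patterns above can be tried against sample URLs before they are committed. This is a minimal sketch that uses `grep -E` as a stand-in for lychee's own regex engine; the two agree for simple patterns like these but are not identical, so treat the result as a quick sanity check only. The function name `is_ignored` is hypothetical.

```shell
# Sketch: report whether a URL matches any pattern in an ignore file.
# Assumption: grep -E (POSIX extended regex) approximates lychee's matching;
# `is_ignored` is an illustrative helper, not part of lychee or this PR.

is_ignored() {
  url=$1
  ignore_file=$2
  while IFS= read -r pattern; do
    # Skip blank lines and comment lines.
    case $pattern in
      '' | '#'*) continue ;;
    esac
    if printf '%s\n' "$url" | grep -E -q -e "$pattern"; then
      return 0
    fi
  done < "$ignore_file"
  return 1
}

# Build a small pattern file from the examples above and test two URLs.
patterns=$(mktemp)
printf '%s\n' '# Example domains' '^https?://example\.com' '\.pdf$' '^mailto:.*' > "$patterns"

is_ignored 'https://example.com/docs' "$patterns" && echo "ignored"
is_ignored 'https://eclipse.org/' "$patterns" || echo "checked"
```

Blank lines are skipped explicitly because an empty pattern would match every URL.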
15 changes: 15 additions & 0 deletions README.md
@@ -39,3 +39,18 @@ To remove the produced images run:
docker compose rm
```
For more information see the [Docker Compose documentation][].

## Broken Link Checking

This repository includes GitHub Actions workflows to check for broken links:

- PR checks: `.github/workflows/check-broken-links.yaml` (runs on every PR)
- Scheduled checks: `.github/workflows/check-broken-links-schedule.yaml` (runs weekly)

Both workflows use [lychee](https://github.com/lycheeverse/lychee) to detect broken links.

### Configuration

- URL patterns to ignore are specified in `.lycheeignore`
- See `.lycheeignore.md` for documentation on the ignore patterns
- See `.github/workflows/README.md` for workflow documentation