
Wayback Machine Downloader

A powerful PHP script to download and process archived websites from the Wayback Machine. This tool helps you create static versions of archived websites, perfect for preservation, offline access, or migration to modern hosting platforms.

Features

  • Smart URL Processing: Automatically transforms dynamic URLs into static file paths
  • Resource Handling: Downloads and processes all linked resources (images, CSS, JavaScript)
  • Broken Link Management: Intelligently handles broken links and missing resources
  • CloudFlare Integration: Generates CloudFlare Functions for dynamic URL handling
  • SEO-Friendly Output: Creates clean, static HTML files with proper URL structure
  • Query Parameter Support: Handles URLs with query parameters by converting them to directory structures
  • Root Path Preservation: Maintains proper handling of root paths and index files
  • External Link Management: Adds rel="nofollow" to external links for SEO best practices
  • Static Site Generation: Creates a complete static site ready for deployment

Requirements

  • PHP 7.4 or higher
  • cURL extension
  • DOM extension
  • JSON extension

Installation

  1. Clone this repository:
git clone https://github.com/yourusername/wayback-machine-downloader.git
cd wayback-machine-downloader
  2. Ensure PHP and the required extensions are installed:
php -m | grep -E 'curl|dom|json'

Usage

Step 1: Download the Website

Use run.php to download the website from the Wayback Machine:

php run.php <domain> <date> [debug_level] [skip_existing] [max_urls] [skip_urls]

Parameters:

  • domain: The domain to download (e.g., example.com)
  • date: Target date in YYYYMMDD format
  • debug_level: error to show only errors, or info (default) to show all messages
  • skip_existing: Skip URLs that already have files (1) or download all (0, default)
  • max_urls: Maximum number of URLs to process (default: 50)
  • skip_urls: Comma-separated list of URL patterns to skip (e.g., 'parking.php,/edit/')

Example:

php run.php example.com 20200101 info 0 100 "parking.php,/edit/"

Step 2: Process the Downloaded Files

Use process.php to create a static version of the website:

php process.php <domain> [removeLinksByDomain] [keepLinksByDomain] [cleanXssSelectors]

Parameters:

  • domain: The domain that was downloaded
  • removeLinksByDomain: Optional comma-separated list of external domains whose links should be removed (converted to text)
  • keepLinksByDomain: Optional comma-separated list of external domains whose links should be kept. When specified, all other external domains will be removed (acts as a whitelist)
  • cleanXssSelectors: Optional comma-separated list of XPath selectors for elements that should be cleaned if they contain encoded HTML tags (XSS attacks)

Examples:

# Basic processing - all external links get rel="nofollow"
php process.php example.com

# Remove links from specific domains - other external links get rel="nofollow"
php process.php example.com "spbcompany.com,osdisc.com,affiliate.com"

# Keep only specific domains (remove all others)
php process.php example.com "" "youtube.com,github.com"

# Clean XSS attacks from specific elements
php process.php example.com "" "" "div[@class='margin'],span[@class='spam']"

# Combine all parameters
php process.php example.com "spam.com,ads.com" "youtube.com,github.com" "div[@class='margin']"

XSS Cleaning: The script can detect and clean XSS attacks that appear as encoded HTML tags (e.g., &lt;a href=...&gt;) within specified elements. When such content is found, the element's content is completely removed while preserving the element structure.
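
For illustration, the core of that check can be sketched in a few lines of PHP. This is a minimal sketch rather than the script's actual implementation; the file name and XPath selector are hypothetical:

<?php
// Minimal sketch: strip the content of elements that contain
// encoded HTML tags (a common sign of injected markup).
// The file name and XPath selector are hypothetical examples.
$doc = new DOMDocument();
@$doc->loadHTMLFile('public/index.html');
$xpath = new DOMXPath($doc);

foreach ($xpath->query("//div[@class='margin']") as $node) {
    // Encoded tags such as &lt;a href=...&gt; decode to literal
    // "<a href=" sequences in the node's text content.
    if (preg_match('/<\s*a\s+href/i', $node->textContent)) {
        // Remove the element's children but keep the element itself.
        while ($node->firstChild) {
            $node->removeChild($node->firstChild);
        }
    }
}
$doc->saveHTMLFile('public/index.html');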

Link Processing Priority:

  1. Links from domains in removeLinksByDomain are always removed (highest priority)
  2. If keepLinksByDomain is non-empty, only links from those domains are kept (whitelist mode)
  3. If keepLinksByDomain is empty or not specified, all other external links get rel="nofollow"
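
As a rough sketch of those rules in PHP (the function and variable names here are illustrative, not the script's actual API):

// Illustrative sketch of the priority rules above; names are
// hypothetical, not the script's actual API.
function decideLinkAction(string $host, array $removeList, array $keepList): string
{
    if (in_array($host, $removeList, true)) {
        return 'remove';                 // 1. blocked domains are always removed
    }
    if ($keepList !== []) {              // 2. whitelist mode
        return in_array($host, $keepList, true) ? 'nofollow' : 'remove';
    }
    return 'nofollow';                   // 3. default: keep with rel="nofollow"
}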

This will:

  • Process all downloaded files
  • Create a static site structure
  • Handle broken links and resources
  • Generate CloudFlare Functions for dynamic URLs
  • Create a _redirects file for URL mapping (see the example below)
  • Add rel="nofollow" to external links that should be kept
  • Remove links from blocked domains and domains not in the keep list
  • Clean XSS attacks from specified elements
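
The _redirects file uses the CloudFlare Pages redirect format, where each line pairs a source path with a destination and a status code. The entries below are hypothetical examples:

/page.php /page/ 301
/old-article.html /articles/old-article/ 301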

Output Structure

The processed website will be available in the processed/<domain> directory:

processed/
└── example.com/
    ├── public/           # Static files
    │   ├── index.html
    │   ├── _redirects    # URL mapping rules
    │   └── ...
    └── functions/        # CloudFlare Functions
        └── ...

URL Transformation

The script handles various URL patterns:

  • Dynamic URLs: /page.php → /page/index.html
  • Query Parameters: /page.php?id=123 → /page/id_123/index.html
  • Root Path: / → /index.html
  • Static Files: /style.css → /style.css (unchanged)
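
A simplified PHP sketch of this mapping (the real script handles more edge cases):

// Simplified sketch of the URL-to-path mapping shown above.
function toStaticPath(string $url): string
{
    $parts = parse_url($url);
    $path  = $parts['path'] ?? '/';

    if ($path === '/') {
        return '/index.html';                        // root path
    }
    if (!preg_match('/\.php$/', $path)) {
        return $path;                                // static file, unchanged
    }
    $base = preg_replace('/\.php$/', '', $path);
    if (isset($parts['query'])) {
        // /page.php?id=123 -> /page/id_123/index.html
        $suffix = str_replace(['&', '='], ['/', '_'], $parts['query']);
        return $base . '/' . $suffix . '/index.html';
    }
    return $base . '/index.html';                    // /page.php -> /page/index.html
}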

CloudFlare Integration

For URLs with query parameters, the script generates CloudFlare Functions that:

  • Handle dynamic URL patterns
  • Maintain proper URL structure
  • Support SEO-friendly URLs
  • Preserve query parameter functionality
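
As an illustration of the idea (not the script's literal output), a generated Pages Function redirecting a query-parameter URL to its static equivalent might be emitted like this; the route, parameter, and file paths are all hypothetical:

// Illustrative sketch: emit a CloudFlare Pages Function that redirects
// a dynamic URL to its static equivalent. All paths are hypothetical.
$function = <<<'JS'
export function onRequest(context) {
    const url = new URL(context.request.url);
    const id = url.searchParams.get('id');
    if (id !== null) {
        // /page.php?id=123 -> /page/id_123/
        return Response.redirect(`${url.origin}/page/id_${id}/`, 301);
    }
    // Fall through to the static asset when no parameters are present.
    return context.env.ASSETS.fetch(context.request);
}
JS;

file_put_contents('processed/example.com/functions/page.php.js', $function);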

SEO Optimization

The processed output is optimized for search engines:

  • Clean URL structure
  • Proper HTML semantics
  • External link handling
  • Resource optimization
  • Mobile-friendly output

Common Use Cases

  • Website Preservation
  • Content Migration
  • Static Site Generation
  • Archive Access
  • Historical Research
  • Content Recovery
  • SEO Optimization
  • CloudFlare Deployment

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Keywords

wayback machine, web archive, static site generator, website preservation, content migration, archive access, historical research, content recovery, SEO optimization, CloudFlare deployment, PHP script, static site, URL transformation, broken link handling, resource management, query parameters, dynamic URLs, website backup, archive downloader, web preservation tool


Professional Web Development Services

Need help with web archiving, data extraction, or custom software development? Our team at SapientPro specializes in:

  • Custom web scraping solutions
  • Data extraction and processing
  • Large-scale data collection
  • API development and integration
  • Data analysis and visualization

Visit our website for Custom Software Development Services.
