A powerful PHP script to download and process archived websites from the Wayback Machine. This tool helps you create static versions of archived websites, perfect for preservation, offline access, or migration to modern hosting platforms.
- Smart URL Processing: Automatically transforms dynamic URLs into static file paths
- Resource Handling: Downloads and processes all linked resources (images, CSS, JavaScript)
- Broken Link Management: Intelligently handles broken links and missing resources
- CloudFlare Integration: Generates CloudFlare Functions for dynamic URL handling
- SEO-Friendly Output: Creates clean, static HTML files with proper URL structure
- Query Parameter Support: Handles URLs with query parameters by converting them to directory structures
- Root Path Preservation: Maintains proper handling of root paths and index files
- External Link Management: Adds rel="nofollow" to external links for SEO best practices
- Static Site Generation: Creates a complete static site ready for deployment
- PHP 7.4 or higher
- cURL extension
- DOM extension
- JSON extension
- Clone this repository:
git clone https://github.com/yourusername/wayback-machine-downloader.git
cd wayback-machine-downloader
- Ensure PHP and required extensions are installed:
php -m | grep -E 'curl|dom|json'
Use `run.php` to download the website from the Wayback Machine:
php run.php <domain> <date> [debug_level] [skip_existing] [max_urls] [skip_urls]
Parameters:
- `domain`: The domain to download (e.g., example.com)
- `date`: Target date in YYYYMMDD format
- `debug_level`: Show only errors (`error`) or all info (`info`, default)
- `skip_existing`: Skip URLs that already have files (`1`) or download all (`0`, default)
- `max_urls`: Maximum number of URLs to process (default: 50)
- `skip_urls`: Comma-separated list of URL patterns to skip (e.g., 'parking.php,/edit/')
Example:
php run.php example.com 20200101 info 0 100 "parking.php,/edit/"
Use `process.php` to create a static version of the website:
php process.php <domain> [removeLinksByDomain] [keepLinksByDomain] [cleanXssSelectors]
Parameters:
- `domain`: The domain that was downloaded
- `removeLinksByDomain`: Optional comma-separated list of external domains whose links should be removed (converted to plain text)
- `keepLinksByDomain`: Optional comma-separated list of external domains whose links should be kept. When specified, links from all other external domains are removed (acts as a whitelist)
- `cleanXssSelectors`: Optional comma-separated list of XPath selectors for elements that should be cleaned if they contain encoded HTML tags (XSS attacks)
Examples:
# Basic processing - all external links get rel="nofollow"
php process.php example.com
# Remove links from specific domains - other external links get rel="nofollow"
php process.php example.com "spbcompany.com,osdisc.com,affiliate.com"
# Keep only specific domains (remove all others)
php process.php example.com "" "youtube.com,github.com"
# Clean XSS attacks from specific elements
php process.php example.com "" "" "div[@class='margin'],span[@class='spam']"
# Combine all parameters
php process.php example.com "spam.com,ads.com" "youtube.com,github.com" "div[@class='margin']"
XSS Cleaning:
The script can detect and clean XSS attacks that appear as encoded HTML tags (e.g., an `<a href=...>` tag visible as text) within the specified elements. When such content is found, the element's content is removed entirely while the element itself is preserved.
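The following is a minimal sketch of that idea, assuming a DOMDocument/DOMXPath pass over each page (the function name and detection rule are illustrative, not the script's actual API):

```php
<?php
// Sketch only: empty out elements whose visible text contains what looks like
// an HTML tag. Real tags become child nodes when a page is parsed, so tag-like
// text usually means markup that was HTML-encoded in the source (injected spam).
function cleanEncodedTags(DOMDocument $doc, array $xpathSelectors): void
{
    $xpath = new DOMXPath($doc);
    foreach ($xpathSelectors as $selector) {
        // Selectors are passed as on the command line, e.g. "div[@class='margin']".
        foreach ($xpath->query('//' . $selector) as $element) {
            if (preg_match('/<\s*\w+[^>]*>/', $element->textContent)) {
                // Drop the element's content but keep the element itself.
                while ($element->firstChild) {
                    $element->removeChild($element->firstChild);
                }
            }
        }
    }
}
```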
Link Processing Priority (a simplified sketch of this logic follows the list):
- Links from domains in `removeLinksByDomain` are always removed (highest priority)
- If `keepLinksByDomain` contains actual domains, only links from those domains are kept (whitelist mode)
- If `keepLinksByDomain` is empty or not specified, all other external links get `rel="nofollow"`
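To make that precedence concrete, here is a simplified, illustrative decision function (hypothetical names; the actual logic lives in process.php):

```php
<?php
// Illustrative sketch of the precedence described above.

// Does $host belong to $domain (exact match or subdomain)?
function hostMatchesDomain(string $host, string $domain): bool
{
    return $host === $domain
        || substr($host, -strlen('.' . $domain)) === '.' . $domain;
}

// Returns 'remove' (link becomes plain text) or 'nofollow' (link is kept
// with rel="nofollow" added).
function decideExternalLink(string $host, array $removeDomains, array $keepDomains): string
{
    // 1. Domains in removeLinksByDomain always lose their links.
    foreach ($removeDomains as $domain) {
        if (hostMatchesDomain($host, $domain)) {
            return 'remove';
        }
    }
    // 2. If keepLinksByDomain is non-empty it acts as a whitelist:
    //    links from every unlisted external domain are removed.
    if ($keepDomains !== []) {
        foreach ($keepDomains as $domain) {
            if (hostMatchesDomain($host, $domain)) {
                return 'nofollow';
            }
        }
        return 'remove';
    }
    // 3. No whitelist: every remaining external link just gets rel="nofollow".
    return 'nofollow';
}
```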
This will:
- Process all downloaded files
- Create a static site structure
- Handle broken links and resources
- Generate CloudFlare Functions for dynamic URLs
- Create a `_redirects` file for URL mapping (an example rule is shown after this list)
- Add `rel="nofollow"` to external links that should be kept
- Remove links from blocked domains and domains not in the keep list
- Clean XSS attacks from specified elements
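For reference, each line of a CloudFlare Pages `_redirects` file is a `source destination status` rule. The exact rules depend on the site being processed; an entry might look like this (illustrative paths only):

```
/page.php /page/ 301
/about.php /about/ 301
```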
The processed website will be available in the `processed/<domain>` directory:
processed/
└── example.com/
├── public/ # Static files
│ ├── index.html
│ ├── _redirects # URL mapping rules
│ └── ...
└── functions/ # CloudFlare Functions
└── ...
The script handles various URL patterns (a simplified sketch of the mapping follows the list):
- Dynamic URLs: `/page.php` → `/page/index.html`
- Query Parameters: `/page.php?id=123` → `/page/id_123/index.html`
- Root Path: `/` → `/index.html`
- Static Files: `/style.css` → `/style.css` (unchanged)
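A simplified illustration of this mapping, assuming the conventions listed above (a sketch only, not the exact code from process.php):

```php
<?php
// Illustrative mapping of an archived URL path to a static file path.
function toStaticPath(string $path, string $query = ''): string
{
    // Root path becomes the site index.
    if ($path === '/' || $path === '') {
        return '/index.html';
    }
    // Non-HTML assets keep their original path.
    if (preg_match('/\.(css|js|png|jpe?g|gif|svg|ico|woff2?)$/i', $path)) {
        return $path;
    }
    // Strip a dynamic extension such as .php or .html.
    $base = preg_replace('/\.(php|html?|asp)$/i', '', rtrim($path, '/'));
    // Query parameters become an extra directory level, e.g. "id_123".
    if ($query !== '') {
        $base .= '/' . str_replace(['=', '&'], ['_', '/'], $query);
    }
    return $base . '/index.html';
}

// toStaticPath('/page.php')           => /page/index.html
// toStaticPath('/page.php', 'id=123') => /page/id_123/index.html
// toStaticPath('/')                   => /index.html
// toStaticPath('/style.css')          => /style.css
```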
For URLs with query parameters, the script generates CloudFlare Functions that:
- Handle dynamic URL patterns
- Maintain proper URL structure
- Support SEO-friendly URLs
- Preserve query parameter functionality
The processed output is optimized for search engines:
- Clean URL structure
- Proper HTML semantics
- External link handling
- Resource optimization
- Mobile-friendly output
- Website Preservation
- Content Migration
- Static Site Generation
- Archive Access
- Historical Research
- Content Recovery
- SEO Optimization
- CloudFlare Deployment
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
wayback machine, web archive, static site generator, website preservation, content migration, archive access, historical research, content recovery, SEO optimization, CloudFlare deployment, PHP script, static site, URL transformation, broken link handling, resource management, query parameters, dynamic URLs, website backup, archive downloader, web preservation tool
Need help with web archiving, data extraction, or custom software development? Our team at SapientPro specializes in:
- Custom web scraping solutions
- Data extraction and processing
- Large-scale data collection
- API development and integration
- Data analysis and visualization
Visit our website for Custom Software Development Services.