A Python-based toolset for extracting URLs from legacy .txt and .zip files and downloading their archived versions from the Wayback Machine. It is intended to recover Psion Epoc / Sibo software referenced on old CD-ROMs. The project was started to expand the Psion Software Index: https://github.com/scienceapps/Psion-the-lost-archive
- Scans directories for .txt files and .zip archives containing text.
- Extracts all valid URLs using regular expressions.
- Downloads archived versions of URLs from the Wayback Machine (1997–2005).
- Cleans and formats URLs for compatibility with wayback_machine_downloader.
- Organizes downloads by domain and timestamp.
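
The extraction and cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the code from 01-extract_urls.py: the pattern `URL_PATTERN` and the function `extract_urls` are hypothetical names, and the cleaning rules (trimming trailing punctuation, adding `http://` to bare `www.` hosts) are assumptions about what "cleaning" might involve.

```python
import re

# Hypothetical pattern: matches http(s):// URLs and bare "www." hosts.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s<>"'()]+""", re.IGNORECASE)


def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop trailing punctuation that usually belongs to the sentence.
        url = match.rstrip(".,;:!?")
        # Give bare "www." hosts an explicit scheme (an assumed cleaning rule).
        if url.lower().startswith("www."):
            url = "http://" + url
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Calling `extract_urls` on the contents of a readme file returns a list that can be written one URL per line to the output file.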
python 01-extract_urls.py
This scans the specified directory for .txt files, including .txt files inside .zip archives, and writes the list of extracted URLs to a .txt file.
The sample file palmtops_url.txt was generated by placing the unpacked ISO contents of all Team Palmtops magazine CD-ROMs (issues 01 to 36) in the input directory: https://archive.org/search?query=team+palmtops
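
The directory scan described above might look roughly like the sketch below. The function name `iter_text_files` is hypothetical, and decoding with latin-1 is an assumption chosen because it never fails on arbitrary bytes from old files. The actual script may handle encodings and damaged archives differently.

```python
import os
import zipfile


def iter_text_files(root):
    """Yield (label, text) for every .txt file under root,
    including .txt members found inside .zip archives."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            lower = name.lower()
            if lower.endswith(".txt"):
                with open(path, "r", encoding="latin-1") as handle:
                    yield path, handle.read()
            elif lower.endswith(".zip"):
                try:
                    with zipfile.ZipFile(path) as archive:
                        for member in archive.namelist():
                            if member.lower().endswith(".txt"):
                                data = archive.read(member)
                                yield f"{path}!{member}", data.decode("latin-1")
                except zipfile.BadZipFile:
                    # Assumed policy: skip archives that cannot be opened.
                    continue
```

Each yielded text can then be passed to a URL extraction step, with the combined results written to the output file.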
python 02-dlfromwayback.py
This reads the list of URLs and downloads their archived versions from the Wayback Machine.
Here is a sample output produced from the Team Palmtops magazine URL list above: https://archive.org/details/backups_202508
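
As a rough sketch of the first half of this step, the snippet below reads a URL list and derives a per-domain output folder for each entry. The names `read_url_list` and `domain_directory`, the `downloads` base folder, and the `unknown` fallback are all hypothetical; the real script may organize its output differently.

```python
import os
from urllib.parse import urlparse


def read_url_list(path):
    """Return the non-empty, de-duplicated lines of a URL list file."""
    urls = []
    seen = set()
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def domain_directory(url, base_dir="downloads"):
    """Map a URL to a per-domain folder such as downloads/www.example.com."""
    host = urlparse(url).netloc.lower()
    # Strip a port number if one is present.
    host = host.split(":")[0]
    # URLs without a scheme have no netloc; group them under a fallback name.
    return os.path.join(base_dir, host or "unknown")
```

Grouping by host keeps files from different sites apart even when they share file names such as index.htm.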
You can modify the following paths and parameters directly in the scripts:
- input_directory: Folder to scan for .txt and .zip files.
- output_urls_file: Destination file for extracted URLs.
- url_file_list: Input file for archived downloads.
- --from / --to: Time range for Wayback Machine snapshots.
- --only: Regex filter for specific file types (e.g., .zip).
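
To illustrate how these parameters could map onto a wayback_machine_downloader call, here is a minimal sketch. The helper names `build_command` and `download` are hypothetical, and the default timestamps are assumed values chosen to match the 1997 to 2005 range mentioned earlier. The `--directory`, `--from`, `--to`, and `--only` flags are options of the gem's command-line tool.

```python
import subprocess


def build_command(url, directory, from_ts="19970101", to_ts="20051231", only=None):
    """Build the argument list for one wayback_machine_downloader run."""
    command = [
        "wayback_machine_downloader", url,
        "--directory", directory,
        "--from", from_ts,
        "--to", to_ts,
    ]
    if only:
        # The gem accepts a regex written between slashes, e.g. /\.zip$/i
        command += ["--only", only]
    return command


def download(url, directory, **options):
    """Run the downloader for one URL and return its exit code."""
    result = subprocess.run(build_command(url, directory, **options), check=False)
    return result.returncode
```

For example, `download("http://example.com", "downloads/example.com", only="/\\.zip$/i")` would restrict the download to archived .zip files. Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.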
- Python 3.7+
- wayback_machine_downloader Ruby gem
gem install wayback_machine_downloader