A Python-based toolset for extracting URLs from legacy .txt and .zip files, and downloading their archived versions from the Wayback Machine, for recovering Psion Epoc / Sibo software from old CD-ROMs

scienceapps/Psion-the-lost-archive


A Python-based toolset for extracting URLs from legacy .txt and .zip files and downloading their archived versions from the Wayback Machine, in order to recover Psion Epoc / Sibo software from old CD-ROMs. This project was started to expand the Psion Software Index: https://github.com/scienceapps/Psion-the-lost-archive

Features:

  • Scans directories for .txt files and .zip archives containing text.
  • Extracts all valid URLs using regular expressions.
  • Downloads archived versions of URLs from the Wayback Machine (1997–2005).
  • Cleans and formats URLs for compatibility with wayback_machine_downloader.
  • Organizes downloads by domain and timestamp.
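The last two features can be illustrated with a minimal sketch. The README does not show how the scripts clean URLs or lay out folders, so the helper names `clean_url` and `download_dir`, the punctuation set, and the `<base>/<domain>/<timestamp>` layout are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlsplit


def clean_url(raw):
    """Trim whitespace and trailing punctuation; add http:// if no scheme is present."""
    # Hypothetical cleaning rules: the real scripts may apply different ones.
    url = raw.strip().rstrip(".,;:!?)]}>\"'")
    if "://" not in url:
        url = "http://" + url
    return url


def download_dir(base_directory, url, timestamp):
    """Build <base>/<domain>/<timestamp> for one URL (the layout is an assumption)."""
    domain = urlsplit(url).hostname or "unknown-host"
    return Path(base_directory) / domain / timestamp
```

For example, `clean_url("www.psion.com/epoc).")` yields `http://www.psion.com/epoc`, and `download_dir("out", "http://www.psion.com/epoc", "1999")` yields the path `out/www.psion.com/1999`.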

Extract URLs from legacy files

python 01-extract_urls.py

This scans the specified directory, searches every .txt file (including .txt files inside .zip archives) and writes the list of URLs it finds to a .txt file.

The sample file palmtops_url.txt was generated by placing the unpacked ISO contents of all Team Palmtops magazine CD-ROMs (issues 01 to 36) in the input directory: https://archive.org/search?query=team+palmtops
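The scan described in this section could look roughly like the sketch below. It is not the code of 01-extract_urls.py: the regular expression, the function names `urls_from_text` and `extract_urls`, and the choice of latin-1 decoding are assumptions made for illustration.

```python
import re
import zipfile
from pathlib import Path

# Hypothetical pattern: the regex used by the real script is not shown in this README.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s<>"']+""", re.IGNORECASE)


def urls_from_text(text):
    """Return the URLs found in text, with trailing punctuation trimmed."""
    return [match.rstrip(".,;:!?)]}") for match in URL_PATTERN.findall(text)]


def extract_urls(input_directory):
    """Collect unique URLs from .txt files and from .txt members of .zip files."""
    found = set()
    for path in Path(input_directory).rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix == ".txt":
            # Old CD-ROM text is rarely UTF-8; latin-1 can decode any byte sequence.
            found.update(urls_from_text(path.read_text(encoding="latin-1")))
        elif suffix == ".zip":
            try:
                with zipfile.ZipFile(path) as archive:
                    for name in archive.namelist():
                        if name.lower().endswith(".txt"):
                            data = archive.read(name)
                            found.update(urls_from_text(data.decode("latin-1")))
            except (zipfile.BadZipFile, NotImplementedError, RuntimeError):
                # Skip damaged archives, unsupported compression and encrypted members.
                continue
    return sorted(found)
```

Writing the result to the output file is then a single call such as `Path("palmtops_url.txt").write_text("\n".join(extract_urls("input")))`, where `input` stands in for the configured input directory.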

Download archived content

python 02-dlfromwayback.py

This reads the list of URLs and downloads their archived versions from the Wayback Machine.

Sample output from a run on the Team Palmtops magazine CD-ROMs: https://archive.org/details/backups_202508
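As a rough sketch of this step, the snippet below builds one wayback_machine_downloader command per URL and runs it. It is not the code of 02-dlfromwayback.py: the function names `build_command` and `download_all`, the default year range, and the one-folder-per-host layout are assumptions. The `--from`, `--to`, `--only` and `--directory` options are those of the wayback_machine_downloader gem.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlsplit


def build_command(url, directory, from_ts="1997", to_ts="2005", only=None):
    """Return the argument list for one wayback_machine_downloader call."""
    command = [
        "wayback_machine_downloader", url,
        "--from", from_ts,          # earliest snapshot timestamp (a year is accepted)
        "--to", to_ts,              # latest snapshot timestamp
        "--directory", str(directory),
    ]
    if only:
        # The gem expects Ruby-style regex notation, e.g. "/\.zip$/i".
        command += ["--only", only]
    return command


def download_all(url_file_list, base_directory, only=None):
    """Run the downloader once per non-empty line of url_file_list."""
    with open(url_file_list, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    for url in urls:
        # Assumed layout: one folder per host name under base_directory.
        host = urlsplit(url).hostname or "unknown-host"
        command = build_command(url, Path(base_directory) / host, only=only)
        # check=False: one failed URL should not stop the remaining downloads.
        subprocess.run(command, check=False)
```

Passing the command as a list rather than a shell string avoids quoting problems with URLs and with the `--only` regex.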

Configuration

You can modify the following paths and parameters directly in the scripts:

  • input_directory: Folder to scan for .txt and .zip files.
  • output_urls_file: Destination file for extracted URLs.
  • url_file_list: Input file for archived downloads.
  • --from / --to: Time range for Wayback Machine snapshots.
  • --only: Regex filter for specific file types (e.g., .zip).

Requirements

  • Python 3.7+
  • wayback_machine_downloader Ruby gem (install with: gem install wayback_machine_downloader)
