Release 2.0.0 · bitdruid/python-wayback-machine-downloader

Last version with dictionary-logic and no db-functionality: v1.5.7

This release makes old queries incompatible

Full Changelog: 1.5.0...2.0.0

Removed the dictionary and replaced it with sqlite. This was prompted by #20, where I realised that, due to the amount of information this tool processes and provides, a very large query would crash because the system ran out of memory.
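To illustrate why moving from a dictionary to sqlite lowers memory use, here is a minimal sketch, not the tool's actual schema or code. The table name `snapshot`, its columns, and the helper names `open_db` and `insert_snapshots` are assumptions made for this example. The point is that rows are handed to sqlite from an iterator and end up on disk, so the full result set never has to sit in a Python dict.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical row shape for this sketch: (timestamp, original URL)
Snapshot = Tuple[str, str]


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) a job database with a single snapshot table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshot ("
        " timestamp TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " status TEXT,"  # stays NULL until the snapshot has been handled
        " PRIMARY KEY (timestamp, url))"
    )
    return conn


def insert_snapshots(conn: sqlite3.Connection, rows: Iterable[Snapshot]) -> int:
    """Insert rows lazily and return how many were actually added."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        # executemany pulls from the iterable row by row, so a generator
        # reading the cdx-file can be passed in directly.
        conn.executemany(
            "INSERT OR IGNORE INTO snapshot (timestamp, url) VALUES (?, ?)",
            rows,
        )
    return conn.total_changes - before
```

Passing a generator that reads the cdx-file line by line keeps memory roughly constant regardless of query size; duplicate rows are skipped by the primary key instead of being tracked in memory.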

Main changes from 1.5.0

removed dictionary and replaced with sqlite db to handle large jobs - making system memory less important
removed --list because you can simply cat the cdx-file
removed --json because for large jobs you have to parse the csv anyway
removed --skip because an existing job will always be resumed until it is finished (use --reset to start over)
removed --cdxbackup because it is of no use after a job is finished (use --keep to retain the files)
removed --cdxinject because an existing job will always try to inject an existing cdx-file (use --reset to start over)
removed --auto because skip functionality is now default and --cdxbackup and --cdxinject were removed
removed --csv because it is now the default output of this tool
removed --verbosity because the progressbar is now --progress and --json was removed
removed --debug because an error-log is always produced
added --keep to prevent deletion of .cdx and .db after the job finished
added --reset to reset an interrupted query instead of resuming
added --progress to show progressbars instead of cli-output
added --filetype to filter snapshots by specific filetype (.html not working for now)
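The resume-by-default and --reset behavior listed above can be pictured with a small sketch. This is an illustration under assumptions, not the tool's implementation: the `done` column and the helpers `open_job`, `pending` and `mark_done` are hypothetical names chosen for this example.

```python
import os
import sqlite3
from typing import List


def open_job(db_path: str, reset: bool = False) -> sqlite3.Connection:
    """Reopen an existing job database, or start over when reset is set."""
    if reset and os.path.exists(db_path):
        os.remove(db_path)  # discard all progress of the interrupted job
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshot ("
        " url TEXT PRIMARY KEY,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def pending(conn: sqlite3.Connection) -> List[str]:
    """Return the URLs that have not been handled yet, in a stable order."""
    rows = conn.execute("SELECT url FROM snapshot WHERE done = 0 ORDER BY url")
    return [url for (url,) in rows]


def mark_done(conn: sqlite3.Connection, url: str) -> None:
    """Record that a snapshot was handled so a later run skips it."""
    with conn:
        conn.execute("UPDATE snapshot SET done = 1 WHERE url = ?", (url,))
```

Because progress lives in the database file rather than in memory, an interrupted run only needs to reopen the file and continue with `pending(conn)`; passing `reset=True` corresponds to throwing that state away.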

Behavior changes

the cdx-results are now streamed into the cdxfile instead of written from a variable to reduce memory-load
all optional paths for commands have been removed - defaults to output dir
a query is now identified by its required and optional query-parameters. if an existing query is identified, the download will resume (with a short info-message and the last status of the query)
the manipulate behavior parameters do not affect that logic
the tool will give you a calculation of the snapshots to utilize, based on filtered/resumed/handled/skipped snapshots
there are 3 progress-bars to help you get the status of very large jobs:
- download of the cdx-results
- insertion of the cdx-results into the db
- download of the archived pages
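The first behavior change above, streaming the cdx-results into the cdxfile instead of collecting them in a variable, can be sketched as follows. This is a minimal example under assumptions: `stream_to_file` and its parameters are hypothetical names, and the source is any readable binary stream rather than the tool's real HTTP client.

```python
import shutil
from typing import BinaryIO


def stream_to_file(source: BinaryIO, cdx_path: str, chunk_size: int = 64 * 1024) -> int:
    """Copy a readable binary stream to cdx_path chunk by chunk.

    Only one chunk is held in memory at a time. Returns the number of
    bytes written.
    """
    with open(cdx_path, "wb") as out:
        shutil.copyfileobj(source, out, chunk_size)
        return out.tell()
```

A file-like HTTP response, such as the object returned by `urllib.request.urlopen`, could be passed as `source`. The byte count returned here is also the kind of value a download progress-bar could be driven from.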

Ideas for the future

merge all jobs into one db file instead of one db per query
restructure of the output dir to the logic waybackup_snapshots/<query>/domains+subdomains+queryfiles/... instead of waybackup_snapshots/domains+subdomains+queryfiles/... to split queries into exclusive folders

This is the first release with db. As there have been a lot of changes, there are bound to be some bugs. Due to my final thesis at university I have not had much time to find them all, but hopefully most of them. Do not hesitate to open an issue. Please report any bugs or improvements!
