2.0.0

@bitdruid bitdruid released this 19 Oct 12:42
· 11 commits to main since this release

Last version with dictionary-logic and no db-functionality: v1.5.7

This release makes old queries incompatible

Full Changelog: 1.5.0...2.0.0

Removed the use of a dictionary and replaced it with sqlite. This was prompted by #20, where I realised that, given the amount of information this tool processes and provides, a very large query would crash because the system runs out of memory.
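
To illustrate the idea, here is a minimal sketch (not the tool's actual code) of writing snapshot records into an sqlite database in batches instead of collecting them all in a dictionary. The table name `snapshots`, its columns, and the helper `store_snapshots` are assumptions made up for this example.

```python
import sqlite3
from itertools import islice


def store_snapshots(db_path, records, batch_size=1000):
    """Insert (timestamp, url) records into sqlite in fixed-size batches.

    Only one batch is held in memory at a time, so memory use stays
    roughly constant no matter how many records the query returns.
    Returns the number of rows stored.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema: one row per snapshot, with a "done" flag
        # that a resumed job could use to skip finished downloads.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS snapshots ("
            "timestamp TEXT, url TEXT, done INTEGER DEFAULT 0, "
            "PRIMARY KEY (timestamp, url))"
        )
        it = iter(records)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            # Duplicate (timestamp, url) pairs are silently ignored.
            conn.executemany(
                "INSERT OR IGNORE INTO snapshots (timestamp, url) VALUES (?, ?)",
                batch,
            )
            conn.commit()
        return conn.execute("SELECT COUNT(*) FROM snapshots").fetchone()[0]
    finally:
        conn.close()
```

Because `records` can be any iterable, including a generator, the caller never has to build the full result set in memory before it is stored.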

Main changes from 1.5.0

  • removed dictionary and replaced with sqlite db to handle large jobs - making system memory less important
  • removed --list because you can simply cat the cdx-file
  • removed --json because for large jobs you have to parse the csv anyways
  • removed --skip because an existing job is always resumed until it is finished; use --reset if you want to start over
  • removed --cdxbackup because the cdx-file is of no use after a job is finished; use --keep if you want to retain it
  • removed --cdxinject because an existing job always tries to inject an existing cdx-file; use --reset if you want to start over
  • removed --auto because skip functionality is now default and --cdxbackup and --cdxinject were removed
  • removed --csv because it's now the default output of this tool
  • removed --verbosity because progressbar is now --progress and --json got removed
  • removed --debug because error-log is always produced
  • added --keep to prevent deletion of .cdx and .db after the job finished
  • added --reset to reset an interrupted query instead of resuming
  • added --progress to show progressbars instead of cli-output
  • added --filetype to filter snapshots by specific filetype (.html not working for now)

Behavior changes

  • the cdx-results are now streamed into the cdxfile instead of written from a variable to reduce memory-load
  • all optional paths for commands have been removed - defaults to output dir
  • a query is now identified by its required and optional query parameters. If an existing query is identified, the download resumes (with a short info-message and the last status of the query)
  • the behavior manipulation parameters do not affect that logic
  • the tool will give you a calculation of the snapshots to utilize, based on filtered/resumed/handled/skipped snapshots
  • there are 3 progress-bars to help you track the status of very large jobs:
    • download of the cdx-results
    • insertion of the cdx-results into the db
    • download of the archived pages
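
Two of the behavior changes above, identifying a query by its parameters and streaming results into a file, can be sketched roughly as follows. This is an illustration under assumptions, not the tool's implementation: the function names `query_id` and `stream_to_file`, the use of a SHA-256 hash, and the 16-character identifier length are all hypothetical.

```python
import hashlib
import json


def query_id(url, **options):
    """Derive a stable identifier from the query parameters.

    Options set to None are dropped and keys are sorted, so the same
    query yields the same identifier regardless of argument order.
    """
    params = {"url": url}
    params.update({k: v for k, v in options.items() if v is not None})
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def stream_to_file(chunks, path):
    """Write byte chunks to `path` as they arrive; return bytes written.

    Each chunk is written and then discarded, so the full response is
    never held in a single variable.
    """
    written = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
            written += len(chunk)
    return written
```

With an identifier like this, a rerun with the same parameters would map to the same cdx-file and db, which is what would make resuming possible. Parameters that only change runtime behavior would simply be left out of the hashed set.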

Ideas for the future

  • merge all jobs into one db file instead of one db per query
  • restructure of the output dir to the logic waybackup_snapshots/<query>/domains+subdomains+queryfiles/... instead of waybackup_snapshots/domains+subdomains+queryfiles/... to split queries into exclusive folders

This is the first release with a db. As there have been a lot of changes, there are bound to be some bugs. Due to my final thesis at university I have not had much time to find them all, but hopefully I caught most of them. Do not hesitate to open an issue. Please report any bugs or suggested improvements!