2.0.0
Last version with dictionary-logic and no db-functionality: v1.5.7
This release makes old queries incompatible
Full Changelog: 1.5.0...2.0.0
Removed the use of a dictionary and replaced it with sqlite. This was in reference to #20, where I realised that, due to the amount of information this tool processes and provides, a very large query would crash because the system runs out of memory.
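To illustrate why moving from a dictionary to sqlite helps with memory, here is a minimal sketch in Python. It is not the tool's actual code: the table name `snapshot`, its columns, and the helper names are assumptions made for this example. The point is only that rows are written in batches and read back lazily, so the full result set never has to sit in memory at once.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, Tuple

# Hypothetical schema for illustration; the tool's real tables may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS snapshot (
    timestamp TEXT NOT NULL,
    url       TEXT NOT NULL,
    done      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (timestamp, url)
)
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the job database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_snapshots(conn: sqlite3.Connection,
                     rows: Iterable[Tuple[str, str]],
                     batch_size: int = 1000) -> None:
    """Insert (timestamp, url) rows batch by batch.

    Only one batch is held in memory at a time; duplicates are ignored.
    """
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        with conn:  # commits after each batch
            conn.executemany(
                "INSERT OR IGNORE INTO snapshot (timestamp, url) VALUES (?, ?)",
                batch,
            )


def pending_snapshots(conn: sqlite3.Connection) -> Iterator[Tuple[str, str]]:
    """Yield unfinished snapshots one row at a time instead of loading all."""
    cursor = conn.execute(
        "SELECT timestamp, url FROM snapshot WHERE done = 0 "
        "ORDER BY timestamp, url"
    )
    yield from cursor


def mark_done(conn: sqlite3.Connection, timestamp: str, url: str) -> None:
    """Flag a snapshot as handled so a resumed job can skip it."""
    with conn:
        conn.execute(
            "UPDATE snapshot SET done = 1 WHERE timestamp = ? AND url = ?",
            (timestamp, url),
        )
```

Because the state lives on disk, an interrupted job could reopen the same file and continue with whatever `pending_snapshots` still returns, which a plain in-memory dictionary cannot offer.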
Main changes from 1.5.0
- removed dictionary and replaced with sqlite db to handle large jobs - making system memory less important
- removed --list because you can simply cat the cdx-file
- removed --json because for large jobs you have to parse the csv anyways
- removed --skip because an existing job will always be resumed until it is finished or just in case use --reset
- removed --cdxbackup because no use after a job is finished or just in case use --keep
- removed --cdxinject because an existing job will always try to inject an existing cdxfile or just in case use --reset
- removed --auto because skip functionality is now default and --cdxbackup and --cdxinject were removed
- removed --csv because it is now the default output of this tool
- removed --verbosity because progressbar is now --progress and --json got removed
- removed --debug because error-log is always produced
- added --keep to prevent deletion of .cdx and .db after the job finished
- added --reset to reset an interrupted query instead of resuming
- added --progress to show progressbars instead of cli-output
- added --filetype to filter snapshots by specific filetype (.html not working for now)
Behavior changes
- the cdx-results are now streamed into the cdxfile instead of written from a variable to reduce memory-load
- all optional paths for commands have been removed - defaults to output dir
- a query is now identified by its required and optional query parameters. If an existing query was identified, the download will resume (with a short info-message and the last status of the query)
- the manipulate behavior parameters do not affect that logic
- the tool will give you a calculation of the snapshots to utilize, based on filtered/resumed/handled/skipped snapshots
- there are 3 progress-bars to help you track the status of very large jobs:
  - download of the cdx-results
  - insertion of the cdx-results into the db
  - download of the archived pages
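The resume logic described above (a query is identified by its required and optional query parameters, while the behavior parameters are ignored) could be sketched roughly like this in Python. This is an illustration under assumptions, not the tool's implementation: the parameter names in `BEHAVIOR_KEYS`, the hashing scheme, and the `<id>.db` / `<id>.cdx` file naming are all hypothetical.

```python
import hashlib
import json
import os
from typing import Any, Dict, Tuple

# Assumption: which parameters count as "behavior" parameters. The notes
# only say that such parameters do not affect query identification.
BEHAVIOR_KEYS = frozenset({"progress", "keep", "reset"})


def query_id(params: Dict[str, Any]) -> str:
    """Derive a stable id from the identifying parameters only."""
    identifying = {
        key: value
        for key, value in params.items()
        if key not in BEHAVIOR_KEYS and value is not None
    }
    # Sorting the keys makes the id independent of argument order.
    canonical = json.dumps(identifying, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def job_paths(params: Dict[str, Any], output_dir: str) -> Tuple[str, str]:
    """Return the (db, cdx) paths for a query (hypothetical naming)."""
    qid = query_id(params)
    return (
        os.path.join(output_dir, f"{qid}.db"),
        os.path.join(output_dir, f"{qid}.cdx"),
    )


def plan_job(params: Dict[str, Any], output_dir: str) -> str:
    """Decide whether to resume an existing job or start a new one."""
    db_path, cdx_path = job_paths(params, output_dir)
    if params.get("reset"):
        # A reset discards leftover state and starts from scratch.
        for path in (db_path, cdx_path):
            if os.path.exists(path):
                os.remove(path)
        return "new"
    return "resume" if os.path.exists(db_path) else "new"
```

With this shape, running the same query again with a different behavior option maps to the same id and therefore resumes, while changing a query parameter such as the filetype produces a different id and a fresh job.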
Ideas for the future
- merge all jobs into one db file instead of one db per query
- restructure of the output dir to the logic waybackup_snapshots/<query>/domains+subdomains+queryfiles/... instead of waybackup_snapshots/domains+subdomains+queryfiles/... to split queries into exclusive folders
This is the first release with db. As there have been a lot of changes, there are bound to be some bugs. Due to my final thesis at university I have not had much time to find them all, but hopefully I caught most of them. Do not hesitate to open an issue. In any case, please report bugs or suggest improvements!