README.md (+47 -48)
@@ -32,7 +32,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
32
32
33
33
- Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
34
34
- If you query an explicit file (e.g. a query-string `?query=this` or `login.html`), the `--explicit`-argument is recommended as a wildcard query may lead to an empty result.
35
-
- The tool will inform you if your query has an immense amount of snapshots which could consume your system memory and lead to a crash. Consider splitting your query into smaller jobs by specifying a range e.g. `--start 2023 --end 2024` or `--range 1`.
35
+
- The tool uses a sqlite database to handle snapshots. The database will only persist while the download is running.
36
36
37
37
## Arguments
38
38
@@ -54,12 +54,14 @@ This tool allows you to download content from the Wayback Machine (archive.org).
54
54
55
55
### Optional query parameters
56
56
57
-
-**`-l`**, **`--list`**:<br>
58
-
Only print the snapshots available within the specified range. Does not download the snapshots.
59
57
-**`-e`**, **`--explicit`**:<br>
60
58
Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
61
-
-**`-o`**, **`--output`**:<br>
62
-
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
59
+
60
+
-**`--filetype`**`<filetype>`:<br>
61
+
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you can't filter them.
62
+
63
+
-**`--limit`**`<count>`:<br>
64
+
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
63
65
64
66
-**Range Selection:**<br>
65
67
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
@@ -71,72 +73,68 @@ This tool allows you to download content from the Wayback Machine (archive.org).
71
73
-**`--end`**:<br>
72
74
Timestamp to end searching.
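
The left-to-right rule described under Range Selection can be sketched in Python. This is a minimal illustration, not the tool's own code: `is_valid_timestamp_prefix` is a hypothetical helper, and it assumes that specificity is added one whole field at a time (year, then month, day, hour, minute, second). It checks only the shape of the value, not whether the date actually exists.

```python
import re

# Hypothetical helper, not part of the tool.
# Accepts a left-anchored prefix of YYYYMMDDhhmmss: four digits for the year,
# followed by zero to five complete two-digit fields (MM, DD, hh, mm, ss).
_TIMESTAMP_PREFIX = re.compile(r"\d{4}(\d{2}){0,5}")


def is_valid_timestamp_prefix(value: str) -> bool:
    """Return True if value looks like YYYY, YYYYMM, ..., YYYYMMDDhhmmss."""
    return _TIMESTAMP_PREFIX.fullmatch(value) is not None
```

Under these assumptions, values such as `2023` or `20240115` pass the check, while `20231` (an incomplete month field) does not.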
73
75
74
-
### Additional behavior manipulation
75
-
76
-
-**`--csv`**`<path>`:<br>
77
-
Path defaults to output-dir. Saves a CSV file with the json-response for successfull downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, CSV contains downloaded files. Named as `waybackup_<sanitized_url>.csv`.
76
+
### Behavior manipulation
78
77
79
-
-**`--skip`**`<path>`:<br>
80
-
Path defaults to output-dir. Checks for an existing `waybackup_<sanitized_url>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
81
-
82
-
-**`--no-redirect`**:<br>
83
-
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
84
-
85
-
-**`--verbosity`**`<level>`:<br>
86
-
Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar).
87
-
<!-- Alternatively set verbosity level to `trace` for a very detailed output. -->
78
+
-**`-o`**, **`--output`**:<br>
79
+
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
88
80
89
-
-**`--log`**`<path>`:<br>
90
-
Path defaults to output-dir. Saves a log file with the output of the tool. Named as `waybackup_<sanitized_url>.log`.
81
+
<!-- - **`--verbosity`** `<level>`:<br>
82
+
Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
83
+
84
+
-**`--log`**<!-- `<path>` -->:<br>
85
+
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
86
+
87
+
-**`--progress`**:<br>
88
+
Shows a progress bar instead of the default output.
91
89
92
90
-**`--workers`**`<count>`:<br>
93
91
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
92
+
93
+
-**`--no-redirect`**:<br>
94
+
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
94
95
95
96
-**`--retry`**`<attempts>`:<br>
96
97
Specifies number of retry attempts for failed downloads.
97
98
98
99
-**`--delay`**`<seconds>`:<br>
99
100
Specifies delay between download requests in seconds. Default is no delay (0).
100
101
101
-
-**`--limit`**`<count>`:<br>
102
-
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected (with `--cdxinject` or `--auto`), the limit will have no effect.
103
-
104
102
<!-- - **`--convert-links`**:<br>
105
103
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
106
104
107
-
**CDX Query Result Handling:**
108
-
-**`--cdxbackup`**`<path>`:<br>
109
-
Path defaults to output-dir. Saves the result of CDX query as a file. Useful for later downloading snapshots and overcoming refused connections by CDX server due to too many queries. Named as `waybackup_<sanitized_url>.cdx`.
110
-
111
-
-**`--cdxinject`**`<filepath>`:<br>
112
-
Injects a CDX query file to download snapshots. Ensure the query matches the previous `--url` for correct folder structure.
105
+
## Special:
106
+
107
+
-**`--reset`**:
108
+
If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
113
109
114
-
**Auto:**
115
-
-**`--auto`**:<br>
116
-
If set, csv, skip and cdxbackup/cdxinject are handled automatically. Keep the files and folders as they are. Otherwise they will not be recognized when restarting a download.
110
+
-**`--keep`**:
111
+
If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
117
112
118
113
### Examples
119
114
120
-
Download latest snapshot of all files:<br>
115
+
Download the latest snapshot of all available files:<br>
Each snapshot is stored with the following keys/values. These are either stored in a sqlite database while the download is running or saved into a CSV file after the download is finished.
180
179
181
180
For download queries:
182
181
@@ -212,14 +211,14 @@ For list queries:
212
211
]
213
212
```
214
213
215
-
## CSV Output
216
-
217
-
The csv contains the json response in a table format.
218
-
219
214
### Debugging
220
215
221
216
Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
222
217
218
+
### Known ToDos
219
+
220
+
-[ ] currently there is no logic to handle if both a http and https version of a page is available
221
+
223
222
## Contributing
224
223
225
224
I'm always happy for some feature requests to improve the usability of this tool.
special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder')
34
-
special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder')
35
-
special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
36
-
special.add_argument('--verbosity', type=str, default="info", metavar="", help='["progress", "json"] for different output or ["trace"] for very detailed output')
37
-
special.add_argument('--log', type=str, nargs='?', const=True, metavar='path', help='save a log file - defaults to output folder')
exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='Save the cdx query-result to a file for recurent use - defaults to output folder')
47
-
exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='Inject a cdx backup-file to download according to the given url')
48
-
49
-
auto=parser.add_argument_group('auto')
50
-
auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download')
29
+
optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
30
+
optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
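
How `argparse` treats the two added declarations can be checked in isolation. The sketch below is an assumption-laden illustration: it builds a stand-alone parser (the tool's real parser setup is not shown in this diff), reuses the two `add_argument` calls from the added lines, and splits `--filetype` on commas as the README describes, although the tool's own splitting code is not visible here.

```python
import argparse

# Stand-alone parser for illustration only; the group name "optional"
# mirrors the variable used in the diff.
parser = argparse.ArgumentParser()
optional = parser.add_argument_group("optional")
optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')

# Three ways --limit can appear on the command line.
absent = parser.parse_args([])                # flag not given
bare = parser.parse_args(["--limit"])         # flag given without a value
valued = parser.parse_args(["--limit", "5"])  # flag given with a value

# Assumed handling of the comma-separated filetype list.
filetypes = parser.parse_args(["--filetype", "jpg,css,js"]).filetype.split(",")

print(absent.limit, bare.limit, valued.limit, filetypes)
```

With `nargs='?'` and `const=True`, a bare `--limit` yields `True` rather than an integer, because `argparse` applies `type` only to string values. Code that consumes `args.limit` therefore has to handle `None`, `True`, and an `int`.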