 - in a virtual env or use `--break-system-packages`

 ## notes / issues / hints
@@ -49,6 +49,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 The URL of the web page to download. This argument is required.

 #### Mode Selection (Choose One)
+
 - **`-a`**, **`--all`**:<br>
 Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - **`-l`**, **`--last`**:<br>
@@ -63,66 +64,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`-e`**, **`--explicit`**:<br>
 Only download the explicitly given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.

-- **`--filetype`** `<filetype>`:<br>
-Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you can't filter them.
-
 - **`--limit`** `<count>`:<br>
-Limits the number of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+Limits the number of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.

 - **Range Selection:**<br>
 Specify the range in years, or a specific timestamp for the start, the end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can give only a year, or increase specificity by extending the timestamp from the left.<br>
 (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
-- **`-r`**, **`--range`**:<br>
-Specify the range in years for which to search and download snapshots.
-- **`--start`**:<br>
-Timestamp to start searching.
-- **`--end`**:<br>
-Timestamp to end searching.
+
+- **`-r`**, **`--range`**:<br>
+Specify the range in years for which to search and download snapshots.
+- **`--start`**:<br>
+Timestamp to start searching.
+- **`--end`**:<br>
+Timestamp to end searching.
+
+- **Filtering:**<br>
+A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
+
+- **`--filetype`** `<filetype>`:<br>
+Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you can't filter them.
+
+- **`--statuscode`** `<statuscode>`:<br>
+Specify HTTP status codes to download. Default is all status codes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of its status code. For 404, of course, this means logged errors and corresponding entries in the csv. However, you may want a csv that includes these failed attempts.<br>
+Common status codes you may want to handle/filter:
+- `200` (OK)
+- `301` (Moved Permanently - will redirect snapshot)
+- `404` (Not Found - snapshot seems to be empty)
+- `500` (Internal Server Error - snapshot is at least for now not available)

 ### Optional

 #### Behavior Manipulation

 - **`-o`**, **`--output`**:<br>
-Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.

 - **`-m`**, **`--metadata`**<br>
-Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because SQLite's locking mechanism may cause issues with network shares.
+Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because SQLite's locking mechanism may cause issues with network shares.
+
+- **`--verbose`**:<br>
+Increase output verbosity.

 <!-- - **`--verbosity`** `<level>`:<br>
 Sets verbosity level. Options are `info` and `trace`. Default is `info`. -->

 - **`--log`** <!-- `<path>` -->:<br>
-Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.

 - **`--progress`**:<br>
-Shows a progress bar instead of the default output.
+Shows a progress bar instead of the default output.

 - **`--workers`** `<count>`:<br>
-Sets the number of simultaneous download workers. Default is 1; up to about 10 is usually safe. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.
+Sets the number of simultaneous download workers. Default is 1; up to about 10 is usually safe. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.

 - **`--no-redirect`**:<br>
-Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.

 - **`--retry`** `<attempts>`:<br>
-Specifies number of retry attempts for failed downloads.
+Specifies number of retry attempts for failed downloads.

 - **`--delay`** `<seconds>`:<br>
-Specifies delay between download requests in seconds. Default is no delay (0).
+Specifies delay between download requests in seconds. Default is no delay (0).

 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
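
The Range Selection text in the hunk above describes timestamps as left-anchored prefixes of `YYYYMMDDhhmmss`. A minimal Python sketch of how such a prefix could be validated and expanded into a lower and upper bound is given here. The function `expand_timestamp` and its padding constants are hypothetical names chosen for illustration and are not taken from pywaybackup's code.

```python
import re

# Allowed prefix lengths of YYYYMMDDhhmmss:
# year, +month, +day, +hour, +minute, +second.
VALID_LENGTHS = (4, 6, 8, 10, 12, 14)

# Padding used to complete a partial timestamp (assumption: bounds are
# compared as 14-digit strings, so the upper bound need not be a real
# calendar date, e.g. day 31 in a 30-day month).
PAD_LOW = "00000101000000"
PAD_HIGH = "99991231235959"


def expand_timestamp(partial: str) -> tuple[str, str]:
    """Return the (earliest, latest) 14-digit bounds covered by a prefix."""
    if not re.fullmatch(r"[0-9]+", partial) or len(partial) not in VALID_LENGTHS:
        raise ValueError(f"not a left-anchored YYYYMMDDhhmmss prefix: {partial!r}")
    n = len(partial)
    return partial + PAD_LOW[n:], partial + PAD_HIGH[n:]
```

For instance, `expand_timestamp("2019")` covers the whole year, while a five-digit value such as `"20191"` is rejected because it stops in the middle of the month field.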
@@ -147,14 +149,16 @@ If set, all links in the downloaded files will be converted to local links. This
 - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
 - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
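
The two bullets above say that resumption depends on finding existing `.cdx` and `.db` files and on the `URL`, `mode`, and optional query parameters matching. One way such a check could work is sketched below in Python. Everything in it is an assumption made for illustration: the names `query_id` and `can_resume`, the hashing step, and the `waybackup_<id>` file naming are hypothetical and do not describe how pywaybackup actually stores or compares its state.

```python
import hashlib
from pathlib import Path


def query_id(url: str, mode: str, params: dict[str, str]) -> str:
    """Build a stable identifier from URL, mode, and optional parameters.

    Parameters are sorted by key so that their order does not change the id.
    """
    parts = [url.strip(), mode] + [f"{key}={params[key]}" for key in sorted(params)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]


def can_resume(output_dir: Path, qid: str) -> bool:
    """Return True only if both files of an earlier, identical query exist.

    The file naming scheme used here is hypothetical.
    """
    return all(
        (output_dir / f"waybackup_{qid}{ext}").is_file() for ext in (".cdx", ".db")
    )
```

Under this sketch, changing the mode or any parameter yields a different identifier, so an earlier run is picked up only when all compared values are the same.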