Commit a759be9

Merge branch 'r/3.3.0'
Author: bitdruid@vbox
1 parent 5ba2a38, commit a759be9

9 files changed, +232 -121 lines changed

README.md

Lines changed: 67 additions & 41 deletions
Original file line number | Diff line number | Diff line change
@@ -16,16 +16,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
1616
### Pip
1717

1818
1. Install the package <br>
19-
```pip install pywaybackup```
19+
`pip install pywaybackup`
2020
2. Run the tool <br>
21-
```waybackup -h```
21+
`waybackup -h`
2222

2323
### Manual
2424

2525
1. Clone the repository <br>
26-
```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
26+
`git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
2727
2. Install <br>
28-
```pip install .```
28+
`pip install .`
2929
- in a virtual env or use `--break-system-package`
3030

3131
## notes / issues / hints
@@ -49,6 +49,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
4949
The URL of the web page to download. This argument is required.
5050

5151
#### Mode Selection (Choose One)
52+
5253
- **`-a`**, **`--all`**:<br>
5354
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
5455
- **`-l`**, **`--last`**:<br>
@@ -63,66 +64,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
6364
- **`-e`**, **`--explicit`**:<br>
6465
Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
6566

66-
- **`--filetype`** `<filetype>`:<br>
67-
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
68-
6967
- **`--limit`** `<count>`:<br>
70-
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
68+
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
7169

7270
- **Range Selection:**<br>
7371
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
7472
(year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
75-
- **`-r`**, **`--range`**:<br>
76-
Specify the range in years for which to search and download snapshots.
77-
- **`--start`**:<br>
78-
Timestamp to start searching.
79-
- **`--end`**:<br>
80-
Timestamp to end searching.
73+
74+
- **`-r`**, **`--range`**:<br>
75+
Specify the range in years for which to search and download snapshots.
76+
- **`--start`**:<br>
77+
Timestamp to start searching.
78+
- **`--end`**:<br>
79+
Timestamp to end searching.
80+
81+
- **Filtering:**<br>
82+
A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
83+
84+
- **`--filetype`** `<filetype>`:<br>
85+
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they appear in the snapshot. So if there is no explicit `html` file in the path (common practice), you can't filter for it.
86+
87+
- **`--statuscode`** `<statuscode>`:<br>
88+
Specify HTTP status codes to download. Default is all status codes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of its status code. For 404, of course, this means logged errors and corresponding entries in the csv. However, you may want a csv that includes these failed attempts.<br>
89+
Common status codes you may want to handle/filter:
90+
- `200` (OK)
91+
- `301` (Moved Permanently - will redirect snapshot)
92+
- `404` (Not Found - snapshot seems to be empty)
93+
- `500` (Internal Server Error - snapshot is at least for now not available)
8194

8295
### Optional
8396

8497
#### Behavior Manipulation
8598

8699
- **`-o`**, **`--output`**:<br>
87-
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
100+
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
88101

89102
- **`-m`**, **`--metadata`**<br>
90-
Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
103+
Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
104+
105+
- **`--verbose`**:<br>
106+
Increase output verbosity.
91107

92108
<!-- - **`--verbosity`** `<level>`:<br>
93109
Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
94110

95111
- **`--log`** <!-- `<path>` -->:<br>
96-
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
112+
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
97113

98114
- **`--progress`**:<br>
99-
Shows a progress bar instead of the default output.
115+
Shows a progress bar instead of the default output.
100116

101117
- **`--workers`** `<count>`:<br>
102-
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
118+
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
103119

104120
- **`--no-redirect`**:<br>
105-
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
121+
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
106122

107123
- **`--retry`** `<attempts>`:<br>
108-
Specifies number of retry attempts for failed downloads.
124+
Specifies number of retry attempts for failed downloads.
109125

110126
- **`--delay`** `<seconds>`:<br>
111-
Specifies delay between download requests in seconds. Default is no delay (0).
112-
113-
- **`--verbose`**:<br>
114-
Increase output verbosity.
115-
- verbose:
116-
```
117-
-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
118-
SUCCESS -> 200 OK
119-
-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
120-
-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
121-
```
122-
- non-verbose:
123-
```
124-
55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
125-
```
127+
Specifies delay between download requests in seconds. Default is no delay (0).
126128

127129
<!-- - **`--convert-links`**:<br>
128130
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
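
The partial-timestamp rule from the Range Selection text in this hunk (a year alone, or more digits added from the left of `YYYYMMDDhhmmss`) can be illustrated with a small helper. This is only a sketch: `timestamp_fields` is a hypothetical name, not part of pywaybackup, and the requirement that every component be complete (4, 6, 8, 10, 12 or 14 digits) is an assumption, not something the README states.

```python
# Hypothetical helper, not part of pywaybackup.
# Assumption: a partial timestamp must end on a component boundary.
FIELDS = [("year", 4), ("month", 2), ("day", 2),
          ("hour", 2), ("minute", 2), ("second", 2)]


def timestamp_fields(timestamp: str) -> dict:
    """Split a partial YYYYMMDDhhmmss timestamp into the fields it specifies."""
    if not timestamp.isdigit():
        raise ValueError("timestamp must contain digits only")
    result = {}
    pos = 0
    for name, width in FIELDS:
        if pos == len(timestamp):
            break  # no more digits: remaining fields are unspecified
        chunk = timestamp[pos:pos + width]
        if len(chunk) != width:
            raise ValueError(f"incomplete {name} in {timestamp!r}")
        result[name] = int(chunk)
        pos += width
    if pos != len(timestamp):
        raise ValueError("timestamp is longer than YYYYMMDDhhmmss")
    return result


if __name__ == "__main__":
    print(timestamp_fields("2019"))
    print(timestamp_fields("2019010112"))
```

The helper only checks the shape of the value; it does not verify that the month, day, or hour is a real calendar value.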
@@ -147,14 +149,16 @@ If set, all links in the downloaded files will be converted to local links. This
147149
- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
148150
- Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
149151
- Skips previously downloaded files to save time.
150-
> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
152+
> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
151153
152154
#### Resetting a Job (`--reset`)
155+
153156
- Deletes `.cdx` and `.db` files and restarts the process from scratch.
154157
- Does **not** remove already downloaded files.
155158
- `waybackup -u https://example.com -a --reset`
156159

157160
#### Keeping Job Data (`--keep`)
161+
158162
- Normally, `.cdx` and `.db` files are deleted after a successful job.
159163
- `--keep` preserves them for future re-analysis or extending the query.
160164
- `waybackup -u https://example.com -a --keep`
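
The `--reset` behavior described in this hunk (delete the `.cdx` and `.db` files, keep the downloaded files) can be sketched as below. `reset_job` and `demo` are hypothetical names for illustration only; the real tool derives the file names from `waybackup_<sanitized_url>`, and the commented-out code later in this commit also removes a `.csv` file, which this sketch leaves out.

```python
import os
import tempfile


def reset_job(metadata_dir, base_name):
    """Delete the job's .cdx and .db files; leave everything else in place.

    Hypothetical helper, not the pywaybackup implementation.
    """
    removed = []
    for ext in ("cdx", "db"):
        path = os.path.join(metadata_dir, f"{base_name}.{ext}")
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


def demo():
    """Create a throwaway job folder, reset it, and return what is left."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("job.cdx", "job.db", "index.html"):
            with open(os.path.join(tmp, name), "w"):
                pass  # create an empty placeholder file
        reset_job(tmp, "job")
        return sorted(os.listdir(tmp))


if __name__ == "__main__":
    print(demo())
```

Calling `reset_job` a second time is harmless because missing files are simply skipped.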
@@ -165,13 +169,13 @@ If set, all links in the downloaded files will be converted to local links. This
165169
## Examples
166170

167171
1. Download a specific single snapshot of all available files (starting from root):<br>
168-
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
172+
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
169173
2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
170-
`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
174+
`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
171175
3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
172-
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
176+
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
173177
4. Download all snapshots of all available files in the given range:<br>
174-
`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
178+
`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
175179

176180
<br>
177181
<br>
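
If the commands in the Examples hunk are to be launched from a script, one option is to assemble the argument list in Python. The sketch below only builds the command and does not run it; `build_command` is a hypothetical helper, and it uses only flags that appear in the examples (`-u`, `-a`, `--start`, `--end`, `--explicit`).

```python
import shlex


def build_command(url, start=None, end=None, explicit=False):
    """Return the waybackup argument list for an `-a` (all snapshots) query.

    Hypothetical helper for illustration; not shipped with pywaybackup.
    """
    args = ["waybackup", "-u", url, "-a"]
    if start is not None:
        args += ["--start", str(start)]
    if end is not None:
        args += ["--end", str(end)]
    if explicit:
        args.append("--explicit")
    return args


if __name__ == "__main__":
    cmd = build_command("https://example.com",
                        start=20210101000000, end=20210101000000,
                        explicit=True)
    # shlex.join renders the list as a shell-safe command line.
    print(shlex.join(cmd))
```

Assuming `waybackup` is installed and on the `PATH`, the list could be handed to `subprocess.run(cmd, check=True)` unchanged.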
@@ -184,7 +188,9 @@ The output path is currently structured as follows by an example for the query:<
184188
`http://example.com/subdir1/subdir2/assets/`
185189
<br><br>
186190
For the first and last version (`-f` or `-l`):
191+
187192
- Will only include all files/folders starting from your query-path.
193+
188194
```
189195
your/path/waybackup_snapshots/
190196
└── the_root_of_your_query/ (example.com/)
@@ -195,8 +201,11 @@ your/path/waybackup_snapshots/
195201
├── style.css
196202
...
197203
```
204+
198205
For all versions (`-a`):
206+
199207
- Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
208+
200209
```
201210
your/path/waybackup_snapshots/
202211
└── the_root_of_your_query/ (example.com/)
@@ -237,6 +246,23 @@ For download queries:
237246
]
238247
```
239248

249+
### Log
250+
251+
Verbose:
252+
253+
```
254+
-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
255+
SUCCESS -> 200 OK
256+
-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
257+
-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
258+
```
259+
260+
Non-verbose:
261+
262+
```
263+
55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
264+
```
265+
240266
### Debugging
241267

242268
Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
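
The URL and FILE lines in the verbose log sample above follow a visible pattern: the snapshot timestamp plus the `id_` suffix appears both in the archive URL and as a folder name. The sketch below reproduces that single sample. `snapshot_url` and `snapshot_file` are hypothetical names, and the real path logic in pywaybackup may differ (for example per download mode).

```python
import posixpath

WAYBACK_PREFIX = "https://web.archive.org/web"


def snapshot_url(timestamp: str, original_url: str) -> str:
    """Build the archive URL in the form shown in the log sample."""
    return f"{WAYBACK_PREFIX}/{timestamp}id_/{original_url}"


def snapshot_file(output_dir: str, domain: str, timestamp: str, subpath: str) -> str:
    """Build the local path in the form shown in the log sample.

    Assumption: output/<domain>/<timestamp>id_/<subpath>, inferred from
    one log line only.
    """
    return posixpath.join(output_dir, domain, f"{timestamp}id_", subpath)


if __name__ == "__main__":
    print(snapshot_url("20240225193302",
                       "https://example.com/assets/css/custom-styles.css"))
    print(snapshot_file("waybackup_snapshots", "example.com",
                        "20240225193302", "assets/css/custom-styles.css"))
```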

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -7,7 +7,7 @@ packages = ["pywaybackup"]
77

88
[project]
99
name = "pywaybackup"
10-
version = "3.2.0"
10+
version = "3.3.0"
1111
description = "Query and download archive.org as simple as possible."
1212
authors = [
1313
{ name = "bitdruid", email = "bitdruid@outlook.com" }

pywaybackup/Arguments.py

Lines changed: 58 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -3,16 +3,19 @@
33
import os
44
import argparse
55

6+
from argparse import RawTextHelpFormatter
7+
68
from importlib.metadata import version
79

810
from pywaybackup.helper import url_split, sanitize_filename
911

1012
class Arguments:
1113

1214
def __init__(self):
13-
14-
parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
15-
parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + version("pywaybackup") + ' by @bitdruid -> https://github.com/bitdruid')
15+
parser = argparse.ArgumentParser(
16+
description=f"<<< python-wayback-machine-downloader v{version('pywaybackup')} >>>\nby @bitdruid -> https://github.com/bitdruid",
17+
formatter_class=RawTextHelpFormatter,
18+
)
1619

1720
required = parser.add_argument_group('required (one exclusive)')
1821
required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
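
The switch to `RawTextHelpFormatter` in this hunk matters because the new description contains a `\n`. With argparse's default formatter, whitespace in the description is collapsed and re-wrapped, so the line break would be lost. A minimal stand-alone comparison, using a made-up description and the hypothetical helper `build_help`:

```python
import argparse

# Made-up description with an embedded newline, similar in shape to the diff.
DESCRIPTION = "<<< demo tool v0.0.0 >>>\nby @someone"


def build_help(formatter_class):
    """Return the help text of a tiny parser using the given formatter."""
    parser = argparse.ArgumentParser(
        prog="demo",
        description=DESCRIPTION,
        formatter_class=formatter_class,
    )
    return parser.format_help()


if __name__ == "__main__":
    print(build_help(argparse.RawTextHelpFormatter))  # keeps the line break
    print(build_help(argparse.HelpFormatter))         # joins both lines
```

`RawTextHelpFormatter` also keeps whitespace in each argument's `help` string, so long help texts are no longer wrapped automatically.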
@@ -27,20 +30,21 @@ def __init__(self):
2730
optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
2831
optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
2932
optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
30-
optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
3133
optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
34+
optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (js,css,...)')
35+
optional.add_argument('--statuscode', type=str, metavar="", help='statuscodes to download comma separated (200,404,...)')
3236

3337
behavior = parser.add_argument_group('manipulate behavior')
3438
behavior.add_argument('-o', '--output', type=str, metavar="", help='output for all files - defaults to current directory')
3539
behavior.add_argument('-m', '--metadata', type=str, metavar="", help='change directory for db/cdx/csv/log files')
40+
behavior.add_argument('-v', '--verbose', action='store_true', help='overwritten by progress - gives detailed output')
3641
behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
3742
behavior.add_argument('--progress', action='store_true', help='show a progress bar')
3843
behavior.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
3944
behavior.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
4045
behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
4146
# behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
4247
behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
43-
behavior.add_argument('--verbose', action='store_true', help='overwritten by progress - gives detailed output')
4448

4549
special = parser.add_argument_group('special')
4650
special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
@@ -61,6 +65,52 @@ def get_args(self):
6165
return self.args
6266

6367
class Configuration:
68+
69+
# def __init__(self):
70+
# self.args = Arguments().get_args()
71+
# for key, value in vars(self.args).items():
72+
# setattr(Configuration, key, value)
73+
74+
# self.set_config()
75+
76+
# def set_config(self):
77+
# # args now attributes of Configuration // Configuration.output, ...
78+
# self.command = ' '.join(sys.argv[1:])
79+
# self.domain, self.subdir, self.filename = url_split(self.url)
80+
81+
# if self.output is None:
82+
# self.output = os.path.join(os.getcwd(), "waybackup_snapshots")
83+
# if self.metadata is None:
84+
# self.metadata = self.output
85+
# os.makedirs(self.output, exist_ok=True) if not self.save else None
86+
# os.makedirs(self.metadata, exist_ok=True) if not self.save else None
87+
88+
# if self.all:
89+
# self.mode = "all"
90+
# if self.last:
91+
# self.mode = "last"
92+
# if self.first:
93+
# self.mode = "first"
94+
# if self.save:
95+
# self.mode = "save"
96+
97+
# if self.filetype:
98+
# self.filetype = [f.lower().strip() for f in self.filetype.split(",")]
99+
# if self.statuscode:
100+
# self.statuscode = [s.lower().strip() for s in self.statuscode.split(",")]
101+
102+
# base_path = self.metadata
103+
# base_name = f"waybackup_{sanitize_filename(self.url)}"
104+
# self.cdxfile = os.path.join(base_path, f"{base_name}.cdx")
105+
# self.dbfile = os.path.join(base_path, f"{base_name}.db")
106+
# self.csvfile = os.path.join(base_path, f"{base_name}.csv")
107+
# self.log = os.path.join(base_path, f"{base_name}.log") if self.log else None
108+
109+
# if self.reset:
110+
# os.remove(self.cdxfile) if os.path.isfile(self.cdxfile) else None
111+
# os.remove(self.dbfile) if os.path.isfile(self.dbfile) else None
112+
# os.remove(self.csvfile) if os.path.isfile(self.csvfile) else None
113+
64114

65115
@classmethod
66116
def init(cls):
@@ -90,7 +140,9 @@ def init(cls):
90140
cls.mode = "save"
91141

92142
if cls.filetype:
93-
cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]
143+
cls.filetype = [f.lower().strip() for f in cls.filetype.split(",")]
144+
if cls.statuscode:
145+
cls.statuscode = [s.lower().strip() for s in cls.statuscode.split(",")]
94146

95147
base_path = cls.metadata
96148
base_name = f"waybackup_{sanitize_filename(cls.url)}"
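
The normalization added in this hunk (split on commas, lower-case, strip whitespace) applies the same way to `--filetype` and `--statuscode`. A self-contained sketch of that step follows; `split_csv` and `parse_filters` are hypothetical names, and the parser here is a reduced stand-in for the one in `Arguments.py`.

```python
import argparse


def split_csv(value):
    """Lower-case and trim each comma-separated item, as the diff does."""
    return [item.lower().strip() for item in value.split(",")]


def parse_filters(argv):
    """Parse only the two filter options and return normalized lists.

    Reduced stand-in for illustration; not the pywaybackup parser.
    """
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--filetype", type=str)
    parser.add_argument("--statuscode", type=str)
    args = parser.parse_args(argv)
    filetype = split_csv(args.filetype) if args.filetype else None
    statuscode = split_csv(args.statuscode) if args.statuscode else None
    return filetype, statuscode


if __name__ == "__main__":
    print(parse_filters(["--filetype", "JPG, css ,js", "--statuscode", "200,301"]))
```

Status codes stay strings after this step. A trailing comma such as `200,` would produce an empty item; whether the tool filters those out is not visible in this diff.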
