Commit 933f4ee

Merge branch 'r/2.0.0'
1 parent 78c6535 commit 933f4ee

9 files changed, +673 -481 lines changed

README.md

+47 -48
@@ -32,7 +32,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).

 - Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
 - If you query an explicit file (e.g. a query-string `?query=this` or `login.html`), the `--explicit`-argument is recommended as a wildcard query may lead to an empty result.
-- The tool will inform you if your query has an immense amount of snapshots which could consume your system memory and lead to a crash. Consider splitting your query into smaller jobs by specifying a range e.g. `--start 2023 --end 2024` or `--range 1`.
+- The tool uses a sqlite database to handle snapshots. The database will only persist while the download is running.

 ## Arguments

@@ -54,12 +54,14 @@ This tool allows you to download content from the Wayback Machine (archive.org).

 ### Optional query parameters

-- **`-l`**, **`--list`**:<br>
-Only print the snapshots available within the specified range. Does not download the snapshots.
 - **`-e`**, **`--explicit`**:<br>
 Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
-- **`-o`**, **`--output`**:<br>
-Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+
+- **`--filetype`** `<filetype>`:<br>
+Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+
+- **`--limit`** `<count>`:<br>
+Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.

 - **Range Selection:**<br>
 Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
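
The timestamp rule in the Range Selection text (a year, optionally extended from the left toward `YYYYMMDDhhmmss`) can be pictured with a small check. This is only a sketch: the helper name `is_partial_timestamp` is hypothetical, and the assumption that each further component (month, day, hour, minute, second) is added as a complete two-digit field is mine, not something this diff confirms about the tool.

```python
import re

# Hypothetical helper, not part of pywaybackup.
# Assumption: a valid value is a 4-digit year followed by zero to five
# complete two-digit fields (MM, DD, hh, mm, ss), read left to right.
_PARTIAL_TS = re.compile(r"[0-9]{4}([0-9]{2}){0,5}")


def is_partial_timestamp(value: str) -> bool:
    """Return True for YYYY, YYYYMM, YYYYMMDD, ... up to YYYYMMDDhhmmss."""
    return _PARTIAL_TS.fullmatch(value) is not None


if __name__ == "__main__":
    print(is_partial_timestamp("2020"))      # True: year only
    print(is_partial_timestamp("20221212"))  # True: year, month, day
    print(is_partial_timestamp("20221"))     # False: incomplete month field
```

The check only looks at the shape of the value; it does not verify that the month or day is a real calendar value.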
@@ -71,72 +73,68 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`--end`**:<br>
 Timestamp to end searching.

-### Additional behavior manipulation
-
-- **`--csv`** `<path>`:<br>
-Path defaults to output-dir. Saves a CSV file with the json-response for successfull downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, CSV contains downloaded files. Named as `waybackup_<sanitized_url>.csv`.
+### Behavior manipulation

-- **`--skip`** `<path>`:<br>
-Path defaults to output-dir. Checks for an existing `waybackup_<sanitized_url>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
-
-- **`--no-redirect`**:<br>
-Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
-
-- **`--verbosity`** `<level>`:<br>
-Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar).
-<!-- Alternatively set verbosity level to `trace` for a very detailed output. -->
+- **`-o`**, **`--output`**:<br>
+Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.

-- **`--log`** `<path>`:<br>
-Path defaults to output-dir. Saves a log file with the output of the tool. Named as `waybackup_<sanitized_url>.log`.
+<!-- - **`--verbosity`** `<level>`:<br>
+Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
+
+- **`--log`** <!-- `<path>` -->:<br>
+Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+
+- **`--progress`**:<br>
+Shows a progress bar instead of the default output.

 - **`--workers`** `<count>`:<br>
 Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+
+- **`--no-redirect`**:<br>
+Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.

 - **`--retry`** `<attempts>`:<br>
 Specifies number of retry attempts for failed downloads.

 - **`--delay`** `<seconds>`:<br>
 Specifies delay between download requests in seconds. Default is no delay (0).

-- **`--limit`** `<count>`:<br>
-Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected (with `--cdxinject` or `--auto`), the limit will have no effect.
-
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->

-**CDX Query Result Handling:**
-- **`--cdxbackup`** `<path>`:<br>
-Path defaults to output-dir. Saves the result of CDX query as a file. Useful for later downloading snapshots and overcoming refused connections by CDX server due to too many queries. Named as `waybackup_<sanitized_url>.cdx`.
-
-- **`--cdxinject`** `<filepath>`:<br>
-Injects a CDX query file to download snapshots. Ensure the query matches the previous `--url` for correct folder structure.
+## Special:
+
+- **`--reset`**:
+If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.

-**Auto:**
-- **`--auto`**:<br>
-If set, csv, skip and cdxbackup/cdxinject are handled automatically. Keep the files and folders as they are. Otherwise they will not be recognized when restarting a download.
+- **`--keep`**:
+If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.

 ### Examples

-Download latest snapshot of all files:<br>
+Download the latest snapshot of all available files:<br>
 `waybackup -u http://example.com -c`

-Download latest snapshot of a specific file:<br>
-`waybackup -u http://example.com/subdir/file.html -c`
+Download the latest snapshot of a specific file (e.g., a login page):<br>
+`waybackup -u http://example.com/login.html -c --explicit`

-Download all snapshots sorted per timestamp with a specified range and do not follow redirects:<br>
+Download all snapshots within the last 5 years and prevent redirects:<br>
 `waybackup -u http://example.com -f -r 5 --no-redirect`

-Download all snapshots sorted per timestamp with a specified range and save to a specified folder with 3 workers:<br>
+Download all snapshots from a specific range (2020 to December 12, 2022) with 4 workers, and show a progress bar:<br>
+`waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --progress`
+
+Download all snapshots and save the output in a specific folder with 3 workers:<br>
 `waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --workers 3`

-Download all snapshots from 2020 to 12th of December 2022 with 4 workers, save a csv and show a progress bar:
-`waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --csv --verbosity progress`
+Download all snapshots but only images and CSS files, filtering for specific filetypes (jpg, css):<br>
+`waybackup -u http://example.com -f --filetype jpg,css`

-Download all snapshots and output a json response:<br>
-`waybackup -u http://example.com -f --verbosity json`
+Download all timestamps but start over and ignore existing progress, log the output, and retry 3 times if any error occurs:<br>
+`waybackup -u http://example.com -f --log --retry 3 --reset`

-List available snapshots per timestamp without downloading and save a csv file to home folder:<br>
-`waybackup -u http://example.com -f -l --csv /home/user/Downloads`
+Download the latest snapshot, follow no redirects but keep the database and cdx-file:<br>
+`waybackup -u http://example.com -c --no-redirect --keep`

 ## Output path structure

@@ -175,8 +173,9 @@ your/path/waybackup_snapshots/
 ...
 ```

+## CSV Output

-### Json Response
+Each snapshot is stored with the following keys/values. These are either stored in a sqlite database while the download is running or saved into a CSV file after the download is finished.

 For download queries:

@@ -212,14 +211,14 @@ For list queries:
 ]
 ```

-## CSV Output
-
-The csv contains the json response in a table format.
-
 ### Debugging

 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).

+### Known ToDos
+
+- [ ] currently there is no logic to handle if both a http and https version of a page is available
+
 ## Contributing

 I'm always happy for some feature requests to improve the usability of this tool.

pywaybackup/Arguments.py

+30 -38
@@ -22,35 +22,34 @@ def __init__(self):
         exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')

         optional = parser.add_argument_group('optional query parameters')
-        optional.add_argument('-l', '--list', action='store_true', help='only print snapshots (opt range in y)')
         optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url')
-        optional.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
         optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
         optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
         optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
-
-        special = parser.add_argument_group('manipulate behavior')
-        special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder')
-        special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder')
-        special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
-        special.add_argument('--verbosity', type=str, default="info", metavar="", help='["progress", "json"] for different output or ["trace"] for very detailed output')
-        special.add_argument('--log', type=str, nargs='?', const=True, metavar='path', help='save a log file - defaults to output folder')
-        special.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
-        special.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
-        # special.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
-        special.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
-        special.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
-
-        cdx = parser.add_argument_group('cdx (one exclusive)')
-        exclusive_cdx = cdx.add_mutually_exclusive_group()
-        exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='Save the cdx query-result to a file for recurent use - defaults to output folder')
-        exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='Inject a cdx backup-file to download according to the given url')
-
-        auto = parser.add_argument_group('auto')
-        auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download')
+        optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
+        optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
+
+        behavior = parser.add_argument_group('manipulate behavior')
+        behavior.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
+        behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
+        behavior.add_argument('--progress', action='store_true', help='show a progress bar')
+        behavior.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
+        #behavior.add_argument('--verbosity', type=str, default="info", metavar="", help='verbosity level (info, trace)')
+        behavior.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
+        behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
+        # behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
+        behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
+
+        special = parser.add_argument_group('special')
+        special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
+        special.add_argument('--keep', action='store_true', help='keep all files after the job finished')

         args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help

+        required_args = {action.dest: getattr(args, action.dest) for action in exclusive_required._group_actions}
+        optional_args = {action.dest: getattr(args, action.dest) for action in optional._group_actions}
+        args.query_identifier = str(args.url) + str(required_args) + str(optional_args)
+
         # if args.convert_links and not args.current:
         # parser.error("--convert-links can only be used with the -c/--current option")

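
The added `query_identifier` lines build a string from the URL plus the values of the required and optional query arguments, leaving the behavior flags out. The sketch below reproduces that pattern on a reduced, hypothetical parser: `build_identifier` and the trimmed argument set are illustration only. Note that `_group_actions` is a private attribute of argparse groups, so relying on it is an implementation detail rather than documented API. How the tool consumes the identifier is not shown in this diff.

```python
import argparse


def build_identifier(argv):
    """Build a query identifier from the URL and the query-relevant arguments."""
    # Reduced stand-in for the real parser; only a few arguments are modeled.
    parser = argparse.ArgumentParser()
    parser.add_argument("-u", "--url", type=str)

    exclusive_required = parser.add_mutually_exclusive_group(required=True)
    exclusive_required.add_argument("-c", "--current", action="store_true")
    exclusive_required.add_argument("-f", "--full", action="store_true")

    optional = parser.add_argument_group("optional query parameters")
    optional.add_argument("-e", "--explicit", action="store_true")
    optional.add_argument("-r", "--range", type=int)

    behavior = parser.add_argument_group("manipulate behavior")
    behavior.add_argument("--workers", type=int, default=1)

    args = parser.parse_args(argv)

    # Same dict comprehensions as in the diff: one entry per action in the group.
    required_args = {a.dest: getattr(args, a.dest) for a in exclusive_required._group_actions}
    optional_args = {a.dest: getattr(args, a.dest) for a in optional._group_actions}
    return str(args.url) + str(required_args) + str(optional_args)


if __name__ == "__main__":
    base = ["-u", "http://example.com", "-c", "-r", "5"]
    # Behavior flags such as --workers are not part of the identifier.
    print(build_identifier(base) == build_identifier(base + ["--workers", "8"]))
    # Changing a query parameter changes the identifier.
    print(build_identifier(base) == build_identifier(["-u", "http://example.com", "-c", "-r", "3"]))
```

In this sketch, two runs that differ only in `--workers` produce the same identifier, while a different `--range` produces a different one.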
@@ -84,21 +83,14 @@ def init(cls):
         if cls.current:
             cls.mode = "current"

-        cls.cdxbackup = cls.output if cls.cdxbackup is None else cls.cdxbackup
-
-        if cls.auto:
-            cls.skip = cls.output
-            cls.csv = cls.output
-            cls.cdxbackup = cls.output
-            cls.cdxinject = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx")
-        else:
-            if cls.skip is True:
-                cls.skip = cls.output
-            if cls.csv is True:
-                cls.csv = cls.output
-            if cls.cdxbackup is True:
-                cls.cdxbackup = cls.output
-            if cls.cdxinject is True:
-                cls.cdxinject = cls.output
+        if cls.filetype:
+            cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]

+        cls.cdxfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx")
+        cls.dbfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.db")
+        cls.csvfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.csv")

+        if cls.reset:
+            os.remove(cls.cdxfile) if os.path.isfile(cls.cdxfile) else None
+            os.remove(cls.dbfile) if os.path.isfile(cls.dbfile) else None
+            os.remove(cls.csvfile) if os.path.isfile(cls.csvfile) else None
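
The added lines in this hunk normalize the `--filetype` value and, on `--reset`, delete the job's `cdx`, `db` and `csv` files if they exist. A standalone sketch of both steps follows. The function names `parse_filetypes`, `job_files` and `reset_job` are hypothetical, and the `name` parameter stands in for the result of `sanitize_filename(cls.url)`, whose behavior is not shown here.

```python
import os
import tempfile


def parse_filetypes(raw):
    """Split a comma-separated list and normalize case and whitespace."""
    return [ft.lower().strip() for ft in raw.split(",")]


def job_files(output, name):
    """Build the cdx/db/csv paths for one job inside the output folder."""
    return [os.path.join(output, f"waybackup_{name}.{ext}") for ext in ("cdx", "db", "csv")]


def reset_job(paths):
    """Delete whichever job files exist and return the paths that were removed."""
    removed = []
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


if __name__ == "__main__":
    print(parse_filetypes(" JPG, css ,Js"))  # ['jpg', 'css', 'js']
    with tempfile.TemporaryDirectory() as output:
        paths = job_files(output, "example.com")
        open(paths[0], "w").close()  # pretend a cdx file is left over from an earlier run
        print([os.path.basename(p) for p in reset_job(paths)])
```

Using an explicit loop with an `if` statement instead of the conditional-expression form in the diff does the same thing; it is only a readability preference in this sketch.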

pywaybackup/Converter.py

-1
@@ -4,7 +4,6 @@
 from pywaybackup.helper import url_split

 from pywaybackup.Arguments import Configuration as config
-from pywaybackup.Verbosity import Verbosity as vb
 import re

 class Converter:
