Commit c50269d

Merge branch 'r/1.0.0'
1 parent 0156def commit c50269d

5 files changed, +45 -38 lines changed

README.md

Lines changed: 20 additions & 17 deletions
@@ -2,7 +2,6 @@

 [![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
 [![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
-![Release](https://img.shields.io/badge/Release-beta-orange)
 ![Python Version](https://img.shields.io/badge/Python-3.6-blue)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

@@ -31,43 +30,41 @@ Internet-archive is a nice source for several OSINT-information. This script is

 This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

-<!-- ## Info -->
-
 ### Arguments

 - `-h`, `--help`: Show the help message and exit.
 - `-a`, `--about`: Show information about the script and exit.

 #### Required Arguments

-- `-u URL`, `--url URL`: The URL of the web page to download. This argument is required.
+- `-u`, `--url`: The URL of the web page to download. This argument is required.

 #### Mode Selection (Choose One)

-- `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files.
+- `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
 - `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - `-s`, `--save`: Save a page to the Wayback Machine. (beta)

 #### Optional Arguments

 - `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
 - `-e`, `--explicit`: Only download the explicit given url. No wildcard subdomains or paths.
-- `-o OUTPUT`, `--output OUTPUT`: The folder where downloaded files will be saved.
+- `-o`, `--output`: The folder where downloaded files will be saved.

 - **Range Selection:**<br>
 Specify the range in years or a specific timestamp either start, end or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
 (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
-- `-r RANGE`, `--range RANGE`: Specify the range in years for which to search and download snapshots.
+- `-r`, `--range`: Specify the range in years for which to search and download snapshots.
 - `--start`: Timestamp to start searching.
 - `--end`: Timestamp to end searching.

 #### Additional

-- `--csv`: Save a csv file with the list of snapshots inside the output folder.
+- `--csv`: Save a csv file with the list of snapshots inside the output folder or a specified folder. If you set `--list` the csv will contain the cdx list of snapshots. If you set either `--current` or `--full` the csv will contain the downloaded files.
 - `--no-redirect`: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp-folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (`-c`).
-- `--verbosity [LEVEL]`: Set the verbosity: json (print json response), progress (show progress bar) or standard (default).
-- `--retry [RETRY_FAILED]`: Retry failed downloads. You can specify the number of retry attempts as an integer.
-- `--worker [AMOUNT]`: The number of worker to use for downloading (simultaneous downloads). Default is 1. A safe spot is about 10 workers. Beware: Using too many worker will lead into refused connections from the Wayback Machine. Duration about 1.5 minutes.
+- `--verbosity`: Set the verbosity: json (print json response), progress (show progress bar).
+- `--retry`: Retry failed downloads. You can specify the number of retry attempts as an integer.
+- `--workers`: The number of workers to use for downloading (simultaneous downloads). Default is 1. A safe spot is about 10 workers. Beware: Using too many workers will lead into refused connections from the Wayback Machine. Duration about 1.5 minutes.

 ### Examples

@@ -77,14 +74,20 @@ Download latest snapshot of all files:<br>
 Download latest snapshot of all files with retries:<br>
 `waybackup -u http://example.com -c --retry 3`

-Download all snapshots sorted per timestamp with a specified range and follow redirects:<br>
-`waybackup -u http://example.com -f -r 5 --redirect`
+Download all snapshots sorted per timestamp with a specified range and do not follow redirects:<br>
+`waybackup -u http://example.com -f -r 5 --no-redirect`
+
+Download all snapshots sorted per timestamp with a specified range and save to a specified folder with 3 workers:<br>
+`waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --workers 3`
+
+Download all snapshots from 2020 to 12th of December 2022 with 4 workers, save a csv and show a progress bar:
+`waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --csv --verbosity progress`

-Download all snapshots sorted per timestamp with a specified range and save to a specified folder with 3 worker:<br>
-`waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --worker 3`
+Download all snapshots and output a json response:<br>
+`waybackup -u http://example.com -f --verbosity json`

-List available snapshots per timestamp without downloading:<br>
-`waybackup -u http://example.com -f -l`
+List available snapshots per timestamp without downloading and save a csv file to home folder:<br>
+`waybackup -u http://example.com -f -l --csv /home/user/Downloads`

 ## Contributing
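
Review note on the README's range selection: the text describes timestamps of the form YYYYMMDDhhmmss that may be shortened from the right (year, year+month, and so on). A minimal sketch of how such a shortened timestamp could be checked is below. `is_valid_timestamp_prefix` is a hypothetical helper for illustration, not part of pywaybackup, and it assumes that only whole fields may be dropped (lengths 4, 6, 8, 10, 12 or 14).

```python
from datetime import datetime

# strptime formats for each accepted prefix length of YYYYMMDDhhmmss
# (assumption: only complete fields may be omitted from the right).
_PREFIX_FORMATS = {
    4: "%Y",
    6: "%Y%m",
    8: "%Y%m%d",
    10: "%Y%m%d%H",
    12: "%Y%m%d%H%M",
    14: "%Y%m%d%H%M%S",
}


def is_valid_timestamp_prefix(value: str) -> bool:
    """Return True if value is a calendar-valid prefix of YYYYMMDDhhmmss."""
    fmt = _PREFIX_FORMATS.get(len(value))
    if fmt is None or not (value.isascii() and value.isdigit()):
        return False
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        # Out-of-range month, day, hour, etc.
        return False
    return True
```

A check like this could run before the query is sent, so that a value such as `201913` is rejected locally rather than producing an empty result.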

pywaybackup/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.8.1"
+__version__ = "1.0.0"

pywaybackup/archive.py

Lines changed: 10 additions & 7 deletions
@@ -107,17 +107,17 @@ def query_list(url: str, range: int, start: int, end: int, explicit: bool, mode:


 # example download: http://web.archive.org/web/20190815104545id_/https://www.google.com/
-def download_list(output, retry, no_redirect, worker):
+def download_list(output, retry, no_redirect, workers):
     """
     Download a list of urls in format: [{"timestamp": "20190815104545", "url": "https://www.google.com/"}]
     """
     if sc.count_list() == 0:
         v.write("\nNothing to download");
         return
     v.write("\nDownloading snapshots...", progress=0)
-    if worker > 1:
-        v.write(f"\n-----> Simultaneous downloads: {worker}")
-        batch_size = sc.count_list() // worker + 1
+    if workers > 1:
+        v.write(f"\n-----> Simultaneous downloads: {workers}")
+        batch_size = sc.count_list() // workers + 1
     else:
         batch_size = sc.count_list()
     sc.create_collection()
@@ -126,7 +126,7 @@ def download_list(output, retry, no_redirect, worker):
     worker = 0
     for batch in batch_list:
         worker += 1
-        thread = threading.Thread(target=download_loop, args=(batch, output, worker, retry, no_redirect))
+        thread = threading.Thread(target=download_loop, args=(batch, output, workers, retry, no_redirect))
         threads.append(thread)
         thread.start()
     for thread in threads:
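
Review note on the batching above: the batch size is computed as `count // workers + 1`, and each batch is handed to its own thread. A minimal sketch of that pattern is below. `split_into_batches` and `run_batches` are hypothetical names, and how `batch_list` is actually built is not visible in this diff, so the slicing here is an assumption. The sketch passes the per-thread counter to the handler, as the removed line did with `worker`; whether `download_loop` expects that counter or the total `workers` is also not visible here and may be worth checking.

```python
import threading


def split_into_batches(items, workers):
    """Chunk items using the batch-size rule shown in download_list."""
    if not items:
        return []
    if workers > 1:
        batch_size = len(items) // workers + 1
    else:
        batch_size = len(items)
    # Assumption: batches are consecutive slices of the snapshot list.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_batches(items, workers, handler):
    """Start one thread per batch and wait for all of them to finish."""
    threads = []
    for index, batch in enumerate(split_into_batches(items, workers), start=1):
        # index is the per-thread counter (1, 2, 3, ...), not the total.
        thread = threading.Thread(target=handler, args=(batch, index))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
```

With this rule, 10 items and 3 workers give batches of 4, 4 and 2, while 4 items and 4 workers give only two batches of 2, so fewer threads than requested workers may be started.
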
@@ -256,15 +256,18 @@ def parse_response_code(response_code: int):
         return RESPONSE_CODE_DICT[response_code]
     return "Unknown response code"

-def save_csv(csv_path: str):
+def save_csv(csv_path: str, url: str):
     """
     Write a CSV file with the list of snapshots.
     """
     import csv
+    disallowed = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']
+    for char in disallowed:
+        url = url.replace(char, '.')
     if sc.count_list() > 0:
         v.write("\nSaving CSV file...")
         os.makedirs(os.path.abspath(csv_path), exist_ok=True)
-        with open(os.path.join(csv_path, "waybackup.csv"), mode='w') as file:
+        with open(os.path.join(csv_path, f"waybackup_{url}.csv"), mode='w') as file:
             row = csv.DictWriter(file, sc.SNAPSHOT_COLLECTION[0].keys())
             row.writeheader()
             for snapshot in sc.SNAPSHOT_COLLECTION:
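
Review note on the new CSV file name: the hunk above replaces characters that are not allowed in file names with `.` before building `waybackup_{url}.csv`. The sketch below isolates that step so the resulting name can be inspected. `csv_filename` and `csv_filename_translate` are hypothetical helpers, not functions from pywaybackup. The second one shows an equivalent single-pass variant using `str.translate`.

```python
DISALLOWED = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']


def csv_filename(url: str) -> str:
    """Build the CSV file name the same way the loop in save_csv does."""
    for char in DISALLOWED:
        url = url.replace(char, '.')
    return f"waybackup_{url}.csv"


# Equivalent variant: one translation table instead of repeated replace calls.
_TABLE = str.maketrans({char: '.' for char in DISALLOWED})


def csv_filename_translate(url: str) -> str:
    return f"waybackup_{url.translate(_TABLE)}.csv"


print(csv_filename("http://example.com"))
# prints: waybackup_http...example.com.csv
```

Each disallowed character maps to its own dot, so `://` becomes three dots and long URLs with query strings produce correspondingly long file names.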

pywaybackup/arguments.py

Lines changed: 11 additions & 10 deletions
@@ -1,3 +1,4 @@
+import sys
 import argparse
 from pywaybackup.__version__ import __version__

@@ -7,7 +8,7 @@ def parse():
     parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid')

     required = parser.add_argument_group('required')
-    required.add_argument('-u', '--url', type=str, help='URL to use')
+    required.add_argument('-u', '--url', type=str, metavar="", help='URL to use')
     exclusive_required = required.add_mutually_exclusive_group(required=True)
     exclusive_required.add_argument('-c', '--current', action='store_true', help='Download the latest version of each file snapshot (opt range in y)')
     exclusive_required.add_argument('-f', '--full', action='store_true', help='Download snapshots of all timestamps (opt range in y)')
@@ -16,18 +17,18 @@ def parse():
     optional = parser.add_argument_group('optional')
     optional.add_argument('-l', '--list', action='store_true', help='Only print snapshots (opt range in y)')
     optional.add_argument('-e', '--explicit', action='store_true', help='Search only for the explicit given url')
-    optional.add_argument('-o', '--output', type=str, help='Output folder defaults to current directory')
-    optional.add_argument('-r', '--range', type=int, help='Range in years to search')
-    optional.add_argument('--start', type=int, help='Start timestamp format: YYYYMMDDhhmmss')
-    optional.add_argument('--end', type=int, help='End timestamp format: YYYYMMDDhhmmss')
+    optional.add_argument('-o', '--output', type=str, metavar="", help='Output folder defaults to current directory')
+    optional.add_argument('-r', '--range', type=int, metavar="", help='Range in years to search')
+    optional.add_argument('--start', type=int, metavar="", help='Start timestamp format: YYYYMMDDhhmmss')
+    optional.add_argument('--end', type=int, metavar="", help='End timestamp format: YYYYMMDDhhmmss')

     special = parser.add_argument_group('special')
-    special.add_argument('--csv', type=str, nargs='?', const=True, help='Save a csv file on a given path or defaults to the output folder')
+    special.add_argument('--csv', type=str, nargs='?', metavar='', help='Save a csv file on a given path or defaults to the output folder')
     special.add_argument('--no-redirect', action='store_true', help='Do not follow redirects by archive.org')
-    special.add_argument('--verbosity', type=str, default="standard", choices=["standard", "progress", "json"], help='Verbosity level')
-    special.add_argument('--retry', type=int, default=0, metavar="X-TIMES", help='Retry failed downloads (opt tries as int, else infinite)')
-    special.add_argument('--worker', type=int, default=1, metavar="AMOUNT", help='Number of worker (simultaneous downloads)')
+    special.add_argument('--verbosity', type=str, default="standard", metavar="", help='["progress", "json"] Verbosity level')
+    special.add_argument('--retry', type=int, default=0, metavar="", help='Retry failed downloads (opt tries as int, else infinite)')
+    special.add_argument('--workers', type=int, default=1, metavar="", help='Number of workers (simultaneous downloads)')

-    args = parser.parse_args()
+    args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help

     return args
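
Review note on `--csv`: the option is declared with `nargs='?'`, and this commit drops `const=True`. In argparse, when an option with `nargs='?'` is given without a value, the parsed result is the `const` value, which defaults to `None`. The sketch below shows the four cases in isolation. `build_parser` is a hypothetical stand-in with only the `--csv` option, not the project's `parse()` function.

```python
import argparse


def build_parser(const=None):
    """Parser with a single optional-value flag, mirroring the --csv option."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument('--csv', type=str, nargs='?', const=const)
    return parser


absent = build_parser().parse_args([])                      # flag not given
bare = build_parser().parse_args(['--csv'])                 # flag without a value
given = build_parser().parse_args(['--csv', '/tmp/out'])    # flag with a value
bare_with_const = build_parser(const="").parse_args(['--csv'])

print(absent.csv, bare.csv, given.csv, repr(bare_with_const.csv))
```

In this sketch a bare `--csv` and a missing `--csv` both yield `None` unless `const` is set, so a check such as `args.csv == ""` only matches when `const=""` is configured. It may be worth confirming that the `--csv` example without a path still falls back to the output folder after this change.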

pywaybackup/main.py

Lines changed: 3 additions & 3 deletions
@@ -15,7 +15,7 @@ def main():

     if args.output is None:
         args.output = os.path.join(os.getcwd(), "waybackup_snapshots")
-    if args.csv is True:
+    if args.csv == "":
         args.csv = args.output

     if args.save:
@@ -25,9 +25,9 @@ def main():
         if args.list:
             archive.print_list()
         else:
-            archive.download_list(args.output, args.retry, args.no_redirect, args.worker)
+            archive.download_list(args.output, args.retry, args.no_redirect, args.workers)
         if args.csv:
-            archive.save_csv(args.csv)
+            archive.save_csv(args.csv, args.url)
     v.close()

 if __name__ == "__main__":
