
Commit b19b22a

Merge branch 'r/1.4.0'
1 parent a9a2027 commit b19b22a

11 files changed: +644 −266 lines changed

README.md (+41 −20)
@@ -11,10 +11,6 @@ Internet-archive is a nice source for several OSINT-information. This tool is a
 
 This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
 
-## Info
-
-Linux recommended: On windows machines, the path length is limited. It can only be overcome by editing the registry. Files which exceed the path length will not be downloaded.
-
 ## Installation
 
 ### Pip
@@ -32,52 +28,77 @@ Linux recommended: On windows machines, the path length is limited. It can only
 ```pip install .```
 - in a virtual env or use `--break-system-package`
 
+## Usage infos
+
+- Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
+- If you query an explicit file (e.g. a query string `?query=this` or `login.html`), the `--explicit` argument is recommended, as a wildcard query may lead to an empty result.
+
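A quick usage sketch of the explicit-file case described above (assuming the installed CLI entry point is `waybackup`; the URL is illustrative):

```waybackup -u http://example.com/login.html -c -e```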
 
 ## Arguments
 
 - `-h`, `--help`: Show the help message and exit.
 - `-a`, `--about`: Show information about the tool and exit.
 
 ### Required
 
-- `-u`, `--url`: The URL of the web page to download. This argument is required.
+- **`-u`**, **`--url`**:<br>
+The URL of the web page to download. This argument is required.
 
 #### Mode Selection (Choose One)
-- `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
-- `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
-- `-s`, `--save`: Save a page to the Wayback Machine. (beta)
+- **`-c`**, **`--current`**:<br>
+Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state, because new and old versions are mixed).
+- **`-f`**, **`--full`**:<br>
+Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
+- **`-s`**, **`--save`**:<br>
+Save a page to the Wayback Machine. (beta)
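Illustrative invocations of the two download modes (hypothetical URL; `waybackup` entry point assumed):

```waybackup -u http://example.com -c``` (latest version of every file)

```waybackup -u http://example.com -f``` (one folder per snapshot timestamp)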
 
 ### Optional query parameters
 
-- `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
-- `-e`, `--explicit`: Only download the explicit given url. No wildcard subdomains or paths. Use e.g. to get root-only snapshots.
-- `-o`, `--output`: Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+- **`-l`**, **`--list`**:<br>
+Only print the snapshots available within the specified range. Does not download the snapshots.
+- **`-e`**, **`--explicit`**:<br>
+Only download the explicitly given URL. No wildcard subdomains or paths. Use it e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
+- **`-o`**, **`--output`**:<br>
+The folder where downloaded files will be saved. Defaults to `waybackup_snapshots` in the current directory.
 
 - **Range Selection:**<br>
-Specify the range in years or a specific timestamp either start, end or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
-(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
-- `-r`, `--range`: Specify the range in years for which to search and download snapshots.
-- `--start`: Timestamp to start searching.
-- `--end`: Timestamp to end searching.
+Specify the range in years, or a specific timestamp for start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Timestamp format: YYYYMMDDhhmmss. You can give only a year, or increase specificity by extending the timestamp from the left.<br>
+(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
+- **`-r`**, **`--range`**:<br>
+Specify the range in years for which to search and download snapshots.
+- **`--start`**:<br>
+Timestamp to start searching.
+- **`--end`**:<br>
+Timestamp to end searching.
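A sketch of range-limited queries (illustrative values, timestamp format as described above):

```waybackup -u http://example.com -f -r 5```

```waybackup -u http://example.com -f --start 20190101 --end 20200101```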
 
 ### Additional behavior manipulation
 
 - **`--csv`** `<path>`:<br>
 Path defaults to output-dir. Saves a CSV file with the JSON response for successful downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, the CSV contains the downloaded files. Named `waybackup_<sanitized_url>.csv`.
 
 - **`--skip`** `<path>`:<br>
-Path defaults to output-dir. Checks for an existing `waybackup_<domain>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
+Path defaults to output-dir. Checks an existing `waybackup_<sanitized_url>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
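A resume sketch building on `--csv` and `--skip` (hypothetical; rerunning the identical command after an interruption skips URLs already recorded in the CSV):

```waybackup -u http://example.com -f --csv --skip```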
 
 - **`--no-redirect`**:<br>
 Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
 
 - **`--verbosity`** `<level>`:<br>
 Sets the verbosity level. Options are `json` (prints the JSON response) or `progress` (shows a progress bar).
+<!-- Alternatively set verbosity level to `trace` for a very detailed output. -->
+
+- **`--log`** `<path>`:<br>
+Path defaults to output-dir. Saves a log file with the output of the tool. Named `waybackup_<sanitized_url>.log`.
+
+- **`--workers`** `<count>`:<br>
+Sets the number of simultaneous download workers. Default is 1; a safe range is about 10. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.
 
 - **`--retry`** `<attempts>`:<br>
 Specifies the number of retry attempts for failed downloads.
-
-- **`--workers`** `<count>`:<br>
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+
+- **`--delay`** `<seconds>`:<br>
+Specifies the delay between download requests in seconds. Default is no delay (0).
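A throttled-download sketch combining these options (illustrative values):

```waybackup -u http://example.com -f --workers 4 --delay 2 --retry 3```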
+
+<!-- - **`--convert-links`**:<br>
+If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
 
 **CDX Query Handling:**
 - **`--cdxbackup`** `<path>`:<br>

pywaybackup/Arguments.py (+102)

@@ -0,0 +1,102 @@
import sys
import os
import argparse

from pywaybackup.helper import url_split, sanitize_filename

from pywaybackup.__version__ import __version__

class Arguments:

    def __init__(self):

        parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
        parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid')
        parser.add_argument('-d', '--debug', action='store_true', help='debug mode (always full traceback and creates an error.log)')

        required = parser.add_argument_group('required (one exclusive)')
        required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
        exclusive_required = required.add_mutually_exclusive_group(required=True)
        exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot')
        exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps')
        exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')

        optional = parser.add_argument_group('optional query parameters')
        optional.add_argument('-l', '--list', action='store_true', help='only print snapshots (opt range in y)')
        optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicitly given url')
        optional.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
        optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
        optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
        optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')

        special = parser.add_argument_group('manipulate behavior')
        special.add_argument('--csv', type=str, nargs='?', const=True, metavar='path', help='save a csv file with the json output - defaults to output folder')
        special.add_argument('--skip', type=str, nargs='?', const=True, metavar='path', help='skips existing files in the output folder by checking the .csv file - defaults to output folder')
        special.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
        special.add_argument('--verbosity', type=str, default="info", metavar="", help='["progress", "json"] for different output or ["trace"] for very detailed output')
        special.add_argument('--log', type=str, nargs='?', const=True, metavar='path', help='save a log file - defaults to output folder')
        special.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
        special.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
        # special.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
        special.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')

        cdx = parser.add_argument_group('cdx (one exclusive)')
        exclusive_cdx = cdx.add_mutually_exclusive_group()
        exclusive_cdx.add_argument('--cdxbackup', type=str, nargs='?', const=True, metavar='path', help='save the cdx query-result to a file for recurrent use - defaults to output folder')
        exclusive_cdx.add_argument('--cdxinject', type=str, nargs='?', const=True, metavar='path', help='inject a cdx backup-file to download according to the given url')

        auto = parser.add_argument_group('auto')
        auto.add_argument('--auto', action='store_true', help='includes automatic csv, skip and cdxbackup/cdxinject to resume a stopped download')

        args = parser.parse_args(args=None if sys.argv[1:] else ['--help'])  # if no arguments are given, print help

        # if args.convert_links and not args.current:
        #     parser.error("--convert-links can only be used with the -c/--current option")

        self.args = args

    def get_args(self):
        return self.args

class Configuration:

    @classmethod
    def init(cls):

        cls.args = Arguments().get_args()
        for key, value in vars(cls.args).items():
            setattr(Configuration, key, value)

        # args are now attributes of Configuration // Configuration.output, ...
        cls.command = ' '.join(sys.argv[1:])
        cls.domain, cls.subdir, cls.filename = url_split(cls.url)

        if cls.output is None:
            cls.output = os.path.join(os.getcwd(), "waybackup_snapshots")
        os.makedirs(cls.output, exist_ok=True)

        if cls.log is True:
            cls.log = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.log")

        if cls.full:
            cls.mode = "full"
        if cls.current:
            cls.mode = "current"

        if cls.auto:
            cls.skip = cls.output
            cls.csv = cls.output
            cls.cdxbackup = cls.output
            cls.cdxinject = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx")
        else:
            if cls.skip is True:
                cls.skip = cls.output
            if cls.csv is True:
                cls.csv = cls.output
            if cls.cdxbackup is True:
                cls.cdxbackup = cls.output
            if cls.cdxinject is True:
                cls.cdxinject = cls.output
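A minimal sketch of how the resulting `Configuration` is consumed elsewhere in the package (hypothetical caller; after `init()` every CLI argument is exposed as a class attribute):

```python
from pywaybackup.Arguments import Configuration as config

config.init()  # parses sys.argv once and copies each argument onto the class
print(config.mode, config.domain, config.output)  # e.g. "full example.com /.../waybackup_snapshots"
```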

pywaybackup/Converter.py (+182)

@@ -0,0 +1,182 @@
import os
import errno
import magic
from pywaybackup.helper import url_split

from pywaybackup.Arguments import Configuration as config
from pywaybackup.Verbosity import Verbosity as vb
import re

class Converter:

    @classmethod
    def define_root_steps(cls, filepath) -> str:
        """
        Define the steps (../) to the root directory.
        """
        abs_path = os.path.abspath(filepath)
        webroot_path = os.path.abspath(f"{config.output}/{config.domain}/")  # webroot is the domain folder in the output
        # common path between the two
        common_path = os.path.commonpath([abs_path, webroot_path])
        # steps up to the common path
        rel_path_from_common = os.path.relpath(abs_path, common_path)
        steps_up = rel_path_from_common.count(os.path.sep)
        if steps_up <= 1:  # if the file is in the root of the domain
            return "./"
        return "../" * steps_up
    @classmethod
    def links(cls, filepath, status_message=None):
        """
        Convert all links in an HTML / CSS / JS file to local paths.
        """

        def extract_urls(content) -> list:
            """
            Extract all links from a file.
            """

            #content = re.sub(r'\s+', '', content)
            #content = re.sub(r'\n', '', content)

            html_types = ["src", "href", "poster", "data-src"]
            css_types = ["url"]
            links = []
            for html_type in html_types:
                # possible formats of the value: "url", 'url', url
                matches = re.findall(f"{html_type}=[\"']?([^\"'>]+)", content)
                links += matches
            for css_type in css_types:
                # possible formats of the value: url(url) url('url') url("url") // ends with )
                matches = re.findall(rf"{css_type}\((['\"]?)([^'\"\)]+)\1\)", content)
                links += [match[1] for match in matches]
            links = list(set(links))
            return links
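            # Illustrative matches: src="app.js", href='/style.css', url(img/bg.png)
            #   -> ["app.js", "/style.css", "img/bg.png"] (set-deduplicated, order not preserved)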
        def local_url(original_url, domain, count) -> str:
            """
            Convert a given url to a local path.
            """
            original_url_domain = url_split(original_url)[0]

            # check if the url is external or internal (external is returned as is because no need to convert)
            external = False
            if original_url_domain != domain:
                if "://" in original_url:
                    external = True
                if original_url.startswith("//"):
                    external = True
            if external:
                status_message.trace(status="", type=f"{count}/{len(links)}", message="External url")
                return original_url

            # convert the url to a relative path to the local root (download dir) if it's a valid path, else return the original url
            original_url_file = os.path.join(config.output, config.domain, normalize_url(original_url))
            if validate_path(original_url_file):
                if original_url.startswith("/"):  # if only starts with /
                    original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('/')}"
                if original_url.startswith(".//"):
                    original_url = f"{cls.define_root_steps(filepath)}{original_url.lstrip('./')}"
                if original_url_domain == domain:  # if url is like https://domain.com/path/to/file
                    original_url = f"{cls.define_root_steps(filepath)}{original_url.split(domain)[1].lstrip('/')}"
                if original_url.startswith("../"):  # if file is already ../ check if it's not too many steps up
                    original_url = f"{cls.define_root_steps(filepath)}{original_url.split('../')[-1].lstrip('/')}"
            else:
                status_message.trace(status="", type="", message=f"{count}/{len(links)}: URL is not a valid path")

            return original_url

        def normalize_url(url) -> str:
            """
            Normalize a given url by removing its protocol, domain and parent-directory references.

            Example 1:
            - input: https://domain.com/path/to/file
            - output: /path/to/file

            Example 2:
            - input: ../path/to/file
            - output: /path/to/file
            """
            try:
                url = "/" + url.split("../")[-1]
            except IndexError:
                pass
            if url.startswith("//"):
                url = "/" + url.split("//")[1]
            parsed_url = url_split(url)
            return f"{parsed_url[1]}/{parsed_url[2]}"

        def is_pathname_valid(pathname: str) -> bool:
            """
            Check if a given pathname is valid.
            """
            if not isinstance(pathname, str) or not pathname:
                return False

            try:
                os.lstat(pathname)
            except OSError as exc:
                if exc.errno == errno.ENOENT:
                    return True
                elif exc.errno in {errno.ENAMETOOLONG, errno.ERANGE}:
                    return False
            return True

        def is_path_creatable(pathname: str) -> bool:
            """
            Check if a given path is creatable.
            """
            dirname = os.path.dirname(pathname) or os.getcwd()
            return os.access(dirname, os.W_OK)

        def is_path_exists_or_creatable(pathname: str) -> bool:
            """
            Check if a given path exists or is creatable.
            """
            return is_pathname_valid(pathname) or is_path_creatable(pathname)

        def validate_path(filepath: str) -> bool:
            """
            Validate if a given path can exist.
            """
            return is_path_exists_or_creatable(filepath)

        if os.path.isfile(filepath):
            if magic.from_file(filepath, mime=True).split("/")[1] == "javascript":
                status_message.trace(status="Error", type="", message="JS-file is not supported")
                return
            try:
                with open(filepath, "r") as file:
                    domain = config.domain
                    content = file.read()
                    links = extract_urls(content)
                    status_message.store(message=f"\n-----> Convert: [{len(links)}] links in file")
                    count = 1
                    for original_link in links:
                        status_message.trace(status="ORIG", type=f"{count}/{len(links)}", message=original_link)
                        new_link = local_url(original_link, domain, count)
                        if new_link != original_link:
                            status_message.trace(status="CONV", type=f"{count}/{len(links)}", message=new_link)
                            content = content.replace(original_link, new_link)
                        count += 1
                # write the converted content back (context manager instead of open/close)
                with open(filepath, "w") as file:
                    file.write(content)
            except UnicodeDecodeError:
                status_message.trace(status="Error", type="", message="Could not decode file to convert links")