As before, these filters can still be used with other commands, like `stream`, or `export mirror`, etc.
50
+
51
+
`--root-*` options of `export mirror` now use the same syntax and machinery as the normal input filters.
52
+
53
+
Also, the overall filtering semantics changed a bit.
54
+
The top-level logical expression the filters compute is now a large conjunction.
55
+
I.e. the above example now compiles to, a bit simplified, `(response.method == "GET" or response.method == "DOM") and re.match(".200C", status) and (response_mime == "text/html") and re.match("\\bPotter\\b", response.body)`.
56
+
57
+
- Added a bunch of new `--output` formats.
58
+
Mostly, this adds a bunch of output formats that refer to `stime`s.
59
+
Mainly, to simplify `export mirror --all` usage, described below.
60
+
61
+
- `export mirror`:
62
+
63
+
- Implemented exporting of different URL visits.
64
+
65
+
I.e., you can now export not just the `--latest` visit to each URL, but also the `--oldest` one, the one `--nearest` to a given date, or `--all` of them.
66
+
67
+
- Implemented `--latest-hybrid`, `--oldest-hybrid`, and `--nearest-hybrid` options.
68
+
69
+
These allow you to export each page with resource requisites that are date-wise closest to the `stime` of the page itself, instead of taking globally `--latest`, `--oldest`, or `--nearest` versions of all requisite URLs.
70
+
71
+
At the moment, this takes a lot more memory, but makes the results much more consistent for websites that do not use versioned resource requisites.
72
+
73
+
- Implemented `--hardlink` and `--symlink` options, which allow exporting into content-addressed destinations.
74
+
75
+
I.e. `export mirror --hardlink` will render and write each exported file to `<--to>/_content/<hash/based/path>.<ext>` and only then hardlink the result to `<--to>/<output/format/based/path>.<ext>` target destination.
76
+
And similarly for `--symlink`.
77
+
78
+
Typically, doing this saves quite a bit of space, e.g., when pages refer to the same resource requisites by slightly different URLs, when the same images and fonts get distributed via different CDN hosts, when you export `--all` visits to some URLs and many of those are absolutely identical, etc.
79
+
80
+
So, from now on, `--hardlink` is the default.
81
+
The old behavior can be achieved by running it with `--copy` instead.
82
+
83
+
- Implemented `--relative` and `--absolute` options, which control if URLs should be remapped to relative or absolute `file:` URLs, respectively.
84
+
85
+
- Documented all the new things.
86
+
87
+
- Added a bunch of new `test-cli.sh` tests.
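
As a rough illustration of the content-addressed export described above for `--hardlink`, the sketch below writes each rendered file once under `_content/` and then hardlinks it to its target path. This is not the tool's actual code: the function name `export_hardlinked`, the use of SHA-256, and the two-level hash path layout are all assumptions made for the example.

```python
import hashlib
import os
from pathlib import Path


def export_hardlinked(to: Path, rel_target: str, data: bytes, ext: str) -> Path:
    """Write `data` to a content-addressed path under `to`, then hardlink it
    to `to / rel_target`. Returns the content-addressed path.

    Hypothetical helper: the hash function and path layout are assumptions.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Assumed layout: <to>/_content/<first two hex chars>/<rest of hash><ext>
    content = to / "_content" / digest[:2] / (digest[2:] + ext)
    content.parent.mkdir(parents=True, exist_ok=True)
    if not content.exists():
        # Identical content is written only once, however many targets use it.
        content.write_bytes(data)

    target = to / rel_target
    target.parent.mkdir(parents=True, exist_ok=True)
    target.unlink(missing_ok=True)  # replace any stale export at this path
    os.link(content, target)        # the target shares storage with the content file
    return content
```

Exporting the same bytes under two different target paths then stores them on disk once, which is where the space savings for duplicated requisites and identical visits would come from.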
88
+
89
+
### Changed
90
+
91
+
- `export mirror`:
92
+
93
+
- Switched default `--output` to `hupq_n` to prevent collisions when using `--*-hybrid` and `--all`.
94
+
95
+
- Improved handling of `base` `HTML` tags; `_target`s are supported now.
96
+
97
+
- Links that reference a page from itself will no longer refer to the page's filename, even when the link has no `fragment`.
98
+
99
+
The results can be a bit confusing, but this makes the new content de-duplication options much more effective.
100
+
101
+
- Made `export mirror` default filters explicit and changed them from `--method "GET" --status-re ".200C"` to `--method "GET" --method "DOM" --status-re ".200C"`.
102
+
103
+
- Implemented `--ignore-bad-inputs` and `--index-all-inputs` options to allow you to change the above default.
104
+
105
+
- Improved output log format.
106
+
107
+
- Improved file loading performance a bit.
108
+
109
+
- Improved documentation.
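
To make the new default filters concrete, here is a minimal sketch of how `--method "GET" --method "DOM" --status-re ".200C"` would evaluate under the conjunction semantics these release notes describe: repeated options of the same kind are OR-ed together, and the resulting clauses are AND-ed. The helper names `any_of` and `all_of`, the dictionary field names, and the sample status strings are assumptions for illustration, not the tool's internals.

```python
import re
from typing import Callable, Dict, Iterable

Reqres = Dict[str, str]                # hypothetical flat view of a reqres
Predicate = Callable[[Reqres], bool]


def any_of(preds: Iterable[Predicate]) -> Predicate:
    """OR together the alternatives given for a single option."""
    preds = list(preds)
    return lambda r: any(p(r) for p in preds)


def all_of(preds: Iterable[Predicate]) -> Predicate:
    """AND together the clauses produced by different options."""
    preds = list(preds)
    return lambda r: all(p(r) for p in preds)


# --method "GET" --method "DOM" --status-re ".200C"
default_filter = all_of([
    any_of([
        lambda r: r["method"] == "GET",
        lambda r: r["method"] == "DOM",
    ]),
    lambda r: re.match(r".200C", r["status"]) is not None,
])

# Sample inputs; the status strings are assumed example values.
samples = [
    {"method": "GET", "status": "C200C"},
    {"method": "DOM", "status": "C200C"},
    {"method": "POST", "status": "C200C"},
    {"method": "GET", "status": "C404C"},
]
accepted = [s for s in samples if default_filter(s)]
```

With these samples, only the first two entries pass: the `POST` entry fails the method clause and the `C404C` entry fails the status clause.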
110
+
111
+
### Fixed
112
+
113
+
- Added a bunch of new tests for `organize`, which cover the `organize --symlink --latest` bug of `tool-v0.18.0`.
114
+
Won't happen again.
115
+
116
+
- Fixed a couple of silly filtering-related bugs.
117
+
9
118
## [tool-v0.18.1] - 2024-11-30: Hotfixes
10
119
11
120
### Fixed
@@ -1700,6 +1809,7 @@ All planned features are complete now.
@@ -1781,16 +1891,18 @@ All planned features are complete now.
1781
1891
1782
1892
## `hoardy-web` tool
1783
1893
1784
-
- `scrub`:
1894
+
- `export mirror`, `scrub`:
1785
1895
- Handle SRI things.
1786
1896
- Handle CSP things.
1787
1897
- `export mirror`:
1788
1898
- Implement `export mirror --standalone`, which would inline all resources into each exported page, a-la `SingleFile`.
1789
-
- `*`:
1899
+
- `organize`:
1790
1900
- Implement automatic discernment of relatedness of `WRR` files (by URLs and similarity) and packing of related files into `WRR` bundles.
1791
1901
- Maybe: Implement data de-duplication between `WRR` files.
1792
-
- Implement `un206` command, which would reassemble a bunch of `GET 206` `WRR` files into a single `GET 200` `WRR` file.
1902
+
- Implement `un206` command/option, which would reassemble a bunch of `GET 206` `WRR` files into a single `GET 200` `WRR` file.
1793
1903
- `export mirror`, `organize`:
1904
+
- Allow unloading and lazy re-loading of reqres loaded from anything other than separate `WRR` files.
1905
+
The fact that this is not possible at the moment makes memory consumption in those cases rather abysmal.
1794
1906
- Implement on-the-fly mangling of reqres, so that, e.g. you could `organize` or `export` a reqres containing `https://web.archive.org/web/<something>/<URL>` as if it was just a `<URL>`.
1795
1907
- `*`:
1796
1908
- Non-dumb `HTTP` server with time+URL index and replay, i.e. a local `HTTP` UI a-la [Wayback Machine](https://web.archive.org/).
description = "Display, search, programmatically extract values from, organize, manipulate, import, and export Web Request+Response (`WRR`) files produced by the `Hoardy-Web` Web Extension browser add-on."