Commit 0f4c9c1

tool/*, CHANGELOG.md: bump version
1 parent a98667d commit 0f4c9c1

3 files changed, +117 -5 lines changed


CHANGELOG.md

Lines changed: 115 additions & 3 deletions
@@ -6,6 +6,115 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

 Also, at the bottom of this file there is [a TODO list](#todo) with planned future changes.

+## [tool-v0.19.0] - 2024-12-07: Powerful filtering, exporting of different URL visits, hybrid export modes
+
+### Changed: Semantics
+
+- `*`:
+
+- In `--expr` expressions, `sha256` function changed semantics.
+From now on it returns the raw hash digest instead of the hexadecimal one.
+To get the old value, use `sha256|to_hex`.
+
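The difference between the two return values can be illustrated with Python's standard `hashlib`. This is only an analogy for the `sha256` and `sha256|to_hex` expressions described above, not the tool's implementation, and the sample `body` value is made up for the example.

```python
import hashlib

# Hypothetical response body, used only for illustration.
body = b"hello world"

# 32 raw bytes, analogous to what `sha256` returns from now on.
raw_digest = hashlib.sha256(body).digest()

# 64 hexadecimal characters, analogous to `sha256|to_hex` (the old value).
hex_digest = hashlib.sha256(body).hexdigest()

# Hex-encoding the raw digest recovers the old-style value.
assert raw_digest.hex() == hex_digest
print(len(raw_digest), len(hex_digest))
```

Anything that compared or stored the old hexadecimal value would need the extra `to_hex` step to keep producing the same strings.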
+### Added
+
+- `*` except `organize --move`, `organize --hardlink`, `organize --symlink`, `get`, and `run`:
+
+- From now on, all sub-commands except for the above can take inputs in all supported file formats.
+
+I.e., you can now do
+
+```bash
+hoardy-web export mirror --to ~/hoardy-web/mirror1 mitmproxy.*.dump
+```
+
+on `mitmproxy` dumps without even `import`ing them first.
+
+- By default, the above commands now also automatically dispatch between loaders of different file formats based on file extensions.
+So you can mix and match different file formats on the same command line.
+
+- Added a bunch of `--load-*` options that force a specific loader instead, e.g. `--load-wrrb`, `--load-mitmproxy`.
+
+- `*`:
+
+- Added a ton of new filtering options.
+
+For example, you can now do:
+
+```bash
+hoardy-web find --method GET --method DOM --status-re .200C --response-mime text/html \
+--response-body-grep-re "\bPotter\b" ~/hoardy-web/raw
+```
+
+As before, these filters can still be used with other commands, like `stream`, or `export mirror`, etc.
+
+`--root-*` options of `export mirror` now use the same syntax and machinery as the normal input filters.
+
+Also, the overall filtering semantics changed a bit.
+The top-level logical expression the filters compute is now a large conjunction.
+I.e. the above example now compiles to, a bit simplified, `(response.method == "GET" or response.method == "DOM") and re.match(".200C", status) and (response_mime == "text/html") and re.match("\\bPotter\\b", response.body)`.
+
+- Added a bunch of new `--output` formats.
+Mostly, this adds a bunch of output formats that refer to `stime`s.
+Mainly, to simplify `export mirror --all` usage, described below.
+
+- `export mirror`:
+
+- Implemented exporting of different URL visits.
+
+I.e., you can now export not just the `--latest` visit to each URL, but also the `--oldest` one, the one `--nearest` to a given date, or `--all` of them.
+
+- Implemented `--latest-hybrid`, `--oldest-hybrid`, and `--nearest-hybrid` options.
+
+These allow you to export each page with resource requisites that are date-wise closest to the `stime` of the page itself, instead of taking the globally `--latest`, `--oldest`, or `--nearest` versions of all requisite URLs.
+
+At the moment, this takes a lot more memory, but makes the results much more consistent for websites that do not use versioned resource requisites.
+
+- Implemented `--hardlink` and `--symlink` options, which allow exporting into content-addressed destinations.
+
+I.e. `export mirror --hardlink` will render and write each exported file to `<--to>/_content/<hash/based/path>.<ext>` and only then hardlink the result to `<--to>/<output/format/based/path>.<ext>` target destination.
+And similarly for `--symlink`.
+
+Typically, doing this saves quite a bit of space, e.g., when pages refer to the same resource requisites by slightly different URLs, when the same images and fonts get distributed via different CDN hosts, or when you export `--all` visits to some URLs and many of those are absolutely identical.
+
+So, from now on, `--hardlink` is the default.
+The old behavior can be achieved by running it with `--copy` instead.
+
+- Implemented `--relative` and `--absolute` options, which control if URLs should be remapped to relative or absolute `file:` URLs, respectively.
+
+- Documented all the new things.
+
+- Added a bunch of new `test-cli.sh` tests.
+
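The content-addressed layout behind `export mirror --hardlink` can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the code of `hoardy-web`: the function name `export_hardlinked`, the choice of SHA-256, and the two-level split of the hash into a directory and a file name are all hypothetical.

```python
import hashlib
import os
import tempfile


def export_hardlinked(root: str, rel_path: str, data: bytes, ext: str) -> str:
    """Write `data` once under `<root>/_content/` and hardlink it to `<root>/<rel_path>`.

    Hypothetical helper: the hash function and directory layout are assumptions.
    """
    digest = hashlib.sha256(data).hexdigest()
    content_path = os.path.join(root, "_content", digest[:2], digest[2:] + ext)
    if not os.path.exists(content_path):
        # First time this content is seen: store it under its hash-based path.
        os.makedirs(os.path.dirname(content_path), exist_ok=True)
        with open(content_path, "wb") as f:
            f.write(data)

    target = os.path.join(root, rel_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if os.path.lexists(target):
        os.unlink(target)
    # The output-format-based path is only a hardlink to the stored content.
    os.link(content_path, target)
    return content_path


# Two different output paths with identical content share a single file on disk.
with tempfile.TemporaryDirectory() as root:
    css = b"body { margin: 0 }"
    a = export_hardlinked(root, "cdn1.example.org/style.css", css, ".css")
    b = export_hardlinked(root, "cdn2.example.org/style.css", css, ".css")
    same_file = os.path.samefile(
        os.path.join(root, "cdn1.example.org/style.css"),
        os.path.join(root, "cdn2.example.org/style.css"),
    )
    assert a == b and same_file
```

Because both targets point at the same stored file, identical resources served from different hosts, or identical repeated visits, take up disk space only once.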
+### Changed
+
+- `export mirror`:
+
+- Switched default `--output` to `hupq_n` to prevent collisions when using `--*-hybrid` and `--all`.
+
+- Improved handling of `base` `HTML` tags; `_target`s are now supported.
+
+- Links that reference a page from itself will no longer refer to the page's filename, even when the link has no `fragment`.
+
+The results can be a bit confusing, but this makes the new content de-duplication options much more effective.
+
+- Made `export mirror` default filters explicit and changed them from `--method "GET" --status-re ".200C"` to `--method "GET" --method "DOM" --status-re ".200C"`.
+
+- Implemented `--ignore-bad-inputs` and `--index-all-inputs` options to allow you to change the above default.
+
+- Improved output log format.
+
+- Improved file loading performance a bit.
+
+- Improved documentation.
+
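The conjunction semantics of the filters, as applied to the new `export mirror` defaults above, can be sketched in Python. This is a simplified model under stated assumptions, not the tool's actual filter compiler: the `Visit` record, its field names, `compile_filters`, and the sample status strings are hypothetical. Values given to the same option are OR-ed, different options are AND-ed, and the status pattern is applied with `re.match`, as in the changelog's own simplified expression.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Visit:
    # Hypothetical, simplified stand-in for a loaded request+response.
    method: str
    status: str


def compile_filters(methods: List[str], status_res: List[str]) -> Callable[[Visit], bool]:
    """Build a predicate: OR within each option, AND across options."""
    status_patterns = [re.compile(p) for p in status_res]

    def predicate(v: Visit) -> bool:
        # An option that was not given at all accepts everything.
        method_ok = not methods or v.method in methods
        status_ok = not status_patterns or any(p.match(v.status) for p in status_patterns)
        return method_ok and status_ok

    return predicate


# Modeled on the new defaults: --method "GET" --method "DOM" --status-re ".200C"
default_filter = compile_filters(["GET", "DOM"], [".200C"])

# Sample visits with made-up status strings.
visits = [
    Visit("GET", "C200C"),
    Visit("DOM", "C200C"),
    Visit("POST", "C200C"),
    Visit("GET", "C404C"),
]
accepted = [v.method + " " + v.status for v in visits if default_filter(v)]
assert accepted == ["GET C200C", "DOM C200C"]
```

Adding another `--method` value widens the set of accepted visits, while adding a different kind of option can only narrow it.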
+### Fixed
+
+- Added a bunch of new tests for `organize`, which cover the `organize --symlink --latest` bug of `tool-v0.18.0`.
+Won't happen again.
+
+- Fixed a couple of silly filtering-related bugs.
+
 ## [tool-v0.18.1] - 2024-11-30: Hotfixes

 ### Fixed
@@ -1700,6 +1809,7 @@ All planned features are complete now.

 - Initial public release.

+[tool-v0.19.0]: https://github.com/Own-Data-Privateer/hoardy-web/compare/tool-v0.18.1...tool-v0.19.0
 [tool-v0.18.1]: https://github.com/Own-Data-Privateer/hoardy-web/compare/tool-v0.18.0...tool-v0.18.1
 [tool-v0.18.0]: https://github.com/Own-Data-Privateer/hoardy-web/compare/tool-v0.17.0...tool-v0.18.0
 [tool-v0.17.0]: https://github.com/Own-Data-Privateer/hoardy-web/compare/tool-v0.16.0...tool-v0.17.0
@@ -1781,16 +1891,18 @@ All planned features are complete now.

 ## `hoardy-web` tool

-- `scrub`:
+- `export mirror`, `scrub`:
 - Handle SRI things.
 - Handle CSP things.
 - `export mirror`:
 - Implement `export mirror --standalone`, which would inline all resources into each exported page, a-la `SingleFile`.
-- `*`:
+- `organize`:
 - Implement automatic discernment of relatedness of `WRR` files (by URLs and similarity) and packing of related files into `WRR` bundles.
 - Maybe: Implement data de-duplication between `WRR` files.
-- Implement `un206` command, which would reassemble a bunch of `GET 206` `WRR` files into a single `GET 200` `WRR` file.
+- Implement `un206` command/option, which would reassemble a bunch of `GET 206` `WRR` files into a single `GET 200` `WRR` file.
 - `export mirror`, `organize`:
+- Allow unloading and lazy re-loading of reqres loaded from anything other than separate `WRR` files.
+The fact that this is not possible at the moment makes memory consumption in those cases rather abysmal.
 - Implement on-the-fly mangling of reqres, so that, e.g. you could `organize` or `export` a reqres containing `https://web.archive.org/web/<something>/<URL>` as if it was just a `<URL>`.
 - `*`:
 - Non-dumb `HTTP` server with time+URL index and replay, i.e. a local `HTTP` UI a-la [Wayback Machine](https://web.archive.org/).

tool/default.nix

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ with pkgs.python3Packages;

 buildPythonApplication (rec {
 pname = "hoardy-web";
-version = "0.18.1";
+version = "0.19.0";
 format = "pyproject";

 inherit (source) src unpackPhase;

tool/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 packages = ["hoardy_web"]
 [project]
 name = "hoardy-web"
-version = "0.18.1"
+version = "0.19.0"
 authors = [{ name = "Jan Malakhovski", email = "oxij@oxij.org" }]
 description = "Display, search, programmatically extract values from, organize, manipulate, import, and export Web Request+Response (`WRR`) files produced by the `Hoardy-Web` Web Extension browser add-on."
 readme = "README.md"
