Add optional client-side playback to pywb #928
I wonder if we can avoid the build_wabac.sh script altogether, skip adding the built version, and instead install it as part of setup.py? We did this with warc2zim at one point, see:
Otherwise, looks good!
I did some quick testing against a couple of WARC files our curators recently reported replay problems with. Other than enabling client_side_replay I used the default pywb configuration and indexes generated by
Found a potential issue: if you mount pywb with a custom URL prefix (SCRIPT_NAME) like /wayback instead of /, then /static/sw.js can't be loaded, because it should instead be /wayback/static/sw.js. I guess we should pass static_prefix as another constructor parameter to WarcReplay instead of having "/static/sw.js?" hardcoded in loadWabac.js.
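To make the prefix problem concrete, here is a minimal Python sketch of deriving the service worker URL from the mount prefix instead of hardcoding it. The function `service_worker_url` and its parameters are hypothetical and not part of pywb; only the /static/sw.js path and the /wayback example come from the comment above.

```python
def service_worker_url(script_name: str, static_prefix: str = "static",
                       worker_file: str = "sw.js") -> str:
    """Build the service worker URL under an optional mount prefix.

    Hypothetical helper for illustration only, not pywb's implementation.
    """
    # Strip surrounding slashes so "", "/", "/wayback" and "/wayback/"
    # are all handled the same way, then drop any empty segments.
    parts = [script_name.strip("/"), static_prefix.strip("/"), worker_file]
    return "/" + "/".join(p for p in parts if p)


print(service_worker_url(""))          # mounted at the root
print(service_worker_url("/wayback"))  # mounted under a custom prefix
```

With a root mount this yields /static/sw.js, and with /wayback it yields /wayback/static/sw.js, which is the path the browser actually needs to fetch.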
Nice catch, thanks for testing @ato ! I pushed a change to use the static prefix for the service worker path but it seems something's still off (testing with the Apache Docker Compose setup in |
I think the remaining problem is This works for me: nla@b91c852
Note: I found when testing that if I had a service worker still running at / from previous test runs, I needed to manually unload it with the Firefox devtools (Application tab -> Service Workers). I'm not very satisfied with the way I've calculated those paths. It might be cleaner if they were passed through from the Python side, but I realized another configuration case we're currently not handling is the special $root collection. I briefly tried to make that work but didn't get very far; for some reason it just displays the service worker source code in the iframe. 😆
We've been doing some more testing and one of our curators noticed that opening links in a new tab doesn't work. Either through user action or the link containing Right-clicking to copy a link and pasting it into a new browser session also similarly fails. I guess the solution is to make the server handle /w/ URLs. I guess it would be nice if nav links could be non-/w/ URLs, although I'm not certain how feasible that is. |
I tried adding a redirect into frontendapp.py: self.url_map.add(Rule('/w/<coll>/<int:timestamp>mp_/<path:url>', redirect_to='<coll>/<timestamp>/<url>')) This helps when copying a link into a clean browser session. However, the open-in-new-tab case still seems broken: it just redirects to the first URL that was opened, even if that was from a completely different site! It looks like that's because of the baseUrl redirect in collection.ts Collection.makeTopFrame().
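The path mapping that this redirect rule performs can be sketched with the standard library alone. This is an illustration of the URL rewrite, not the werkzeug rule itself: the regular expression, the `redirect_target` name, and the leading slash on the result are assumptions, and the $root collection case is not handled.

```python
import re
from typing import Optional

# Hypothetical pattern mirroring '/w/<coll>/<int:timestamp>mp_/<path:url>'.
W_URL = re.compile(r"^/w/(?P<coll>[^/]+)/(?P<timestamp>\d+)mp_/(?P<url>.+)$")


def redirect_target(path: str) -> Optional[str]:
    """Map a client-side /w/ replay path to the server-side replay path.

    Returns None when the path is not a /w/ replay URL.
    """
    match = W_URL.match(path)
    if match is None:
        return None
    return "/{coll}/{timestamp}/{url}".format(**match.groupdict())


print(redirect_target("/w/my-coll/20230101000000mp_/https://example.com/page"))
print(redirect_target("/my-coll/20230101000000/https://example.com/page"))
```

The first call returns the non-/w/ form of the URL; the second returns None because the path is already a server-side replay URL.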
Not sure if this is the best way but I managed to fix it by removing Edit: I found I needed to clear site data for this change to take effect. I think because the collection config gets stored somewhere in the browser local data. |
@ato thanks for the detailed feedback. I pushed some changes that I think address the issues you've raised:
There were a couple of changes in wabac.js needed (webrecorder/wabac.js#237) |
I wasn't seeing a need to add a |
Brilliant. Thanks. I tested the changes and found these are all now working:
The scenario where I think this is needed is:
It's not a big deal, we can just add an nginx redirect in our deployment to cover it. I've just already seen emails containing /w/ URLs so I'm fully expecting them to end up all over the place even though they aren't ideally what we'd like people to be citing. |
Ah ok, that makes sense - just pushed a change that turns off the 'w' suffix in wabac.js. I think that works, if you have a chance to test a bit.
Branch force-pushed from 37d2c35 to 78cf834.
Thanks for this additional feedback - it looks like the last two use history-based nav, so I was able to fix an issue related to that (the URL should now update). My guess is that we need to switch to fuzzy matching being done in wabac.js and/or direct WARC loading, which would be the final stage of this, but is a bit more work. I think we can probably release this as a first stage, continue to do testing, and add the fuzzy matching in a separate update. It's something we've wanted to add anyway, and work would be done in wabac.js supporting a CDX index server endpoint.
Oh, that's really nice. And because the server-side rewriting is still there on the same URL, just masked by the service worker, it even gracefully degrades for simpler clients. If you copy a link, opening it in lynx or w3m just works.
Upon further testing, it seems like this might not actually be working. It looks like it's just falling back to server-side rewriting. Testing replay of https://hia.com.au/our-industry/hia-election-imperatives/wa-election-priorities/ With old commit 1a7daac the page replays fine, I see the
With current commit 78cf834 the page is blank, I see the
Poking around with the debugger I found that SWReplay.replayPrefix is this.prefix = self.registration ? self.registration.scope : ""; and then an extra slash gets appended here: this.replayPrefix = this.prefix + (sp.get("replayPrefix") ?? "w") + "/"; |
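The concatenation quoted above can be mirrored in Python to show where a doubled slash can come from. This sketch assumes the service worker scope already ends in "/" and that the replayPrefix parameter is passed as an empty string (as it would be with the 'w' suffix turned off); both function names are hypothetical, and `normalized_replay_prefix` is only one possible guard, not the fix that landed in wabac.js.

```python
from typing import Optional


def naive_replay_prefix(scope: str, replay_prefix: Optional[str]) -> str:
    """Mirror of: prefix + (sp.get("replayPrefix") ?? "w") + "/"."""
    # Like JavaScript's ??, fall back to "w" only when the value is
    # missing (None), not when it is an empty string.
    suffix = "w" if replay_prefix is None else replay_prefix
    return scope + suffix + "/"


def normalized_replay_prefix(scope: str, replay_prefix: Optional[str]) -> str:
    """Hypothetical guard: append the trailing slash only for a non-empty suffix."""
    suffix = "w" if replay_prefix is None else replay_prefix
    base = scope if scope.endswith("/") else scope + "/"
    if suffix:
        return base + suffix + "/"
    return base


scope = "https://example.org/wayback/"
print(naive_replay_prefix(scope, None))      # parameter missing: "w" suffix used
print(naive_replay_prefix(scope, ""))        # empty suffix: ends in a doubled slash
print(normalized_replay_prefix(scope, ""))   # guarded: single trailing slash
```

With the parameter missing, the naive version gives https://example.org/wayback/w/, but with an empty suffix it gives https://example.org/wayback//, so request URLs would no longer match the expected replay prefix.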
Good catch, as always! Have a fix in wabac.js (webrecorder/wabac.js@121a195) |
@ato if you have a chance, can you check with latest to see if it is working now? |
Yep can confirm that the test page is replaying properly now and I see the expected ir_ modifiers in the access log. Thanks for all the fixes. |
I dug into this regression example a bit more https://www.slq.qld.gov.au/blog/john-oxley-library If you make the POST request directly to pywb it picks the right response:
But with wabac.js doing the convertGetToPost() transformation it instead returns the graphql response for a menu:
I tried enabling |
|
Yep, confirming with those changes client-side replay is now at parity with the server-side replay for the SLQ blog example. That was the last remaining regression example I had, so this is looking very promising. :-) |
- pass swScope directly to loadWabac
- pass collection name to loadWabac
- support running with non-root prefix
- support new 'baseUrlAppendReplay' flag in wabac.js
- support '$root' collection, both in pywb and wabac.js same path
ensure history changes also update url in client-side replay mode
Branch force-pushed from 5b0fcdc to 5ae00a2.
Thanks for the feedback.
May have a fix for the former, which may be dependent on devicePixelRatio, which probably should default to 2 now instead of 1 as was traditionally the case. The 'waipuhighlandgames' site seems a bit tricky, it probably picks different image based on screen dimensions.
Not able to repro these - tested with both and it seemed to be fine. Perhaps the service worker wasn't active when it was being tested? Or if you have a specific page can take a look. |
also update wombat dependency to latest
…ker mode, if service workers are not available, eg. check for navigator.serviceWorker and if null, don't attempt to init sw-based path
Branch force-pushed from ed93cce to 18f1326.
* Allow to configure uWSGI mount via environment variable (webrecorder#926)
  * Introduce UWSGI_MOUNT env var
  * Add a note to the documentation.
* Refuse to serve static files that are outside of static_dir (webrecorder#932)
  Prevents the path traversal attack reported in webrecorder#931
* version: bump to 2.8.4
* Fix tests, support py3.9, 3.10, 3.11 (webrecorder#933)
  - tests: fix or disable tests that no longer work reliably, eg. depend on external sites
  - support python 3.9, 3.10, 3.11 in tests for now
  - bump version to 2.9.0-beta.0
* Add optional client-side playback to pywb (webrecorder#928)
  This PR adds optional client-side replay in pywb's framed replay mode, using wabac.js. This is implemented using wabac.js's live proxy mode, similar to the implementation by Alex Osborne's proof of concept and enabled via the config.yaml file. Documentation has also been added.
  The service worker proxies to the original pywb URLs and allows for 'graceful fallback' if service workers are not supported.
  Client side replay can be enabled by setting `client_side_replay: true` in config.yaml
  The wabac.js service worker is added to the pywb static directory at installation time via setup.py. The wabac.js version can be bumped via a constant in that file (current version is 2.22.12)
  In addition, a few small housekeeping changes are also included:
  - The Python version in the pywb Dockerfile is updated to 3.11 to avoid using an unsupported version of Python
  - Similarly, CI now runs on Python 3.9-3.11 to drop older versions that are no longer supported in GH Actions runners
  - wombat updated to latest 2.8.10
  bump version to 2.9.0-beta.0
  ---------
  Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* Update README.rst
* Fix py3.9 CI (webrecorder#934)
  * ci: attempt to fix tests for 3.9 by skipping test that intermittently hang.
* simplify: use existing prefix as archivePrefix, fixes webrecorder#937 (webrecorder#938)
  update to wabac.js 2.22.15
  bump to 2.9.0b1
---------
Co-authored-by: Natanael Arndt <arndtn@gmail.com>
Co-authored-by: Alex Osborne <aosborne@nla.gov.au>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Description
This PR adds optional client-side replay in pywb's framed replay mode, using wabac.js. This is implemented using wabac.js's live proxy mode, similar to the implementation by Alex Osborne's proof of concept and enabled via the config.yaml file. Documentation has also been added.

The wabac.js static worker is included in the pywb static directory and a new route added to serve it. The wabac.js version can be updated using the included build-wabac.sh script, which fetches the service worker from the npm CDN and copies it into the static directory with the correct filename (changed in pywb from sw.js to wabacWorker.js, as we have several service workers).

Edit: The wabac.js service worker is added to the pywb static directory at installation time via setup.py. The wabac.js version can be bumped via a constant in that file.

In addition, I've made a few small housekeeping changes:
Note that there are currently some unrelated failing tests which will be addressed in separate PRs.
Motivation and Context
Fixes #924