Add optional client-side playback to pywb #928

Merged
merged 30 commits into from
Apr 24, 2025

Conversation

tw4l
Member

@tw4l tw4l commented Mar 12, 2025

Description

This PR adds optional client-side replay in pywb's framed replay mode, using wabac.js. It is implemented using wabac.js's live proxy mode, similar to Alex Osborne's proof of concept, and is enabled via the config.yaml file. Documentation has also been added.

The wabac.js static worker is included in the pywb static directory and a new route added to serve it. The wabac.js version can be updated using the included build-wabac.sh script, which fetches the service worker from the npm CDN and copies it into the static directory with the correct filename (changed in pywb from sw.js to wabacWorker.js, as we have several service workers). Edit: The wabac.js service worker is added to the pywb static directory at installation time via setup.py. The wabac.js version can be bumped via a constant in that file.
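
As a rough illustration of the install-time approach described in the edit above, the following is a minimal sketch, not the actual setup.py code. The constant name `WABAC_JS_VERSION`, the jsDelivr URL layout, the default filename `sw.js`, and both function names are assumptions made for this example; the version number is the one quoted in the merge commit message.

```python
from pathlib import Path
from urllib.request import urlopen

# Hypothetical constant: bumping this value would pull a newer service worker.
WABAC_JS_VERSION = "2.22.12"


def wabac_sw_url(version: str) -> str:
    """Build the CDN URL for a given wabac.js release (assumed jsDelivr layout)."""
    return f"https://cdn.jsdelivr.net/npm/@webrecorder/wabac@{version}/dist/sw.js"


def _download(url: str) -> bytes:
    """Fetch a URL and return the response body."""
    with urlopen(url) as resp:
        return resp.read()


def install_service_worker(static_dir, version=WABAC_JS_VERSION,
                           filename="sw.js", fetch=_download) -> Path:
    """Download the service worker and write it into the static directory.

    `fetch` is injectable so the function can be exercised without network access.
    """
    target = Path(static_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(wabac_sw_url(version)))
    return target
```

Keeping the version in a single constant means a version bump is a one-line change, and nothing built needs to be committed to the repository.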

In addition, I've made a few small housekeeping changes:

  • The Python version in the pywb Dockerfile is updated to 3.11 to avoid using an unsupported version of Python
  • Similarly, CI now runs on Python 3.9-3.11 to drop older versions that are no longer supported in GH Actions runners

Note that there are currently some unrelated failing tests which will be addressed in separate PRs.

Motivation and Context

Fixes #924

To Do Before Merging

  • Bump wabac.js to 2.21.4 when it's released to fix issue noticed in testing with redirects
  • Test with a wider range of sites and pywb deployment types

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added or updated tests to cover my changes.
  • All new and existing tests passed.

@tw4l tw4l requested a review from ikreymer March 12, 2025 18:25
@ikreymer
Member

I wonder if we can avoid the build_wabac.sh script altogether and, instead of committing the built version, install it as part of setup.py? We did this with warc2zim at one point, see:
https://github.com/openzim/warc2zim/blob/af831c02b68b17b5e41b71ee16091117154f53bf/setup.py#L27
(Probably should do same with wombat too)

@ikreymer
Member

Otherwise, looks good!

@tw4l tw4l requested a review from ikreymer March 13, 2025 17:34
@ato
Contributor

ato commented Mar 14, 2025

I did some quick testing against a couple of WARC files our curators recently reported replay problems with. Other than enabling client_side_replay I used the default pywb configuration and indexes generated by wb-manager add.

| URL | server-side pywb (old) | client-side pywb (new) | replayweb.page |
| --- | --- | --- | --- |
| https://hia.com.au/our-industry/hia-election-imperatives/wa-election-priorities/ | blank page | working perfectly | working perfectly |
| https://www.slq.qld.gov.au/blog/john-oxley-library | post list loads, clicking posts gives page not found, refresh fixes them | post list just gets stuck showing a loading spinner, no obvious cause in the js console (there's some react errors but replayweb.page has them too) | working perfectly |
| https://thedustybox.substack.com/ | can't close the initial email subscription modal | modal closes ok, main page looks good, some links work but most give 'Post not found' | same as client-side pywb (crawl may be incomplete) |

@ato
Contributor

ato commented Mar 19, 2025

Found a potential issue: If you mount pywb with a custom URL prefix (SCRIPT_NAME) like /wayback instead of / then /static/sw.js can't be loaded because it should instead be /wayback/static/sw.js. I guess we should pass static_prefix as another constructor parameter to WarcReplay instead of having "/static/sw.js?" hardcoded in loadWabac.js.

@tw4l
Member Author

tw4l commented Mar 19, 2025

Found a potential issue: If you mount pywb with a custom URL prefix (SCRIPT_NAME) like /wayback instead of / then /static/sw.js can't be loaded because it should instead be /wayback/static/sw.js. I guess we should pass static_prefix as another constructor parameter to WarcReplay instead of having "/static/sw.js?" hardcoded in loadWabac.js.

Nice catch, thanks for testing @ato ! I pushed a change to use the static prefix for the service worker path but it seems something's still off (testing with the Apache Docker Compose setup in sample-deploy/). Will keep looking into it.

@ato
Contributor

ato commented Mar 20, 2025

I think the remaining problem is collName isn't getting calculated correctly in the non-/ mountpoint situation. I think we should also scope the serviceworker just to the mountpoint so that it doesn't conflict with other serviceworkers on the same host. Particularly for the case of multiple pywb instances mounted on different prefixes of the same host, but I think there's potential for conflict with other apps that use serviceworkers too.

This works for me: nla@b91c852

  1. Calculates collName as the second last path segment of prefix instead of the first path segment.
  2. Sets the serviceworker scope to the path from static_prefix with "static" stripped from the end.
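
The two path calculations listed above can be modelled in a few lines. This is a Python sketch of the idea only, not the code from the linked commit (which lives in the JavaScript loader); the function names are hypothetical, and it assumes the prefix has the form `.../<coll>/` with a trailing slash.

```python
from urllib.parse import urlsplit


def coll_name_from_prefix(prefix: str) -> str:
    """Return the second-last path segment of the prefix.

    For "/wayback/test/" the split segments are ["", "wayback", "test", ""],
    so the second-last one is the collection name even under a mount point.
    """
    segments = urlsplit(prefix).path.split("/")
    return segments[-2] if len(segments) >= 2 else ""


def sw_scope_from_static_prefix(static_prefix: str) -> str:
    """Strip a trailing "static" from the static prefix to get the mount point."""
    trimmed = static_prefix.rstrip("/")
    if trimmed.endswith("static"):
        trimmed = trimmed[: -len("static")]
    # Service worker scopes are expected to end with a slash.
    return trimmed if trimmed.endswith("/") else trimmed + "/"
```

With pywb mounted at `/wayback`, `sw_scope_from_static_prefix("/wayback/static/")` gives `/wayback/`, which keeps the service worker from claiming requests for other applications on the same host.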

Note: I found when testing if I had a serviceworker still running at / from previous test runs I needed to manually unload it with the Firefox devtools (Application tab -> Service Workers).

I'm not very satisfied with the way I've calculated those paths. It might be cleaner if they were passed through from the python side, but {{ coll }} in the jinja template just resolves to the empty string for some reason and I haven't dug into how to plumb them through from the calling code.

I realized another configuration case we're currently not handling is the special $root collection. I briefly tried to make that work but didn't get very far, for some reason it just displays the serviceworker source code in the iframe. 😆

@ato
Contributor

ato commented Mar 26, 2025

We've been doing some more testing and one of our curators noticed that opening links in a new tab doesn't work, whether through user action or a link containing target=_blank. It results in either the error "Collection not found: w" or a redirect loop.

Right-clicking to copy a link and pasting it into a new browser session fails in a similar way.

I guess the solution is to make the server handle /w/ URLs. I guess it would be nice if nav links could be non-/w/ URLs, although I'm not certain how feasible that is.

@ato
Contributor

ato commented Mar 26, 2025

I tried adding a redirect into frontendapp.py:

self.url_map.add(Rule('/w/<coll>/<int:timestamp>mp_/<path:url>', redirect_to='<coll>/<timestamp>/<url>'))

This helps when copying a link into a clean browser session, however the open in new tab case still seems broken: it just redirects to the first URL that was opened, even if that was from a completely different site! It looks like that's because of the baseUrl redirect in collection.ts Collection.makeTopFrame().
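
To make the intended mapping of that rule concrete, here is a standard-library sketch of the same path rewrite. It is a regex model of what the rule is meant to do, not Werkzeug code, and the names `W_URL` and `redirect_target` are invented for this example.

```python
import re
from typing import Optional

# Matches /w/<coll>/<timestamp>mp_/<url>, mirroring the rule pattern above.
W_URL = re.compile(r"^/w/(?P<coll>[^/]+)/(?P<timestamp>\d+)mp_/(?P<url>.+)$")


def redirect_target(path: str) -> Optional[str]:
    """Map a service-worker style /w/ path to the plain pywb replay path.

    Returns None when the path is not a /w/ replay URL.
    """
    match = W_URL.match(path)
    if match is None:
        return None
    return "/{coll}/{timestamp}/{url}".format(**match.groupdict())
```

For example, `/w/test/20250130043909mp_/https://hia.com.au/` maps to `/test/20250130043909/https://hia.com.au/`, which the server can answer even when no service worker is registered yet.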

@ato
Contributor

ato commented Mar 26, 2025

Not sure if this is the best way but I managed to fix it by removing baseUrl and baseUrlHashReplay and instead setting topTemplateUrl so that it just redirects top-frame /w/ URLs to the pywb main URL (the no modifier one). Here are the commits for that and for the frontendapp redirect. Both are needed, one for the case the browser has the serviceworker already running and the other for if it doesn't have it yet (e.g. opening a copy pasted /w/ URL).

nla@5245128
nla@a5aef9c

Edit: I found I needed to clear site data for this change to take effect. I think because the collection config gets stored somewhere in the browser local data.

@ikreymer
Member

@ato thanks for the detailed feedback. I pushed some changes that I think address the issues you've raised:

  • service worker scope + collection name are now passed in
  • non-root prefix should be supported, reflected in the service worker scope that is set
  • root collections are also supported, the root collection is also set in wabac.js

There were a couple of changes in wabac.js needed (webrecorder/wabac.js#237)
This PR now is set to build the sw.js from that branch for easier testing (to be removed before merge).

@ikreymer ikreymer marked this pull request as draft March 27, 2025 03:02
@ikreymer
Member

I wasn't seeing a need to add a /w/... redirect in pywb - that path should be getting handled by the service worker always.
Perhaps I missed a case where it may be needed, or hopefully this is now fixed. The top-frame URL is also handled with builtin support for ts/url in wabac.js, so no need for a completely custom option (though neat that that worked!)

@ato
Contributor

ato commented Mar 27, 2025

I pushed some changes, that I think addresses the issues you've raised

Brilliant. Thanks. I tested the changes and found these are all now working:

  • Non-/ mountpoints
  • $root collection
  • $root collection on a non-/ mountpoint
  • Right click -> Open link in new tab
  • Links with target=_blank

I wasn't seeing a need to add a /w/... redirect in pywb - that path should be getting handled by the service worker always.

The scenario where I think this is needed is:

  1. User right clicks a link in an archived page and selects 'Copy link address'. This copies a /w/ URL.
  2. They paste the link into an email to a friend.
  3. Friend has never visited the archive before and so their browser doesn't have the serviceworker registered.
  4. Friend opens the emailed link and sees "Pywb error: Collection not found: w".

It's not a big deal, we can just add an nginx redirect in our deployment to cover it. I've just already seen emails containing /w/ URLs so I'm fully expecting them to end up all over the place even though they aren't ideally what we'd like people to be citing.

@ikreymer
Member

The scenario where I think this is needed is:

  1. User right clicks a link in an archived page and selects 'Copy link address'. This copies a /w/ URL.
  2. They paste the link into an email to a friend.
  3. Friend has never visited the archive before and so their browser doesn't have the serviceworker registered.
  4. Friend opens the emailed link and sees "Pywb error: Collection not found: w".

It's not a big deal, we can just add an nginx redirect in our deployment to cover it. I've just already seen emails containing /w/ URLs so I'm fully expecting them to end up all over the place even though they aren't ideally what we'd like people to be citing.

Ah ok, that makes sense - just pushed a change that turns off the 'w' suffix in wabac.js. I think that works, if you have a chance to test a bit.
The initial load will be w/o the service worker, but then it redirects to the top frame, which loads the service worker, and replay is loaded through the service worker.

@tw4l tw4l force-pushed the issue-924-client-side-playback branch from 37d2c35 to 78cf834 Compare March 27, 2025 18:58
@tw4l tw4l marked this pull request as ready for review March 27, 2025 19:31
@ikreymer
Member

I did some quick testing against a couple of WARC files our curators recently reported replay problems with. Other than enabling client_side_replay I used the default pywb configuration and indexes generated by wb-manager add.

| URL | server-side pywb (old) | client-side pywb (new) | replayweb.page |
| --- | --- | --- | --- |
| https://hia.com.au/our-industry/hia-election-imperatives/wa-election-priorities/ | blank page | working perfectly | working perfectly |
| https://www.slq.qld.gov.au/blog/john-oxley-library | post list loads, clicking posts gives page not found, refresh fixes them | post list just gets stuck showing a loading spinner, no obvious cause in the js console (there's some react errors but replayweb.page has them too) | working perfectly |
| https://thedustybox.substack.com/ | can't close the initial email subscription modal | modal closes ok, main page looks good, some links work but most give 'Post not found' | same as client-side pywb (crawl may be incomplete) |

Thanks for this additional feedback - it looks like the last two use history-based nav, so I was able to fix an issue related to that (URL should now update).
One of them also uses lots of graphql POST requests, so I suspect something is related to that.
I did a quick sample with ArchiveWeb.page and everything replayed ok, but of course that was with history nav.

My guess is that we need to switch to fuzzy matching being done in wabac.js and/or direct WARC loading, which would be the final stage of this but is a bit more work. I think we can probably release this as a first stage, continue testing, and add the fuzzy matching in a separate update.

It's something we've wanted to add anyway, and work would be done in wabac.js supporting a CDX index server endpoint.

@ato
Contributor

ato commented Mar 28, 2025

just pushed a changed that turns off the 'w' suffix in wabac.js. I think that works, if you have a chance to test a bit.

Oh, that's really nice. And because the server-side rewriting is still there on the same URL, just masked by the serviceworker, it even gracefully degrades for simpler clients. If you copy a link, opening it in lynx or w3m just works.

@ato
Contributor

ato commented Mar 29, 2025

just pushed a changed that turns off the 'w' suffix in wabac.js.

Upon further testing, it seems like this might not actually be working. It looks like it's just falling back to server-side rewriting.

Testing replay of https://hia.com.au/our-industry/hia-election-imperatives/wa-election-priorities/

With old commit 1a7daac the page replays fine, I see the ir_ modifier in request log:

127.0.0.1 - - [2025-03-28 16:29:54] "GET /test/20250130043909ir_/https://hia.com.au/our-industry/hia-election-imperatives/wa-election-priorities/ HTTP/1.1" 200 189757 0.008965

With current commit 78cf834 the page is blank, I see the mp_ modifier in request log:

127.0.0.1 - - [2025-03-28 16:27:37] "GET /test/20250130043909mp_/https://hia.com.au/our-industry/hia-election-imperatives/wa-election-priorities/ HTTP/1.1" 200 196607 0.050274

Poking around with the debugger I found that SWReplay.replayPrefix is http://localhost:8080// (note the double slash at the end) and so the replay code never gets called. I think this is because prefix comes from the serviceworker scope which already has a trailing slash:

this.prefix = self.registration ? self.registration.scope : "";

and then an extra slash gets appended here:

this.replayPrefix = this.prefix + (sp.get("replayPrefix") ?? "w") + "/";
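
The double slash can be reproduced with a small model of the two lines above. This is a Python sketch of the string logic only; the guarded variant shows one possible way to avoid the problem and is not necessarily how wabac.js fixed it. Both function names are hypothetical.

```python
from typing import Optional


def replay_prefix_naive(scope: str, replay_prefix: Optional[str] = None) -> str:
    """Mirror the original expression: prefix + (replayPrefix ?? "w") + "/"."""
    # Like `??`, only a missing value falls back to "w"; an empty string is kept.
    suffix = "w" if replay_prefix is None else replay_prefix
    return scope + suffix + "/"


def replay_prefix_guarded(scope: str, replay_prefix: Optional[str] = None) -> str:
    """Append the separator only when there is a non-empty suffix to separate."""
    suffix = "w" if replay_prefix is None else replay_prefix
    base = scope if scope.endswith("/") else scope + "/"
    if suffix:
        return base + suffix + "/"
    return base
```

With the scope `http://localhost:8080/` and an empty `replayPrefix`, the naive version yields `http://localhost:8080//`, which no request URL starts with, so the replay path is never taken; the guarded version yields `http://localhost:8080/`.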

@ikreymer
Member

ikreymer commented Mar 29, 2025

Good catch, as always! I have a fix in wabac.js (webrecorder/wabac.js@121a195) and pushed a change to pin to that version here for easier testing.
Just need to 'pass through' the top frame and any other unknown requests when removing the 'w' suffix. Hopefully this does the trick!

@ikreymer
Member

ikreymer commented Apr 2, 2025

@ato if you have a chance, can you check with latest to see if it is working now?

@ato
Contributor

ato commented Apr 2, 2025

Yep can confirm that the test page is replaying properly now and I see the expected ir_ modifiers in the access log. Thanks for all the fixes.

@ato
Contributor

ato commented Apr 3, 2025

I dug into this regression example a bit more https://www.slq.qld.gov.au/blog/john-oxley-library
And indeed it's getting the wrong response to the graphql query that loads the articles.

If you make the POST request directly to pywb it picks the right response:

$ curl 'http://localhost:8080/test/20250310130010mp_/https://content.slq.qld.gov.au/graphql' -X POST -H 'Content-Type: application/json' --data-raw '{"extensions":{"persistedQuery":{"version":1,"sha256Hash":"bbc6b42daa873234edec7063b3bc947f2e0164bfb86aa37bedf82fd8b3b64e78"}},"variables":{"limit":12,"filters":[{"label":"type","value":"article"},{"label":"aggregated_field","value":"1082"}],"keyword":"","sort":[{"field":"publishedOn","type":"DESC"}],"offset":0}}'
{"data":{"search":{"items":[{"__typename":"Article","description":"March Forward ...

But with wabac.js doing the convertGetToPost() transformation it instead returns the graphql response for a menu:

$ curl 'http://localhost:8080/test/20250310130010ir_/https://content.slq.qld.gov.au/graphql?__wb_method=POST&version=1&sha256Hash=bbc6b42daa873234edec7063b3bc947f2e0164bfb86aa37bedf82fd8b3b64e78&limit=12&label=type&value=article&label.2_=aggregated_field&value.2_=1082&field=publishedOn&type=DESC&offset=0'
{"data":{"menu":{"id":"utility-bar","items":[{"children":[],"inActiveTrail":false, ...

I tried enabling extraConfig.noPostToGet to see if wabac.js would pass the POST request through to pywb. But it sent the POST requests with empty bodies because LiveProxy.allowBody is false with isLive: false. Just out of curiosity I hacked the minified sw.js to set allowBody to true and that worked and resulted in it behaving basically the same as server-side replay mode: the list of articles loads, clicking an article gives a 'Page not found' error but if you refresh then the article text loads fine.
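
To show what the POST-to-GET flattening does to this particular request, here is a simplified Python model that reproduces the query string from the second curl command above. It is not wabac.js's convertGetToPost() implementation: the function name is made up, and the rules for empty strings (dropped, since `keyword` is absent from the observed URL) and repeated keys (suffixed `.2_`, and by assumption `.3_` and so on) are inferred from that one example.

```python
import json
from urllib.parse import urlencode


def post_body_to_query(body: str) -> str:
    """Flatten a JSON POST body into a GET-style query string."""
    pairs = []
    seen = {}

    def add(key, value):
        # Repeated keys get a numeric suffix, e.g. label, label.2_
        seen[key] = seen.get(key, 0) + 1
        name = key if seen[key] == 1 else f"{key}.{seen[key]}_"
        pairs.append((name, value))

    def walk(node, key=""):
        if isinstance(node, dict):
            for child_key, child in node.items():
                walk(child, child_key)
        elif isinstance(node, list):
            for item in node:
                walk(item, key)
        elif isinstance(node, bool):
            add(key, "true" if node else "false")
        elif node is None or node == "":
            return  # assumption: empty values are dropped
        else:
            add(key, str(node))

    walk(json.loads(body))
    return "__wb_method=POST&" + urlencode(pairs)


QUERY_HASH = "bbc6b42daa873234edec7063b3bc947f2e0164bfb86aa37bedf82fd8b3b64e78"
GRAPHQL_BODY = json.dumps({
    "extensions": {"persistedQuery": {"version": 1, "sha256Hash": QUERY_HASH}},
    "variables": {
        "limit": 12,
        "filters": [{"label": "type", "value": "article"},
                    {"label": "aggregated_field", "value": "1082"}],
        "keyword": "",
        "sort": [{"field": "publishedOn", "type": "DESC"}],
        "offset": 0,
    },
})
```

Calling `post_body_to_query(GRAPHQL_BODY)` gives the same parameters as the `ir_` request above. Note that the nesting is lost in the process: nothing in the flat form says that `label.2_` belongs to a filter while `type` belongs to the sort order. Whether that loss is what makes the lookup land on the menu response is not established here.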

@ikreymer
Member

ikreymer commented Apr 4, 2025

I dug into this regression example a bit more https://www.slq.qld.gov.au/blog/john-oxley-library And indeed it's getting the wrong response to the graphql query that loads the articles.

@ato thanks again! Added a fix that should add noPostToGet and keep allowBody true when that's set. See if it works now with latest here!

@ato
Contributor

ato commented Apr 4, 2025

Yep, confirming with those changes client-side replay is now at parity with the server-side replay for the SLQ blog example. That was the last remaining regression example I had, so this is looking very promising. :-)

@tw4l tw4l force-pushed the issue-924-client-side-playback branch from 5b0fcdc to 5ae00a2 Compare April 23, 2025 20:40
@ikreymer
Member

Thanks for the feedback.

Thanks all for the effort/work so far. I've managed to do a little bit of testing on some problem sites/issues that we have,

May have a fix for the former, which may be dependent on devicePixelRatio, which probably should default to 2 now instead of 1 as was traditionally the case. The 'waipuhighlandgames' site seems a bit tricky, it probably picks different image based on screen dimensions.

Not able to repro these - tested with both and it seemed to be fine. Perhaps the service worker wasn't active when it was being tested? Or if you have a specific page can take a look.

…ker mode, if service workers are not available,

eg. check for navigator.serviceWorker and if null, don't attempt to init sw-based path
@ikreymer ikreymer force-pushed the issue-924-client-side-playback branch from ed93cce to 18f1326 Compare April 24, 2025 18:32
@ikreymer ikreymer merged commit 3081e0f into main Apr 24, 2025
4 of 6 checks passed
@ikreymer ikreymer deleted the issue-924-client-side-playback branch April 24, 2025 18:44
annesiri added a commit to statisticsnorway/ssbno-pywb that referenced this pull request May 9, 2025
* Allow to configure uWSGI mount via environment variable (webrecorder#926)

* Introduce UWSGI_MOUNT env var

* Add a note to the documentation.

* Refuse to serve static files that are outside of static_dir (webrecorder#932)

Prevents the path traversal attack reported in webrecorder#931

* version: bump to 2.8.4

* Fix tests, support py3.9, 3.10, 3.11 (webrecorder#933)

- tests: fix or disable tests that no longer work reliably, eg. depend on external sites
- support python 3.9, 3.10, 3.11 in tests for now
- bump version to 2.9.0-beta.0

* Add optional client-side playback to pywb (webrecorder#928)

This PR adds optional client-side replay in pywb's framed replay mode, using wabac.js. This is implemented using wabac.js's live proxy mode, similar to the implementation by Alex Osborne's proof of concept and enabled via the config.yaml file. Documentation has also been added.

The service worker proxies to the original pywb URLs and allows for 'graceful fallback' if service workers are not supported.

Client side replay can be enabled by setting `client_side_replay: true` in config.yaml

The wabac.js service worker is added to the pywb static directory at installation time via setup.py. The wabac.js version can be bumped via a constant in that file (current version is 2.22.12)

In addition, a few small housekeeping changes are also included:
- The Python version in the pywb Dockerfile is updated to 3.11 to avoid using an unsupported version of Python
- Similarly, CI now runs on Python 3.9-3.11 to drop older versions that are no longer supported in GH Actions runners
- wombat updated to latest 2.8.10

bump version to 2.9.0-beta.0
---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>

* Update README.rst

* Fix py3.9 CI (webrecorder#934)

* ci: attempt to fix tests for 3.9 by skipping test that intermittently hang.

* simplify: use existing prefix as archivePrefix, fixes webrecorder#937 (webrecorder#938)

update to wabac.js 2.22.15
bump to 2.9.0b1

---------

Co-authored-by: Natanael Arndt <arndtn@gmail.com>
Co-authored-by: Alex Osborne <aosborne@nla.gov.au>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
annesiri added a commit to statisticsnorway/ssbno-pywb that referenced this pull request May 26, 2025
* added the pvc file

* MIIM-2185-probes (#13)

* probes

* Fetching changes from upstream repo (#14)

* fix live and ready (#15)

* added updating docker image in Dockerfile (#16)

* Misc updates to reduce vulnerabilities (#17)

* Experimental cleanup

* a few more upgrades

* removed comment

---------

Co-authored-by: Rune Johansen <runejo@gmail.com>

---------

Co-authored-by: Carl-OW <142233642+Carl-OW@users.noreply.github.com>
Co-authored-by: Natanael Arndt <arndtn@gmail.com>
Co-authored-by: Alex Osborne <aosborne@nla.gov.au>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Rune Johansen <runejo@gmail.com>
Development

Successfully merging this pull request may close these issues.

Add client-side playback as option to pywb
4 participants