Description
Describe the bug
When using OutbackCDX as an index server, the __wb_post_data parameter is not sent along with the URL in queries to the OutbackCDX server. On web pages that make multiple XHR POST requests to the same URL, replay then returns the wrong response. Using a local CDXJ file index works as expected.
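To illustrate the collision, here is a minimal sketch. The helper append_post_data is hypothetical and only approximates how a POST body can be folded into the lookup URL; the request bodies are made up, and the endpoint name is the one mentioned in this report.

```python
import base64


def append_post_data(url: str, body: bytes) -> str:
    """Hypothetical helper: fold a POST body into the lookup URL."""
    encoded = base64.b64encode(body).decode("ascii")
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}__wb_post_data={encoded}"


# Illustrative endpoint and made-up request bodies, not captured traffic.
url = "http://www.corona-data.ch/_dash-update-components"
bodies = [b'{"output": "chart-a"}', b'{"output": "chart-b"}']

# With the post data, every request body yields its own lookup URL.
keys_with_post_data = {append_post_data(url, body) for body in bodies}

# Without it, both requests collapse onto the same lookup URL.
keys_without_post_data = {url for _ in bodies}

print(len(keys_with_post_data), len(keys_without_post_data))
```

When the post data is missing from the query, the index cannot tell the two requests apart, which would explain why the wrong response is replayed.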
Steps to reproduce the bug
- Create a WARC of the public page at http://www.corona-data.ch/.
- Index it with OutbackCDX using a command similar to
cdx-indexer -p -s corona-data.warc.gz | curl -X POST --data-binary @- http://127.0.0.1:8078/collection
- Open the page in pywb replay -> most of the diagrams will stay white.
Expected behavior
The replayed POST requests should return the correct responses, so the diagrams can be drawn.
Screenshots
Replayed page with invalid (white) diagrams. The reason for this is that the CDX information for the POST requests to _dash-update-components is not passed with the query.
Environment
- pywb 2.4.2
Additional context
I tried to track this down to the _get_api_url function in warcserver/indexsource.py. The url used does not contain the __wb_post_data. FileIndexSource uses the key parameter. So I see the following options:
- Passing the key using the urlkey parameter of outbackcdx (and updating documentation)
- Adding __wb_post_data to the url parameter
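The two options could be sketched roughly as follows. This is not pywb code: query_with_urlkey and query_with_url are hypothetical helpers, the key and body are made up, and I am assuming that OutbackCDX accepts urlkey and url as query parameters. The collection endpoint is the one from the curl command above.

```python
import base64
from urllib.parse import parse_qs, urlencode, urlsplit

# Collection endpoint taken from the curl command in the steps to reproduce.
API = "http://127.0.0.1:8078/collection"


def query_with_urlkey(api: str, key: str) -> str:
    """Option 1 (hypothetical): send the precomputed key as urlkey."""
    return f"{api}?{urlencode({'urlkey': key})}"


def query_with_url(api: str, url_with_post_data: str) -> str:
    """Option 2 (hypothetical): send a url that already carries the post data."""
    return f"{api}?{urlencode({'url': url_with_post_data})}"


# Made-up POST body; real canonicalization may differ, for example in letter case.
post_data = base64.b64encode(b'{"a": 1}').decode("ascii")
key = f"ch,corona-data)/_dash-update-components?__wb_post_data={post_data}"
url = f"http://www.corona-data.ch/_dash-update-components?__wb_post_data={post_data}"

for query in (query_with_urlkey(API, key), query_with_url(API, url)):
    # The value survives the round trip because urlencode escapes '?', '=' and '/'.
    print(parse_qs(urlsplit(query).query))
```

Either way the value has to be percent-encoded, since the base64 payload can contain characters such as +, / and = that would otherwise be misread as part of the query string.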
There might also be other options to consider. Also, __wb_post_data was renamed to __warc_post_data in cdxj-indexer, so there may be more development going on. I'd be interested in contributing a fix, but I need some guidance as to the best way.
Update: a quote from the OutbackCDX page: "The canonicalized URL (first field) is ignored, OutbackCDX performs its own canonicalization." So indexing in OutbackCDX seems to ignore the __wb_post_data parameter, and this might need further evaluation/coordination.
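A small sketch of why that would lose the post data, under the assumption that cdx-indexer -p adds __wb_post_data to the first field only and leaves the original URL field unchanged. The CDX line below is made up for illustration and shortened to its first five fields.

```python
import base64

# Made-up POST body and capture details, for illustration only.
post_data = base64.b64encode(b'{"a": 1}').decode("ascii")
fields = [
    f"ch,corona-data)/_dash-update-components?__wb_post_data={post_data}",  # urlkey
    "20200401000000",                                     # timestamp
    "http://www.corona-data.ch/_dash-update-components",  # original URL
    "application/json",                                   # MIME type
    "200",                                                # status
]
cdx_line = " ".join(fields)


def split_cdx(line: str) -> tuple[str, str]:
    """Return the first field (urlkey) and the third field (original URL)."""
    urlkey, _timestamp, original, *_rest = line.split(" ")
    return urlkey, original


urlkey, original = split_cdx(cdx_line)

# An indexer that discards the first field and rebuilds its key from the
# original URL never sees the post data.
print("__wb_post_data" in urlkey, "__wb_post_data" in original)
```

If that assumption holds, fixing only the query side in pywb would not be enough, because the records stored in OutbackCDX would already lack the post data.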