Description
Issue
The National Diet Library, Japan (NDL, https://warp.ndl.go.jp/) is transitioning from OpenWayback to pywb. During this transition, we identified a significant issue with SURT generation and pywb's lookup functionality that prevents us from replaying archived content accurately.
When generating a CDXJ using cdxj-indexer from URLs like the following:
The www or www{number} subdomain (e.g., www2, www3, www4) is currently removed during SURT generation. As a result, both URLs are converted into the same SURT:
jp,go,example)/
Consequently, even when users intentionally try to distinguish between URLs a and b when accessing them via pywb, pywb treats them as the same URL. This results in content associated with both a and b being displayed interchangeably, leading to incorrect replay behavior.
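To make the collapse concrete, here is a minimal standard-library sketch of the behavior described above. It only approximates the www/www{number} stripping; `simple_surt` and `WWW_PREFIX` are names made up for this illustration and are not the actual cdxj-indexer or pywb code. The two URLs are taken from the CDXJ examples at the end of this issue.

```python
import re
from urllib.parse import urlsplit

# Simplified approximation of the host handling described in this issue.
# This is NOT the real cdxj-indexer / pywb canonicalizer.
WWW_PREFIX = re.compile(r"^www\d*\.")

def simple_surt(url: str) -> str:
    """Build a SURT-like key: reversed host labels, then ')' and the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = WWW_PREFIX.sub("", host)  # drops www., www2., www3., ...
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    return f"{reversed_host}){path}"

# Both hosts end up under the same key, so a lookup cannot tell them apart.
print(simple_surt("http://www.mlit.go.jp/"))
print(simple_surt("http://www3.mlit.go.jp/"))
```

Because the key is the only thing the index is sorted and searched by, any two captures that share it are treated as captures of one URL.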
In Japan, it is common practice to use URLs that differ only in the www/www{number} subdomain to serve different content, not for purposes like load balancing or mirroring. This is a critical distinction for the accuracy of our web archives.
Examples from the National Diet Library’s Web Archiving Project (WARP) via OpenWayback:
- https://warp.ndl.go.jp/info:ndljp/pid/7860027/www.mlit.go.jp
- https://warp.ndl.go.jp/info:ndljp/pid/7860027/www3.mlit.go.jp
Request
To address this, we request the implementation of the following features as optional settings:
- In cdxj-indexer: An option to generate SURTs that preserve and distinguish between www and www{number} subdomains.
- In pywb: Support for looking up SURTs generated with the above distinction.
While this concern might also be appropriate for the cdxj-indexer issue tracker, NDL believes that indexing (cdxj-indexer) and lookup (pywb) need to be addressed together. Therefore, we are starting the discussion here.
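The following sketch illustrates why we think both sides need the same setting. It is a toy model only: `make_key`, `build_index`, and the `keep_www` flag are hypothetical names for this illustration, not existing options in cdxj-indexer or pywb, and the key format is a simplification of a real SURT.

```python
import re
from urllib.parse import urlsplit

WWW_PREFIX = re.compile(r"^www\d*\.")

def make_key(url: str, keep_www: bool) -> str:
    """Hypothetical key function; keep_www stands in for the requested option."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not keep_www:
        host = WWW_PREFIX.sub("", host)  # current behavior: strip www/www{number}
    path = parts.path or "/"
    return ",".join(reversed(host.split("."))) + ")" + path

def build_index(urls, keep_www: bool) -> dict:
    """Toy stand-in for the indexer: map each key to the URLs filed under it."""
    index = {}
    for url in urls:
        index.setdefault(make_key(url, keep_www), []).append(url)
    return index

urls = ["http://www.mlit.go.jp/", "http://www3.mlit.go.jp/"]
index = build_index(urls, keep_www=True)

# Lookup with the same setting returns only the requested host.
same_setting = index.get(make_key("http://www3.mlit.go.jp/", keep_www=True))

# Lookup that still strips www builds a different key and finds nothing.
mixed_setting = index.get(make_key("http://www3.mlit.go.jp/", keep_www=False))

print(same_setting, mixed_setting)
```

In this model, an index built with the subdomain preserved is only usable if the lookup side builds its keys the same way, which is why an indexer-only change would not be sufficient.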
Environment
pywb version
2.9.0 beta 0
pywb config (config.yaml)
enable_memento: false
framed_replay: false
static_prefix: plugin/pywb
collections:
- {col}
CDXJ generation tool
cdxj-indexer 1.4.5
Command used to generate CDXJ:
cdxj-indexer -f {files} -o {output_dir} {input_archive_dir}
Actual CDXJ examples showing identical SURTs for different www/www{number} URLs
Example 1
jp,go,mlit)/ 20091021062438 {"digest":"sha1:A6BR2EG7TUQWFWXTLRIDBRLERRLLITQV","filename":"WEB_330483_000441.warc.gz","length":"5100","offset":"0","url":"http://www.mlit.go.jp/"}
jp,go,mlit)/ 20091021062442 {"digest":"sha1:AQZ2VPHSC4JJRR6N4GWEVMFCUGQRFH4A","filename":"WEB_330483_000441.warc.gz","length":"3605","offset":"6917","url":"http://www3.mlit.go.jp/"}
Example 2
jp,aichi,pref)/ 20220907055832 {"digest":"sha1:25HFT6IGZEUDHV376ZRQ5Q2HC77KFPYQ","filename":"WEB_72991066_000005.warc.gz","length":"508","offset":"10249642","url":"https://www4.pref.aichi.jp/"}
jp,aichi,pref)/ 20220907032715 {"digest":"sha1:F5AVG43MODB7A42GPKJLOE5D5JSQUOPW","filename":"WEB_74395716_000040.warc.gz","length":"11700","offset":"55784","url":"https://www.pref.aichi.jp/"}
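For anyone who wants to check whether their own index is affected, a rough way to find such collisions is to group CDXJ lines by SURT key and report keys whose records come from more than one host. This is a sketch under the assumption that each line has the form `key timestamp {json}` with a `url` field, as in the examples above; `colliding_hosts` is a helper name invented here, and the sample lines are copied from Example 1.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit

def colliding_hosts(cdxj_lines):
    """Return {surt_key: [hosts]} for keys shared by more than one host."""
    hosts_by_key = defaultdict(set)
    for line in cdxj_lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "<surt key> <timestamp> <json payload>"
        key, _timestamp, payload = line.split(" ", 2)
        url = json.loads(payload)["url"]
        hosts_by_key[key].add(urlsplit(url).hostname)
    return {k: sorted(h) for k, h in hosts_by_key.items() if len(h) > 1}

sample = [
    'jp,go,mlit)/ 20091021062438 {"digest":"sha1:A6BR2EG7TUQWFWXTLRIDBRLERRLLITQV","filename":"WEB_330483_000441.warc.gz","length":"5100","offset":"0","url":"http://www.mlit.go.jp/"}',
    'jp,go,mlit)/ 20091021062442 {"digest":"sha1:AQZ2VPHSC4JJRR6N4GWEVMFCUGQRFH4A","filename":"WEB_330483_000441.warc.gz","length":"3605","offset":"6917","url":"http://www3.mlit.go.jp/"}',
]

print(colliding_hosts(sample))
```

Note that this flags every key with multiple hosts, including cases where www and the bare domain really do serve the same content, so the output is a list of candidates to review rather than a list of confirmed errors.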