Support distinguishing www and www{number} in SURT generation and pywb's lookup #943

@KazuhiroAND

Description

Issue

The National Diet Library, Japan (NDL, https://warp.ndl.go.jp/) is transitioning from OpenWayback to pywb. During this transition, we have identified a significant issue in SURT generation and pywb’s lookup functionality that affects our ability to accurately replay archived content.

When generating a CDXJ using cdxj-indexer from URLs like the following:

The www/www{number} (e.g., www2, www3, www4, etc.) subdomain is currently removed during SURT generation. This results in both URLs being converted into the same SURT:

jp,go,example)/

Consequently, even when users intentionally try to distinguish between URLs a and b when accessing them via pywb, pywb treats them as the same URL. This results in content associated with both a and b being displayed interchangeably, leading to incorrect replay behavior.
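To illustrate the collapse, here is a minimal sketch in Python. It is a simplified stand-in for SURT host canonicalization, not the code used by cdxj-indexer or the surt library; the function name `simple_surt` and the `strip_www` parameter are hypothetical, and the URLs are taken from the CDXJ examples in the Environment section of this issue.

```python
import re
from urllib.parse import urlsplit

# Assumption: "www" optionally followed by digits is treated as a removable prefix.
_WWW_PREFIX = re.compile(r"^www\d*\.")


def simple_surt(url: str, strip_www: bool = True) -> str:
    """Build a simplified SURT-like key: reversed host labels, then ')' and the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if strip_www:
        # This is the step that makes www and www{number} hosts indistinguishable.
        host = _WWW_PREFIX.sub("", host)
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{reversed_host}){parts.path or '/'}"


print(simple_surt("http://www.mlit.go.jp/"))                     # jp,go,mlit)/
print(simple_surt("http://www3.mlit.go.jp/"))                    # jp,go,mlit)/
print(simple_surt("http://www3.mlit.go.jp/", strip_www=False))   # jp,go,mlit,www3)/
```

With stripping enabled, both hosts map to the same key, which matches the behavior described above. With stripping disabled, the subdomain label survives as the last host component.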

In Japan, it is common practice to use URLs that differ only in the www/www{number} subdomain to serve different content, not for purposes like load balancing or mirroring. This is a critical distinction for the accuracy of our web archives.

Examples from the National Diet Library’s Web Archiving Project (WARP) via OpenWayback:

Request

To address this, we request the implementation of the following features as optional settings:

  • In cdxj-indexer: An option to generate SURTs that preserve and distinguish between www and www{number} subdomains.
  • In pywb: Support for looking up SURTs generated with the above distinction.
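The reason both sides need the option can be sketched with a toy in-memory index. This is only an illustration under assumptions: `ToyIndex`, `host_key`, and `preserve_www` are hypothetical names, and none of this reflects the actual pywb or cdxj-indexer internals. The point is that the same key function must be applied when indexing and when looking up.

```python
import re
from urllib.parse import urlsplit

WWW_PREFIX = re.compile(r"^www\d*\.")


def host_key(url, preserve_www):
    """Simplified SURT-like key; drops www/www{number} unless preserve_www is set."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not preserve_www:
        host = WWW_PREFIX.sub("", host)
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")


class ToyIndex:
    """Toy capture index; the same preserve_www setting is used for add and lookup."""

    def __init__(self, preserve_www=False):
        self.preserve_www = preserve_www
        self._captures = {}

    def add(self, url, timestamp):
        key = host_key(url, self.preserve_www)
        self._captures.setdefault(key, []).append((timestamp, url))

    def lookup(self, url):
        key = host_key(url, self.preserve_www)
        return sorted(self._captures.get(key, []))


default_index = ToyIndex()
strict_index = ToyIndex(preserve_www=True)
for index in (default_index, strict_index):
    index.add("http://www.mlit.go.jp/", "20091021062438")
    index.add("http://www3.mlit.go.jp/", "20091021062442")

print(len(default_index.lookup("http://www3.mlit.go.jp/")))  # 2: both hosts are mixed
print(len(strict_index.lookup("http://www3.mlit.go.jp/")))   # 1: only the www3 capture
```

If only the indexer preserved the subdomain while the lookup still stripped it, the lookup key would never match the stored key, which is why the two features are requested together.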

While this concern might also be appropriate for the cdxj-indexer issue tracker, NDL believes both indexing (cdxj-indexer) and lookup (pywb) functionalities need to be addressed collaboratively. Therefore, we are initiating this discussion here.

Environment

pywb version

2.9.0 beta 0

pywb config (config.yaml)

enable_memento: false
framed_replay: false
static_prefix: plugin/pywb
collections:
 - {col}

CDXJ generation tool

cdxj-indexer 1.4.5

Command used to generate CDXJ:

cdxj-indexer -f {files} -o {output_dir} {input_archive_dir}

Actual CDXJ examples showing identical SURTs for different www/www{number} URLs

Example 1

jp,go,mlit)/ 20091021062438 {"digest":"sha1:A6BR2EG7TUQWFWXTLRIDBRLERRLLITQV","filename":"WEB_330483_000441.warc.gz","length":"5100","offset":"0","url":"http://www.mlit.go.jp/"}
jp,go,mlit)/ 20091021062442 {"digest":"sha1:AQZ2VPHSC4JJRR6N4GWEVMFCUGQRFH4A","filename":"WEB_330483_000441.warc.gz","length":"3605","offset":"6917","url":"http://www3.mlit.go.jp/"}

Example 2

jp,aichi,pref)/ 20220907055832 {"digest":"sha1:25HFT6IGZEUDHV376ZRQ5Q2HC77KFPYQ","filename":"WEB_72991066_000005.warc.gz","length":"508","offset":"10249642","url":"https://www4.pref.aichi.jp/"}
jp,aichi,pref)/ 20220907032715 {"digest":"sha1:F5AVG43MODB7A42GPKJLOE5D5JSQUOPW","filename":"WEB_74395716_000040.warc.gz","length":"11700","offset":"55784","url":"https://www.pref.aichi.jp/"}
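For anyone who wants to check an existing index for this kind of collision, the following is a rough Python sketch. It assumes each CDXJ line has the form `<surt> <timestamp> <json>` as in the examples above; the function name `find_host_collisions` is hypothetical, and the sample lines are copied from Example 1.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def find_host_collisions(cdxj_lines):
    """Return SURT keys whose records were captured from more than one hostname."""
    hosts_by_surt = defaultdict(set)
    for line in cdxj_lines:
        line = line.strip()
        if not line:
            continue
        # Split into SURT key, timestamp, and the JSON block (which may contain spaces).
        surt_key, _timestamp, json_blob = line.split(" ", 2)
        record = json.loads(json_blob)
        hosts_by_surt[surt_key].add(urlsplit(record["url"]).hostname)
    return {key: sorted(hosts) for key, hosts in hosts_by_surt.items() if len(hosts) > 1}


SAMPLE = [
    'jp,go,mlit)/ 20091021062438 {"digest":"sha1:A6BR2EG7TUQWFWXTLRIDBRLERRLLITQV","filename":"WEB_330483_000441.warc.gz","length":"5100","offset":"0","url":"http://www.mlit.go.jp/"}',
    'jp,go,mlit)/ 20091021062442 {"digest":"sha1:AQZ2VPHSC4JJRR6N4GWEVMFCUGQRFH4A","filename":"WEB_330483_000441.warc.gz","length":"3605","offset":"6917","url":"http://www3.mlit.go.jp/"}',
]

for surt_key, hosts in find_host_collisions(SAMPLE).items():
    print(surt_key, hosts)
```

For the sample above, this reports the key `jp,go,mlit)/` together with the hosts `www.mlit.go.jp` and `www3.mlit.go.jp`.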
