Add tools to import from time-map based sources #180

@Mr0grog

We currently import information about captures from the Wayback Machine using its CDX API (which also underlies its time-map API) and have some ugly but battle-tested tooling for doing so: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/main/web_monitoring/cli/cli.py

However, there are other archives that it’s occasionally useful to pull from and which only support the more standardized time-map API. For example, this Jan 2 copy of https://www.energy.gov/justice/articles/state-energy-insecurity-data-visualization-tool was pulled from Archive-It but I had to add it manually.

I think this requires work in two places:

  1. The workflow would be extremely similar to how we use CDX with the IA Wayback Machine, so part of this is just generalizing that.

  2. The Wayback package should probably also be updated to do a lot of the low-level work there:

    • Support endpoints other than archive.org. We probably need a way to configure the endpoint(s) for WaybackClient (there are hardcoded URLs in a bunch of places that need to go) and maybe some presets or subclasses for well-known services.
    • Support getting results from a time-map instead of CDX search. We probably (?) want the same kind of lazy iterator output, but this may need some more noodling on the design.
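
To make the two sub-points above a bit more concrete, here is a rough sketch (not a proposed final design) of configurable endpoints plus a lazy iterator over a link-format time-map. Everything named here is hypothetical: `TIMEMAP_ENDPOINTS`, `timemap_url`, `Memento`, and `parse_timemap` are not part of the Wayback package, and the URL templates are my assumption about how each service exposes its time-maps. It also assumes each link in the time-map sits on a single line with quoted parameters.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Iterable, Iterator, NamedTuple

# Hypothetical presets for well-known services. The URL templates are
# assumptions and should be checked against each archive's documentation.
TIMEMAP_ENDPOINTS = {
    "wayback": "https://web.archive.org/web/timemap/link/{url}",
    "archive-it": "https://wayback.archive-it.org/all/timemap/link/{url}",
}


class Memento(NamedTuple):
    """One capture listed in a time-map (hypothetical record type)."""
    url: str
    timestamp: datetime


# Matches `<uri>` followed by any number of `; name="value"` parameters.
LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def timemap_url(service: str, url: str) -> str:
    """Build the time-map URL for `url` on one of the preset services."""
    return TIMEMAP_ENDPOINTS[service].format(url=url)


def parse_timemap(lines: Iterable[str]) -> Iterator[Memento]:
    """Lazily yield mementos from the lines of a link-format time-map.

    Links that are not mementos (original, self, timegate) are skipped.
    """
    for line in lines:
        for match in LINK_PATTERN.finditer(line):
            params = dict(PARAM_PATTERN.findall(match.group(2)))
            # `rel` can hold several values, e.g. "first memento".
            rels = params.get("rel", "").split()
            if "memento" in rels and "datetime" in params:
                yield Memento(
                    url=match.group(1),
                    timestamp=parsedate_to_datetime(params["datetime"]),
                )
```

Because `parse_timemap` is a generator, it can be fed a streamed response line by line (for example, `response.iter_lines(decode_unicode=True)` from `requests`), so it keeps the same lazy shape as the CDX search results. Whether the records should share a type with the CDX ones is exactly the part that still needs design work.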

For now, this is a nice-to-have. I’ve only had a serious need once, and the amount of data was small enough that I could just format the import call to web-monitoring-db by hand.
