Description
We currently import information about captures from the Wayback Machine using their CDX API (which also underlies their time-map API) and have some ugly but battle-tested tooling for doing so: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/main/web_monitoring/cli/cli.py
However, there are other archives that it’s occasionally useful to pull from and which only support the more standardized time-map API. For example, this Jan 2 copy of https://www.energy.gov/justice/articles/state-energy-insecurity-data-visualization-tool was pulled from Archive-It but I had to add it manually.
I think this requires work in two places:
- The workflow would be extremely similar to how we use CDX with the IA Wayback Machine, so part of this is just generalizing that.
- The Wayback package should probably also be updated to do a lot of the low-level work there:
  - Support endpoints other than `archive.org`. We probably need a way to configure the endpoint(s) for `WaybackClient` (there are hardcoded URLs in a bunch of places that need to go) and maybe some presets or subclasses for well-known services.
  - Support getting results from a time-map instead of CDX search. We probably (?) want the same kind of lazy iterator output, but this may need some more noodling on the design.
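
To make the time-map idea a bit more concrete, here is a rough sketch of what a lazy iterator over a link-format time-map could look like. Nothing here exists in the Wayback package today: `Memento`, `timemap_url`, and `iter_mementos` are hypothetical names, the `/timemap/link/` path layout is an assumption that would need checking per archive, and the sample data is made up. It also assumes each time-map entry sits on a single line.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Memento:
    """Hypothetical record type; not part of the wayback package."""
    url: str
    timestamp: datetime


# One link-format entry: <uri> followed by ;-separated attr="value" pairs.
# A regex is used instead of splitting on commas because the datetime
# attribute itself contains a comma.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTRIBUTE = re.compile(r'([\w-]+)="([^"]*)"')


def timemap_url(endpoint: str, url: str) -> str:
    """Build a time-map URL for a configurable endpoint.

    The "/timemap/link/" path is an assumption, not a confirmed layout.
    """
    return f"{endpoint.rstrip('/')}/timemap/link/{url}"


def iter_mementos(lines: Iterable[str]) -> Iterator[Memento]:
    """Lazily yield mementos from the lines of a link-format time-map."""
    for line in lines:
        for match in ENTRY.finditer(line):
            attributes = dict(ATTRIBUTE.findall(match.group(2)))
            relations = attributes.get("rel", "").split()
            # Skip "original", "self", and "timegate" entries.
            if "memento" in relations and "datetime" in attributes:
                yield Memento(
                    url=match.group(1),
                    timestamp=parsedate_to_datetime(attributes["datetime"]),
                )


# Made-up sample time-map for a placeholder archive.
sample = [
    '<https://example.org/page>; rel="original",',
    '<https://archive.example/wayback/timemap/link/https://example.org/page>'
    '; rel="self"; type="application/link-format",',
    '<https://archive.example/wayback/20240102030405/https://example.org/page>'
    '; rel="first memento"; datetime="Tue, 02 Jan 2024 03:04:05 GMT",',
    '<https://archive.example/wayback/20240203040506/https://example.org/page>'
    '; rel="last memento"; datetime="Sat, 03 Feb 2024 04:05:06 GMT"',
]

for memento in iter_mementos(sample):
    print(memento.timestamp.isoformat(), memento.url)
```

Because `iter_mementos` accepts any iterable of lines, it could be fed from a streamed HTTP response so that large time-maps are never held in memory all at once, which is roughly the behavior we get from CDX search now.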
For now, this is a nice-to-have. I’ve only had a serious need once, and the amount of data was small enough that I could just format the import call to web-monitoring-db by hand.