Finding aids "v2" #1967

thisisaaronland · 2021-10-29T01:09:44Z

thisisaaronland
Oct 29, 2021
Maintainer

Paging @nvkelso @stepps00 @missinglink @tomtaylor @vicchi for comments. Not required, but welcome.

I am working through a fresh take on the WOF "finding aid" model. To recap briefly: a WOF finding aid maps a given ID to its corresponding whosonfirst-data repository.

The use case is something like the go-whosonfirst-browser which doesn't have a database of IDs but instead uses one or more go-reader instances to retrieve records. That is: The go-whosonfirst-browser doesn't actually know anything about where the data is coming from. It lets the "reader" handle all those details.

(Remember: The go-whosonfirst-browser does not have the search functionality of something like the Spelunker; it is primarily a tool for rendering any given known ID in a number of formats.)

One goal with the finding aids has been to create a "finding aid reader" that when given an ID would look up its corresponding repository and fetch the data over the wire from GitHub. That way the go-whosonfirst-browser could run with a minimal footprint (read: No database with a bazillion WOF records).

Version "1" of the finding aid code stored finding aids as blobs of JSON in S3, using URI/naming conventions similar to those of WOF records.

Version "2" of the finding aid code aims to move away from this model and instead publish pre-compiled indices that can be stored in a whosonfirst-data repository. These files would then be downloaded and indexed according to application-specific rules.

The source code (WIP) is here, but keep in mind it is lacking proper documentation right now:

https://github.com/whosonfirst/go-whosonfirst-findingaid/tree/v2

For example, this is me creating a CSV finding aid for the sfomuseum-data-maps repo, fetching the data directly from GitHub:

$> ./bin/populate \
    -iterator-uri git:///tmp \
    -provider-uri 'github://sfomuseum-data?prefix=sfomuseum-data-maps'
2021/10/28 20:08:55 time to index paths (1) 2.408854633s

$> tar -tf archive.tar.gz 
catalog.csv
sources.csv

Data is processed using the whosonfirst/go-whosonfirst-iterate package which means that it has the ability to filter records to be included (or excluded) using property filters. By default finding aids are assumed to contain "all the pointers" but this allows purpose-fit finding aids to be created. For example a finding aid for only records of a given placetype.

Iterators are separate from source "providers". The former iterate over records in a given source; the latter generates a list of sources to iterate over.
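To make the provider/iterator split concrete, here is a minimal sketch. It is not the go-whosonfirst-iterate API: the function names (provider, iterate, build_pointers), the repo names and the sample records are all hypothetical, and an in-memory dictionary stands in for real repositories.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-in for a set of repositories and the records they hold.
SOURCES: Dict[str, List[dict]] = {
    "example-a": [
        {"wof:id": 1, "wof:placetype": "locality"},
        {"wof:id": 2, "wof:placetype": "region"},
    ],
    "example-b": [{"wof:id": 3, "wof:placetype": "locality"}],
    "other-c": [{"wof:id": 4, "wof:placetype": "locality"}],
}


def provider(prefix: str) -> Iterator[str]:
    """Provider role: yield the names of the sources to iterate over."""
    for name in sorted(SOURCES):
        if name.startswith(prefix):
            yield name


def iterate(source: str, include: Callable[[dict], bool]) -> Iterator[dict]:
    """Iterator role: yield the records in one source that pass the filter."""
    for props in SOURCES[source]:
        if include(props):
            yield props


def build_pointers(
    prefix: str, include: Callable[[dict], bool] = lambda props: True
) -> List[Tuple[int, str]]:
    """Combine both roles into a list of (WOF ID, repo name) pointers."""
    return [
        (props["wof:id"], source)
        for source in provider(prefix)
        for props in iterate(source, include)
    ]
```

With no filter, build_pointers("example-") returns every pointer in the matching sources; passing a predicate on "wof:placetype" yields a purpose-fit finding aid for a single placetype.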

The finding aid model has two "tables". One is to store the WOF ID lookup and looks like this:

whosonfirst_id, repo_id

And one to store the repo ID and its corresponding name:

repo_name, repo_id

The idea being that storing string repo names for every record is a waste of space and processing time. Although most finding aids will probably map to a single WOF repo, it is possible for a finding aid to contain pointers to records from multiple repositories.
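A minimal sketch of the two-table model using SQLite in memory. The table names follow catalog.csv and sources.csv; the column names (id, repo_id, name) and the helper functions are assumptions for illustration, not the actual go-whosonfirst-findingaid schema.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_findingaid(pointers: Iterable[Tuple[int, str]]) -> sqlite3.Connection:
    """Build an in-memory finding aid from (whosonfirst_id, repo_name) pairs."""
    db = sqlite3.connect(":memory:")
    # Each repo name is stored once and referenced by a small integer ID.
    db.execute("CREATE TABLE sources (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    db.execute("CREATE TABLE catalog (id INTEGER PRIMARY KEY, repo_id INTEGER)")
    for wof_id, repo_name in pointers:
        db.execute("INSERT OR IGNORE INTO sources (name) VALUES (?)", (repo_name,))
        (repo_id,) = db.execute(
            "SELECT id FROM sources WHERE name = ?", (repo_name,)
        ).fetchone()
        db.execute(
            "INSERT OR REPLACE INTO catalog (id, repo_id) VALUES (?, ?)",
            (wof_id, repo_id),
        )
    db.commit()
    return db


def find_repo(db: sqlite3.Connection, wof_id: int) -> Optional[str]:
    """Resolve a WOF ID to its repo name, or None if the ID is not indexed."""
    row = db.execute(
        "SELECT s.name FROM catalog c JOIN sources s ON s.id = c.repo_id "
        "WHERE c.id = ?",
        (wof_id,),
    ).fetchone()
    return row[0] if row else None
```

A lookup is a single join, and a finding aid that spans several repositories only grows the (small) sources table by one row per repo.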

As of this writing there are three different pre-compiled indices:

Protobuffers - These are compact and efficient but prone to out-of-memory errors when producing an index for a repo with lots of records. Fundamentally, since the protobuffer has to be held in memory before being written to disk, there will always be an "upper limit" problem.
SQLite databases - These are reasonably fast to create and reasonably compact uncompressed. The set of SQLite finding aids for all of the whosonfirst-data-admin- repos is, uncompressed, 77MB.
CSV "archives" - This is a tar archive consisting of two files: catalog.csv and sources.csv. These are faster to create than the SQLite databases and much smaller. I am still doing an initial run of CSV archives for the whosonfirst-data-admin repositories but if the SQLite database for China is 10MB the CSV archive is only 1.8MB.
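As an illustration of the CSV archive layout, the sketch below writes and reads a gzipped tar archive containing catalog.csv and sources.csv. The column order follows the two tables described above; the absence of a header row and the helper names are assumptions, not a description of what the populate tool emits.

```python
import csv
import io
import tarfile
from typing import Dict, List


def write_archive(catalog_rows: List[list], source_rows: List[list]) -> bytes:
    """Pack catalog.csv and sources.csv into a gzipped tar archive (in memory)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, rows in (("catalog.csv", catalog_rows), ("sources.csv", source_rows)):
            text = io.StringIO()
            csv.writer(text).writerows(rows)
            data = text.getvalue().encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_archive(blob: bytes) -> Dict[str, List[List[str]]]:
    """Return the rows of every CSV file in the archive, keyed by file name."""
    tables: Dict[str, List[List[str]]] = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        for member in tar.getmembers():
            text = tar.extractfile(member).read().decode("utf-8")
            tables[member.name] = list(csv.reader(io.StringIO(text)))
    return tables
```

Because both files are plain text with a lot of repeated digits, they compress well, which is consistent with the size difference reported above.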

Right now, I am inclined to:

Favour the CSV archives because they are so much smaller and by extension faster to download over the wire. There is already a handy csv2sql tool for populating a local SQLite database from (n) CSV archives but I haven't done timings yet.
Favour publishing per-repo indices rather than, say, a single index for all the admin repos, all the postalcode repos, and so on. The reason is mostly that only a small subset of repos get changed in any given window of time. It will be easier and faster to automate updates by querying GitHub directly for repos that have changed since (n) and rebuilding only those datasets. Some repos, like France or China or the US, take a while to index but most of the others happen in minutes or seconds; this would allow us to run a plain-vanilla container on a schedule rather than adding more code to trigger (and maintain) actions on commit.
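The "rebuild only what changed since (n)" step can be sketched as a simple timestamp filter. This is a sketch under stated assumptions: the repo-to-timestamp mapping is hypothetical sample data (in practice it would come from a GitHub query), and the function names are invented for illustration.

```python
from typing import Dict, List


def cutoff(now: int, offset: int) -> int:
    """Unix timestamp marking the start of the update window."""
    return now - offset


def repos_to_rebuild(last_updated: Dict[str, int], now: int, offset: int) -> List[str]:
    """Names of repos whose last update falls inside the window, sorted."""
    since = cutoff(now, offset)
    return sorted(name for name, ts in last_updated.items() if ts >= since)
```

For example, with a seven-day window (604800 seconds) only repos touched in the last week are re-indexed; everything else keeps its existing finding aid.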

It would be easy enough to create a whosonfirst-data/findingaid repo but I am wondering whether it makes sense to store them in this repo (whosonfirst-data/whosonfirst-data) ?

Thoughts?

thisisaaronland · 2021-10-29T03:21:00Z

thisisaaronland
Oct 29, 2021
Maintainer Author

Update: All the whosonfirst-data-admin- pointers encoded in CSV archives:

$> du -h -d 1 /usr/local/data/findingaid/csv/
15M     /usr/local/data/findingaid/csv/

$> time ./bin/csv2sql -database-uri 'sql://sqlite3?dsn=admin.db' /usr/local/data/findingaid/csv/*.gz

real	1m49.170s
user	1m31.838s
sys	0m22.015s

$> sqlite3 admin.db 
SQLite version 3.7.17 2013-05-20 00:56:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> SELECT COUNT(id) FROM catalog;
4930544

$> du -h admin.db 
81M	admin.db

0 replies

thisisaaronland · 2021-10-29T15:24:12Z

thisisaaronland
Oct 29, 2021
Maintainer Author

Admin data and postal data (CSV archives):

$> du -h -d 1 /usr/local/data/findingaid/csv/
20M	/usr/local/data/findingaid/csv/

0 replies

thisisaaronland · 2021-10-29T17:15:47Z

thisisaaronland
Oct 29, 2021
Maintainer Author

A working implementation of a finding aid "reader":

$> ./bin/read \
	-reader-uri 'findingaid://?dsn=/usr/local/data/findingaids/wof.db' \
	102527513 \
	| jq '.["properties"]["wof:name"]'

"San Francisco International Airport"

Where wof.db is a 20MB SQLite database containing all the WOF admin and postalcode records.

Under the hood the code uses a URI template to construct an HTTP "reader" from the corresponding repository's absolute URL for the data folder. The WOF ID is converted into a (nested) relative URL which is then passed to the HTTP reader:

https://github.com/whosonfirst/go-reader-findingaid/blob/main/findingaid.go#L86-L150
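A minimal sketch of the ID-to-URL step, assuming the usual WOF layout in which an ID is split into groups of up to three digits to form the nested path. Plain string substitution stands in for real URI template expansion here, and the function names are hypothetical; see the linked source for what the package actually does.

```python
def id_to_rel_path(wof_id: int) -> str:
    """Convert a WOF ID into its nested relative path."""
    digits = str(wof_id)
    # Split into chunks of three digits; the last chunk may be shorter.
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "/".join(parts) + f"/{digits}.geojson"


def record_url(template: str, repo: str, wof_id: int) -> str:
    """Fill in the {repo} placeholder and append the record's relative path."""
    return template.replace("{repo}", repo) + id_to_rel_path(wof_id)
```

Under these assumptions, id_to_rel_path(102527513) gives 102/527/513/102527513.geojson, and a ten-digit ID such as 1360391327 gets a fourth, single-digit directory: 136/039/132/7/1360391327.geojson.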

Because URI templates are used to define new go-reader instances, the finding aid reader is not limited to HTTP retrievals but can use any reader packages that have been imported:

https://github.com/whosonfirst/go-reader#available-readers

0 replies

thisisaaronland · 2021-10-29T17:45:33Z

thisisaaronland
Oct 29, 2021
Maintainer Author

Admin and postal code finding aid data is now available here:

https://github.com/whosonfirst-data/whosonfirst-findingaids

0 replies

nvkelso · 2021-11-03T04:14:33Z

nvkelso
Nov 3, 2021
Collaborator

It would be easy enough to create a whosonfirst-data/findingaid repo but I am wondering whether it makes sense to store them in this repo (whosonfirst-data/whosonfirst-data) ?

How big would they be? Either way I think whosonfirst-data/findingaid (or similar) is more appropriate. After all we have a million repos already, why not create another one? :)

3 replies

thisisaaronland Nov 3, 2021
Maintainer Author

The (CSV archive) finding aids for all the admin and postalcode data are 19MB. The finding aids for venues are about 40MB.

The admin/postalcode data when exported to a SQLite finding aid end up being about 100MB. I have a working prototype of the whosonfirst-browser working with one of those databases stored on disk.

The next step is figuring out how/where to update the whosonfirst-browser, running in AWS, so that it has a copy of a finding aid database that can be queried and updated to reflect changes to the finding aids. Options include:

Lambda (APIGW) is problematic because it adds significant costs to startup time in order to retrieve the data (from say S3) not to mention adding those costs for every request.
AppRunner seems like it might be ideal but I don't know what the persistence of data written to disk is across invocations. Do the container instances that AppRunner fronts get torn down if they are silent for (n) minutes? Does CloudFront play nicely with AppRunner?
ECS/Fargate. I would prefer not to go down this road because it's such a nuisance to configure.
The whosonfirst-browser could bundle the CSV archive files using the Go embed package and simply create the SQLite database on startup. On a standard twenty-teens laptop this takes about 1.x seconds but I don't know how long it would take in the "cloud". It also wouldn't reflect changes to the finding aids without extra code and triggers to update the data in the whosonfirst-browser package and then the Lambda function itself.
Going the Lambda (APIGW) route but querying the (SQLite) database using the psanford/sqlite3vfshttp package. This may be prohibitively slow and there would be no benefit to caching things in memory since it would be an ephemeral Lambda function.
Not using the SQLite databases but instead writing the data to a DynamoDB table (something like the csv2sql utility but for DynamoDB) and updating the go-reader-findingaid package to query it. It might be possible to use the psanford/donutdb package to do all of this using the existing SQLite layer but I am inclined to think it would be better, long-term, just to add support where needed for DynamoDB.

Of that list:

The last option seems like it is the most realistic (and is portable since you can run your own instance of DynamoDB if necessary). It also means the whosonfirst-browser code doesn't need to concern itself with finding aid updates. That can happen out of band, in a separate process.
The second-to-last option seems like it is worth experimenting with if only to confirm/deny whether the query times are prohibitive.

micahwalter Nov 3, 2021

Interesting - I have a few thoughts:

I like the DynamoDB approach. It seems like the solution with the fewest parts and networking issues. You'd need a tool to update DynamoDB from your CSV data on GitHub. This seems like something you could easily do with Lambda on a periodic basis. You'd also need to update your whosonfirst-browser code to be able to read from DynamoDB, and you'd need an API Gateway to handle requests to whosonfirst-browser running in a Lambda function.
Another more complicated approach, but one that still uses the SQLite, would be to host the SQLite DB on an EFS volume. You'd have to configure your Lambdas for updating SQLite and for running whosonfirst-browser to connect to the VPC where your EFS volume lives, and you'd still need the API Gateway to handle requests to whosonfirst-browser.
Nothing wrong with using ECS/Fargate if you provision it through the AWS CDK, it's pretty simple to configure. You'll have a running cost for the service and the ELB involved.
I'd wait on AppRunner to mature a bit in 2022 as it currently can't talk to resources in a VPC. AppRunner also has a running cost.

nvkelso Nov 3, 2021
Collaborator

I've had good results with DynamoDB in a parallel life :)

thisisaaronland · 2021-11-04T22:11:14Z

thisisaaronland
Nov 4, 2021
Maintainer Author

This is all still in branches but:

$> cd /usr/local/data/dynamodb
$> java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb

$> cd /usr/local/whosonfirst/go-whosonfirst-findingaid
$> go run -mod vendor cmd/create-dynamodb-tables/main.go -dynamodb-uri 'awsdynamodb://findingaid?region=us-west-2&endpoint=http://localhost:8000&credentials=static:local:local:local' -refresh
$> make cli && ./bin/populate -producer-uri 'awsdynamodb://findingaid?region=us-west-2&endpoint=http://localhost:8000&credentials=static:local:local:local&partition_key=id' /usr/local/data/sfomuseum-data-maps/

$> cd /usr/local/whosonfirst/go-reader-findingaid
$> make cli && ./bin/read -reader-uri 'findingaid://awsdynamodb/findingaid?region=us-west-2&endpoint=http://localhost:8000&credentials=static:local:local:local&partition_key=id&template=https://raw.githubusercontent.com/sfomuseum-data/{repo}/main/data/' 1360391327 | jq '.["properties"]["wof:name"]'

"SFO (1988)"

0 replies

thisisaaronland · 2021-11-05T16:58:27Z

thisisaaronland
Nov 5, 2021
Maintainer Author

Ballpark costs for DynamoDB setup assuming:

Datasets ranging from 5M records (SFOM inclusive of flight data) to 25M (WOF inclusive of venues)
Document size of 4096 bytes (4KB), representing the JSON struct of a DynamoDB record rather than just column values.
The actual size of the final dataset. I am currently indexing the SFOM data locally and the shared-local-instance.db file is about 1GB for 1.5M records.

            5GB @ 4KB -> $1.25/mo
            25GB @ 4KB -> $6.25/mo

            5M writes/mo -> $25 (initial setup)
            25M writes/mo -> $125 (initial setup)

            100K writes/mo -> $0.50 (updates)

            5M reads/mo -> $0.63
            25M reads/mo -> $3.13
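The figures above can be reproduced with a few lines of arithmetic. The per-unit rates below are assumptions inferred from those figures (they are consistent with on-demand pricing and eventually consistent reads), not numbers quoted from a price sheet, so check current pricing before relying on them. The unit sizes are DynamoDB's standard ones: a write unit covers up to 1KB, and a read unit covers one strongly consistent read of up to 4KB or two eventually consistent ones.

```python
import math

# Assumed rates in USD, inferred from the ballpark figures in this post.
STORAGE_PER_GB_MONTH = 0.25
WRITE_PER_MILLION_UNITS = 1.25
READ_PER_MILLION_UNITS = 0.25


def storage_cost(gigabytes: float) -> float:
    """Monthly storage cost for a table of the given size."""
    return gigabytes * STORAGE_PER_GB_MONTH


def write_cost(writes: int, item_kb: float = 4) -> float:
    """Cost of writing items: one write unit per 1KB (rounded up) per item."""
    units = writes * math.ceil(item_kb / 1)
    return units / 1_000_000 * WRITE_PER_MILLION_UNITS


def read_cost(reads: int, item_kb: float = 4, eventually_consistent: bool = True) -> float:
    """Cost of reading items: one read unit per 4KB, halved if eventually consistent."""
    units_per_read = math.ceil(item_kb / 4) * (0.5 if eventually_consistent else 1.0)
    return reads * units_per_read / 1_000_000 * READ_PER_MILLION_UNITS
```

With 4KB items, write_cost(5_000_000) comes to 25.0 and read_cost(5_000_000) to 0.625, which matches the table to rounding. The initial load dominates; steady-state updates and reads are comparatively cheap.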

2 replies

nvkelso Nov 9, 2021
Collaborator

Another useful trick is to store smaller records in DynamoDB, with the option of keeping full records on S3 for larger records. Something something SPR.

thisisaaronland Nov 9, 2021
Maintainer Author

Yes. To be clear the finding aid is only storing WOF ID -> WOF repo lookups. All the code to resolve an ID to an actual WOF document is handled by other code/logic. For example:

https://github.com/whosonfirst/go-reader-findingaid/blob/main/reader.go

Assuming that the DynamoDB stuff for finding aids works it would be interesting to consider extending it to "Something something SPR".

thisisaaronland · 2021-11-06T00:25:17Z

thisisaaronland
Nov 6, 2021
Maintainer Author

This is a thing that works:

bin/whosonfirst-browser -enable-all \
  -nextzen-tilepack-database /usr/local/nextzen-world-2019-1-10.db \
  -reader-uri 'findingaid://awsdynamodb/findingaid?region=local&endpoint=http://localhost:8000&credentials=static:local:local:local&partition_key=id&template=https://raw.githubusercontent.com/sfomuseum-data/{repo}/main/data/'

And when I visit http://localhost:8080/id/1729644717 I see this:

Tile data is loaded from /usr/local/nextzen-world-2019-1-10.db
WOF data for the record, as well as all its relations (other WOF IDs), is loaded from the sfomuseum-data GitHub repos, using a local DynamoDB finding aid to determine which repo to fetch data from.
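For readers unpicking the -reader-uri value above, this sketch splits it into its parts with the Python standard library. It only shows the shape of the URI; how go-reader-findingaid actually interprets each part is not covered here, and reading the host as the backend and the path as the table name is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def parse_reader_uri(uri: str) -> dict:
    """Split a findingaid:// reader URI into scheme, host, path and parameters."""
    parts = urlsplit(uri)
    # parse_qs returns a list per key; each key appears only once in these URIs.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,            # assumed: the storage backend
        "path": parts.path.lstrip("/"),  # assumed: the table name
        "params": params,
    }


EXAMPLE_URI = (
    "findingaid://awsdynamodb/findingaid?region=local"
    "&endpoint=http://localhost:8000"
    "&credentials=static:local:local:local"
    "&partition_key=id"
    "&template=https://raw.githubusercontent.com/sfomuseum-data/{repo}/main/data/"
)
```

Parsing EXAMPLE_URI shows five parameters: region, endpoint, credentials, partition_key and template, the last of which still carries the {repo} placeholder that the finding aid lookup fills in.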

1 reply

thisisaaronland Nov 6, 2021
Maintainer Author

Related: whosonfirst/go-reader-findingaid#1

thisisaaronland · 2021-11-06T00:38:16Z

thisisaaronland
Nov 6, 2021
Maintainer Author

I am uncertain why this is being terminated by Docker locally; otherwise it works:

whosonfirst-data/whosonfirst-findingaids@aa7d39c

$> docker run whosonfirst-data-findingaid /usr/local/bin/update-findingaids.sh -T 'constant://?val={TOKEN}' -O 604800
Fetch repos updated since 1635552129 (offset 604800 seconds since now)
Cloning into '/usr/local/data/whosonfirst-findingaid'...
Filtering content: 100% (472/472), 18.17 MiB | 962.00 KiB/s, done.
Update finding aid for whosonfirst-data-admin-tr
processed 0 records in 1m0.0075347s (started 2021-11-06 00:02:51.6479084 +0000 UTC m=+0.072559101)
2021/11/06 00:04:20 time to index paths (1) 1m29.3119569s
real	1m 29.52s
user	1m 48.98s
sys	0m 36.24s
Update finding aid for whosonfirst-data-admin-us
processed 0 records in 1m0.0005662s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 2m0.000146s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 3m0.0005198s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 4m0.0008231s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 5m0.0001708s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 6m0.001019s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 7m0.0001856s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 8m0.0003361s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 9m0.0002974s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 10m0.0063814s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 11m0.0051693s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 12m0.0151729s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 13m0.0042336s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 14m0.0033819s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 15m0.0067509s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 16m0.0117186s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 17m0.0150098s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 18m0.010924s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 19m0.0184981s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 20m0.0012859s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 21m0.0215199s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 22m0.0257521s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 23m0.0301767s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 24m0.0419845s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 25m0.0144574s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
processed 0 records in 26m0.0025042s (started 2021-11-06 00:04:21.0164394 +0000 UTC m=+0.014091801)
Command terminated by signal 9
real	26m 37.93s
user	29m 1.01s
sys	6m 50.79s
[main aa7d39c] update finding aids for  whosonfirst-data-admin-tr whosonfirst-data-admin-us
1 file changed, 1 insertion(+), 1 deletion(-)
To https://github.com/whosonfirst-data/whosonfirst-findingaids.git
  60dd92d..aa7d39c  main -> main

I will wire this in to the WOF ECS tasks to run every (n) hours shortly.

https://github.com/whosonfirst-data/whosonfirst-findingaids/blob/main/bin/update-findingaids.sh

0 replies

straup · 2021-11-07T21:07:01Z

straup
Nov 7, 2021
Maintainer

whosonfirst-data/whosonfirst-findingaids@bc2464e

This was performed by an ECS task. Still need to sort out GH credentials for automated tasks.

0 replies

straup · 2021-11-09T06:46:30Z

straup
Nov 9, 2021
Maintainer

whosonfirst-data/whosonfirst-findingaids@5d9cfdc

This was performed by an ECS task pulling its GH credentials from an AWS parameter store and saving work to GH as a dedicated whosonfirst-bot user. This is now a scheduled task set to run every 6 hours.

This only updates the CSV files in the whosonfirst-findingaids repo but that's the first step towards a DynamoDB-backed finding aid.

0 replies
