@schoi839 notes that some models might want to write "raw" data back to the datastore - for example, gap-filling models. In theory, the raw data should be independent observations; in practice, we will need to use simulation outputs. Should we allow models to write to the tabular data store during calculation?
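A minimal sketch (not an actual API in this codebase) of what such a write-back path could look like, assuming a hypothetical `TabularDatastore` that tags rows by provenance, so that simulated values from gap-filling models stay distinguishable from independent observations:

```python
from enum import Enum

import pandas as pd


class Provenance(Enum):
    OBSERVED = "observed"
    SIMULATED = "simulated"  # e.g. output of a gap-filling model


class TabularDatastore:
    """Hypothetical datastore that tags every row with its provenance."""

    def __init__(self) -> None:
        self._frames: list[pd.DataFrame] = []

    def write(self, df: pd.DataFrame, provenance: Provenance) -> None:
        # Tag rows so downstream consumers can filter out simulated
        # values when they need independent observations.
        self._frames.append(df.assign(provenance=provenance.value))

    def read(self, include_simulated: bool = False) -> pd.DataFrame:
        if not self._frames:
            return pd.DataFrame()
        result = pd.concat(self._frames, ignore_index=True)
        if not include_simulated:
            result = result[result["provenance"] == Provenance.OBSERVED.value]
        return result
```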
In order to properly answer this question, we first need to understand:
After normalization to the glossary, the current assumption is that most data is tabular (CSVs, vector geospatial data, time series). Our basic conception is that this tabular data has columns as parameters of interest and rows as observations. Measured or modelled data that includes correlated observations should keep those correlated values in the same row, as in the sketch below.
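For instance (column names are purely illustrative), correlated electricity and heat measurements taken during the same plant run would share a row rather than being split across records:

```python
import pandas as pd

# Each row is one observation; correlated values (electricity and heat
# measured during the same plant run) stay together in that row.
observations = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02"]),
        "electricity_kwh": [120.5, 118.2],
        "heat_mj": [430.1, 425.8],
    }
)
```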
Tabular data has a given context - the metadata that describes what the dataset is about. This would include when the observations were made (and the time interval they are valid for), what technology was measured, who produced the good or service, etc.
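Such context might be captured as a simple key-value mapping stored alongside the table; these keys are illustrative, not a proposed schema:

```python
# Illustrative dataset-level context; the key names are assumptions,
# not a settled metadata schema.
metadata = {
    "valid_from": "2023-01-01",
    "valid_until": "2023-12-31",
    "technology": "combined heat and power",
    "producer": "Example Energy Co.",
    "product": "electricity, high voltage",
}
```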
We will probably use something like Parquet, which has a lot of energy behind it, is already column-oriented, has standardized ways to describe metadata and versioning, and supports geospatial data.
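A sketch of how dataset-level context could travel inside the Parquet file itself, using pyarrow's schema-level key-value metadata (the `dataset_context` key is an assumption for illustration, not a standard):

```python
import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

observations = pd.DataFrame(
    {"electricity_kwh": [120.5, 118.2], "heat_mj": [430.1, 425.8]}
)
metadata = {"technology": "combined heat and power", "valid_from": "2023-01-01"}

# Attach the dataset context to the file's schema metadata so the
# table and its description travel together in a single file.
table = pa.Table.from_pandas(observations)
table = table.replace_schema_metadata(
    {**(table.schema.metadata or {}), b"dataset_context": json.dumps(metadata).encode()}
)
pq.write_table(table, "observations.parquet")

# The context is recoverable without reading the data itself.
context = json.loads(
    pq.read_schema("observations.parquet").metadata[b"dataset_context"]
)
```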
But what would an API look like? @cmutel has a very strong preference for building on simple building blocks (as these will change in the future in any case), and we aren't ready to commit ourselves and our partners to enterprise solutions like Snowflake, BigQuery, Redshift, or Databricks.
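As a strawman in the "simple building blocks" spirit, the API could be as plain as a directory of Parquet files behind a few methods; `ParquetStore` and its method names are hypothetical:

```python
from pathlib import Path

import pandas as pd


class ParquetStore:
    """Strawman datastore: one Parquet file per dataset in a local directory."""

    def __init__(self, root: str | Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, name: str, df: pd.DataFrame) -> Path:
        path = self.root / f"{name}.parquet"
        df.to_parquet(path)  # pandas delegates to pyarrow under the hood
        return path

    def get(self, name: str) -> pd.DataFrame:
        return pd.read_parquet(self.root / f"{name}.parquet")

    def list(self) -> list[str]:
        return sorted(p.stem for p in self.root.glob("*.parquet"))
```

Versioning, access control, and remote storage could be layered on later without locking into any particular vendor.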
The use cases here are relatively simple: