Technical requirements for a data warehouse #6

cmutel · 2024-04-12T12:26:03Z

cmutel
Apr 12, 2024
Maintainer

A data lake is a place you can put unstructured or unnormalized data.
A data warehouse has structure and is normally normalized.
A data lakehouse is a compromise which supports both paradigms.

We will start with a data warehouse. But what does that mean for us? We definitely don't want to commit to a specific big data platform now, and maybe not ever. But we also don't want to be paying fees to transfer massive amounts of data across systems every time we call a function. We need a technical solution which supports our API and makes it feel like magic for end users.

Bonus points for open source library implementations and the ability to mirror or federate data, so that partner institutions can donate computational resources as in-kind contributions.

cmutel · 2024-04-13T09:28:12Z

cmutel
Apr 13, 2024
Maintainer Author

There was some interest from the group in using Parquet: it can store geospatial data, can include indices, and is column-oriented. @tngTUDOR also mentioned Cassandra, a NoSQL database.


cmutel · 2024-09-28T06:57:43Z

cmutel
Sep 28, 2024
Maintainer Author

I looked into various data platforms. Most big data solutions address problems we don't have. In particular, we will have a lot of small data files, and some data files may have only one row each (though we hope for more in the future). We have no need to process massive data streams or provide real-time analytics.

What do we actually need?

Store a data vector or array. Arrays are only used when multiple observation sets are correlated, so that all data in a given row should be used together (think fuel input and CO2 output).
Store metadata about the data vector - most importantly the vocab terms for the columns, but also the spatial bounding box and temporal limits, and a reference to the dataset it comes from. For datasets we have standard metadata, like version, author, license, source. All of this fits very nicely in a relational database, and we know how to write quick APIs and indices.
We can store the raw numeric data as binary blobs containing serialized numpy arrays. They can live in the database, as they will normally be quite small; if this ever becomes a performance bottleneck, we can switch to a foreign data wrapper.
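To make the blob idea concrete, here is a minimal sketch of the round trip between a numpy array and bytes that could go into a binary column. It assumes the `.npy` format via `np.save`, which is one option among several; the helper names `array_to_blob` and `blob_to_array` and the sample values are hypothetical, not something we have agreed on.

```python
import io

import numpy as np


def array_to_blob(arr: np.ndarray) -> bytes:
    """Serialize an array to bytes in the .npy format (keeps dtype and shape)."""
    buffer = io.BytesIO()
    # allow_pickle=False rejects object arrays, so the blob holds plain numeric data only
    np.save(buffer, arr, allow_pickle=False)
    return buffer.getvalue()


def blob_to_array(blob: bytes) -> np.ndarray:
    """Restore an array from bytes produced by array_to_blob."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


# Hypothetical correlated observations: column 0 is fuel input, column 1 is CO2 output
observations = np.array([[1.0, 2.5], [2.0, 5.1]])
blob = array_to_blob(observations)
restored = blob_to_array(blob)
```

The `.npy` header carries dtype and shape, so nothing extra has to be stored alongside the blob to reconstruct the array. The trade-off is a fixed header overhead per blob, which matters more when each array is tiny.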

Current decision: We will use Postgres to store the numeric data. Here is a preliminary schema:

BEGIN;
CREATE EXTENSION postgis;

CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    metadata JSONB, -- Includes author, license, version, source - this will evolve, enforce schema on client side
    outdated BOOL
);
CREATE TABLE location (
    id INTEGER PRIMARY KEY,
    metadata JSONB,
    geom geometry(MultiPolygon,4326)
);
CREATE TABLE observations (
    id INTEGER PRIMARY KEY,
    dataset INTEGER REFERENCES dataset(id),
    location INTEGER REFERENCES location(id),
    start_date DATE,  -- Probably there is a better way with interval but this is easy to understand and implement
    end_date DATE,
    data BYTEA
);

-- Plus some indices
COMMIT;

One thing I am not sure about is how to index the column terms in observations, as there can be more than one column.
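One possible approach, sketched here only as a starting point for discussion, is a separate junction table with one row per observation column, indexed on the term. The sketch uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs without a server; the table name `observation_column`, its fields, and the vocab terms are all hypothetical. In Postgres, a text array column with a GIN index might be an alternative worth comparing.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; only the table layout matters
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE observations (
        id INTEGER PRIMARY KEY,
        data BLOB
    );
    -- Hypothetical junction table: one row per column of each stored array
    CREATE TABLE observation_column (
        observation INTEGER NOT NULL REFERENCES observations(id),
        position INTEGER NOT NULL,  -- column index inside the stored array
        term TEXT NOT NULL,         -- vocab term describing that column
        PRIMARY KEY (observation, position)
    );
    CREATE INDEX observation_column_term ON observation_column (term);
    """
)

# Observation 1 has two correlated columns, observation 2 has a single column
conn.executemany(
    "INSERT INTO observations (id, data) VALUES (?, ?)",
    [(1, b""), (2, b"")],
)
conn.executemany(
    "INSERT INTO observation_column (observation, position, term) VALUES (?, ?, ?)",
    [(1, 0, "fuel_input"), (1, 1, "co2_output"), (2, 0, "co2_output")],
)


def observations_with_term(term):
    """Return (observation id, column position) pairs whose column uses the given term."""
    cursor = conn.execute(
        "SELECT observation, position FROM observation_column "
        "WHERE term = ? ORDER BY observation, position",
        (term,),
    )
    return cursor.fetchall()
```

Storing the position next to the term means a lookup also tells the client which column of the array to slice, at the cost of one extra row per column.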

