Replies: 2 comments
-
There is some interest from the group in using Parquet: it can handle geospatial data, can include indices, and is column-oriented. @tngTUDOR also mentioned Cassandra, a NoSQL database.
-
I looked into various data platforms. Most big data solutions address problems we don't have. In particular, we will have a lot of small data files, and some of them might have only one row each (though we hope for more in the future). We have no need to process massive data streams or provide real-time analytics. What do we actually need?
Current decision: We will use Postgres to store the numeric data. Here is a preliminary schema:
BEGIN;
CREATE EXTENSION postgis;
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    metadata JSONB, -- Includes author, license, version, source; this will evolve, so enforce the schema on the client side
    outdated BOOL
);
CREATE TABLE location (
    id INTEGER PRIMARY KEY,
    metadata JSONB,
    geom geometry(MultiPolygon,4326)
);
CREATE TABLE observations (
    id INTEGER PRIMARY KEY,
    dataset INTEGER REFERENCES dataset(id),
    location INTEGER REFERENCES location(id),
    start_date DATE, -- There is probably a better way with an interval, but this is easy to understand and implement
    end_date DATE,
    data BYTEA
);
-- Plus some indices
COMMIT;
One thing I am not sure about is how to index the column terms in
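Since the schema comment says the metadata schema should be enforced on the client side, here is a rough sketch of what that could look like, together with one possible way to turn numeric values into bytes for the `data` column. Everything here is an assumption for illustration: the required keys are just the four named in the comment, the function names are made up, and the little-endian float64 encoding is only one option, not a decided format.

```python
import json
import struct

# Hypothetical required keys, taken from the schema comment
# (author, license, version, source). The real list is expected to evolve.
REQUIRED_METADATA_KEYS = {"author", "license", "version", "source"}


def validate_metadata(metadata: dict) -> str:
    """Check the required keys and return a JSON string for a JSONB column."""
    missing = REQUIRED_METADATA_KEYS.difference(metadata)
    if missing:
        raise ValueError(f"missing metadata keys: {sorted(missing)}")
    return json.dumps(metadata, sort_keys=True)


def pack_values(values: list[float]) -> bytes:
    """Pack floats as little-endian float64 (an assumed encoding for BYTEA)."""
    return struct.pack(f"<{len(values)}d", *values)


def unpack_values(blob: bytes) -> list[float]:
    """Inverse of pack_values: read back 8-byte little-endian floats."""
    count = len(blob) // 8
    return list(struct.unpack(f"<{count}d", blob))
```

The string from `validate_metadata` could be passed as the `metadata` parameter of an insert, and the `bytes` from `pack_values` as `data`. A fixed binary layout like this keeps one-row files cheap to store, but it moves knowledge of the column layout to the client, which is part of the indexing question above.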
-
A data lake is a place you can put unstructured or unnormalized data.
A data warehouse has structure and is normally normalized.
A data lakehouse is a compromise which supports both paradigms.
We will start with a data warehouse. But what does that mean for us? We definitely don't want to commit to a specific big data platform now, and maybe not ever. But we also don't want to be paying fees to transfer massive amounts of data across systems every time we call a function. We need a technical solution which supports our API and makes it feel like magic for end users.
Bonus points for open source library implementations and the ability to mirror or federate data, so that partner institutions can donate computational resources as in-kind contributions.