-
Notifications
You must be signed in to change notification settings - Fork 1
Description
Add SQL and Apache Arrow as possible datastore formats. Here ae some considerations:
- SQL
Need to execute all DDL up front based on model run specifications. Need proper mapping to Database + Tables (hierarchical Group/Table structure not directly supported; needs a mapping). The Global group is probably a separate set of tables. The Year groups need to be identified with a Year column in a single table. When first adding to a Table, make sure that "missing future columns" (produced later in the model run) are able to be added with NULL values. Need to handle Add-versus-Update successfully (pre-query existence of key rows based on year and scenario).
Handling model stages and scenarios also needs to be considered. Do the scenarios/model stages accumulate in single tables (with a Scenario identifier)? That is probably desirable (when subsequent stages with additional fields being added are created, we would need to Alter Table once at the outset of that scenario but only if the required columns do not already exist.
So the aim would be one Database per model, with the tables set up based on the modules that are run across all stages. Set up in each stage will entail checking that tables and columns exist and executing DDL to create them if not.
Implication of all that is that Datastore "chaining" is a logical operation performed in place: possibly cache row selection keys to identify the scenario. SQL Datastores are already "flat" in that all scenarios exist in one place, and the Datastore Path is a "logical" search that maps a search in the current scenario into a search in earlier scenarios for other columns. The set of DatastorePaths creates a "View" by building up a set of scenarios rather than a single scenario when selecting rows (writing always goes into the current scenario, reading will reach for the previous update). When a new scenario runs, remember that columns that might be pulled from an earlier scenario could already be there but with "NULL". May need some fun self-joining with limits to particular columns in order to come up with a "View" of the total Datastore - join would be on year and geography, and only a subset of columns from each version of the table join, yielding a single virtual table... So we'll need to keep track in each element of DatastorePath of which columns should be accessed from that Scenario.
An alternative would be to keep the hierarchy, where a set of tables is constructed for each scenario, and the DatastorePath tells us where to look for specific columns in each scenario (so it won't all be flat). That would be simpler to architect, at the expense of having more tables with less clear relationships to each other.
Consider this: https://learnsql.com/blog/how-to-query-hierarchical-data/
Of this: https://www.databasestar.com/sql-cte-with/
Good old VIEWs are also in play here: the DatastorePath entails construction of a view on a certain table (joined by year and and geography and selected by scenario). So the DatastorePath (and the table of fields "offered" at that point on the path, which will itself be a table, possibly shared among all scenarios/stages) is used to create a VIEW that can be constructed when a stage starts. Catch is that the VIEW needs to be updated when the current stage writes to a field that was initially furnished by an earlier stage. So need to watch the write operation, and if we add/replace values in a column for the current scenario (where it's always written), we update the view to pull subsequent reads from the newly written field instead of what was available in the earlier stage.
So SQL table and query management is going to be tricky!