Skip to content

(0.5.1) Metadata for JRA55 #286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 106 commits into from
Mar 11, 2025
Merged

(0.5.1) Metadata for JRA55 #286

merged 106 commits into from
Mar 11, 2025

Conversation

simone-silvestri
Copy link
Collaborator

@simone-silvestri simone-silvestri commented Dec 5, 2024

This PR is an initial proposal to generalize ECCOMetadata to Metadata and rework the JRA55 module to use Metadata. In this way, we can have different JRA55 versions (repeat year and multiple year) and we can define a download_dataset function to download the dataset independently of using JRA55 as we can do for ECCO

This PR also removes the ability to generate a JRA55FieldTimeSeries directly interpolated on the ocean grid, since we need to interpolate anyways when we compute fluxes

closes #182

@glwagner
Copy link
Member

glwagner commented Dec 5, 2024

I think we should get #251 and #284 merged first, do you want to help with those? Otherwise we will have conflicts.

@simone-silvestri
Copy link
Collaborator Author

sounds good

@glwagner
Copy link
Member

glwagner commented Mar 9, 2025

I think the difference is whether we want Metadata to be part of the user interface or just an internal convenience type that helps us wrangle data.

In the first case, it is nice to be able to represent a dataset composed of a version, a name, and a set of dates in a type (here we can probably go the Metadata - Metadatum route), while a Vector{Metadata} might not be a consistent dataset because there can possibly be differences in versions or variable names.

In the latter case, there is no problem with the user interface, Metadata can be hidden in the internals with the user interface exposing only the methods that explicitly accept variable_name, dates, and version for things like ECCOFieldTimeSeries and JRA55FieldTimeSeries. We would have to think about how to change the set! function to remove the Metadata option. In the internals, when we pass a Vector{Metadata} to functions like FieldTimeSeries we need to ensure consistency of variable_name and version of all the elements of the vector.

Ok, please clarify what you see as the trade-offs for user interface. I think you are assuming a design, but it is not being explicitly state. I can't respond or judge what user interface you are referring to, unless you state it explicitly.

Part of the problem is the proposal to define someting like

struct Metadatum # represents a single file
    # properties
end

const Metadata = Vector{Metadatum}

has no specific implications for how Metadata is constructed. We can keep the same constructor.

The difference is mainly that we could define a new constructor for a single Metadatum. This avoids the confusion that we pass dates = a_single_date to Metadata.

So in summary i don't see what changes would be required of the user interface. The difference is that we can expand the user interface to make more sense while leaving existing components unchanged.

@glwagner
Copy link
Member

glwagner commented Mar 9, 2025

Here's a simple example, just one of many possibilities

# source
Metadata(name; version, dates) = [Metadatum(name; version, date for date in dates]
datum = Metadatum(name; version, date)
data = Metadata(name; version, dates)

the idea of a "version" for general Metadata is weird to me by the way. What concept are we expressing here. data_source or just source? Version implies that a data is specific but has multiple "versions" or "updates" which doesn't seem right.

@simone-silvestri
Copy link
Collaborator Author

I was referring to achieving something similar avoiding a Vector that might mix different versions, I guess the end goal is the same.
An example is

Metadatum
   name
   version 
   date
   dir
end

@propagate_inbounds Base.getindex(m::Metadata, i::Int) = Metadatum(m.name, m.dates[i],   m.version, m.dir)
@propagate_inbounds Base.first(m::Metadata) = Metadatum(m.name, m.dates[1],   m.version, m.dir)
@propagate_inbounds Base.last(m::Metadata) = Metadatum(m.name, m.dates[end], m.version, m.dir)

@inline function Base.iterate(m::Metadata, i=1)
    if (i % UInt) - 1 < length(m)
        return Metadatum(m.name, m.dates[i], m.version, m.dir), i + 1
    else
        return nothing
    end
end

also in this way it would be possible to do

datum = Metadatum(name; version, date)
data = Metadata(name; version, dates)

@glwagner
Copy link
Member

glwagner commented Mar 9, 2025

I guess I was seeing the ability to mix "versions" as a perk. For example JRA55 was originally generated up to 2018, but there are updates which continue the dataset past that. You may not be able to form a single consistent dataset (eg a single "version") that encompasses all dates, but it still could be valid to write something like

up_to_2018 = Metadata(name; dataset=original_dataset, dates=dates_till_2018)
past_2018 = Metadata(name; dataset=continuation, dates=dates_past_2018)
data = vcat(up_to_2018, past_2018)

by the way what does "version" mean in the context of a general Metadatum? It seems we need a different word like "dataset" or "label" which is more general than just "version"

@simone-silvestri
Copy link
Collaborator Author

simone-silvestri commented Mar 9, 2025

I guess version is more suited in main where Metadata is ECCOMetatada and the dataset can only be part of the ECCO dataset with a specific version (2, 4, etc).

You are right that that field name has to change in this PR, I like dataset because it points to the dataset the metadata points to

@glwagner
Copy link
Member

glwagner commented Mar 9, 2025

I guess version is more suited in main where Metadata is ECCOMetatada and the dataset can only be part of the ECCO dataset with a specific version (2, 4, etc).

You are right that that field name has to change in this PR, I like dataset because it points to the dataset the metadata points to

eg

struct ECCODataset
    version :: Symbol
end

am I right that the path to a particular file is determined by a combination of date, dir, and dataset / version?

@simone-silvestri
Copy link
Collaborator Author

yep, in general the full file path is determined by the whole metadatum (name, date, dataset / version, and dir)

@glwagner
Copy link
Member

glwagner commented Mar 9, 2025

Ruminating on "dataset" --- I think it could be a good term, because it expresses the concept of a "category of data". Which is what we mean here, a single metadatum refers to one file; a whole dataset will have many files, each of which has a metadatum.

Also just to offer an alternative --- rather than metadatum / metadata, we could have

struct Metadata
    name
    dataset
end

const MetadataSeries = Vector{Metadata}

semantically, MetadataSeries is easier to distinguish from Metadata than Metadata is to distinguish from Metadatum. It might help too, to emphasize that a vector of Metadata is a series of snapshots.

@simone-silvestri
Copy link
Collaborator Author

I think I can merge here even if the examples are not passing. We are hitting a scalar indexing problem, which I believe is connected to CliMA/Oceananigans.jl#4087 but is not part of this PR.

Then we can think about an interface for metadata that allows distinguishing between dates and date

@simone-silvestri simone-silvestri changed the title (0.5.0) Metadata for JRA55 (0.5.1) Metadata for JRA55 Mar 11, 2025
@simone-silvestri simone-silvestri changed the title (0.5.1) Metadata for JRA55 (0.6.0) Metadata for JRA55 Mar 11, 2025
@simone-silvestri simone-silvestri changed the title (0.6.0) Metadata for JRA55 (0.5.1) Metadata for JRA55 Mar 11, 2025
@simone-silvestri
Copy link
Collaborator Author

The errors in the examples are not due to this PR. I opened an issue with the error. Let's not delay this PR further. I will merge this, then we can look at differences in Metadata / Metadatum and we probably should take some time in resolving all the errors in the examples.

@simone-silvestri simone-silvestri merged commit e8516e8 into main Mar 11, 2025
7 of 9 checks passed
@simone-silvestri simone-silvestri deleted the ss/metadata-for-everything branch March 11, 2025 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build docs Add this label to built the docs in a PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactoring JRA55
3 participants