Feedback and ideas to improve dataset_description IDS #63

prasad-sawantdesai · 2025-05-06T08:16:56Z

prasad-sawantdesai
May 6, 2025
Collaborator

Hello,

We need to discuss a few fields in dataset_description and gather opinions about them.

Do we see a purpose in storing uri within the dataset_description IDS?
https://imas-data-dictionary.readthedocs.io/en/latest/generated/ids/dataset_description.html#dataset_description-type

When a simulation is created, the URI can point to a local path. However, unless explicitly updated, that URI remains unchanged, even if the data entry files are later moved or copied to a different location. As a result, the stored URI might no longer be valid or relevant.

Should we store a GUID, or something else that can uniquely identify the simulation regardless of folder or URI?
I would appreciate your opinion on this.
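
To make the question concrete, here is a minimal sketch of a path-independent identifier, using only the Python standard library. The record layout and the keys `uid` and `original_path` are hypothetical and not part of the Data Dictionary; the paths are made-up examples.

```python
import uuid


def new_dataset_id() -> str:
    """Return a random (version 4) UUID as a string; it carries no path information."""
    return str(uuid.uuid4())


# Hypothetical metadata record (not a Data Dictionary structure).
record = {"uid": new_dataset_id(), "original_path": "/home/user/imasdb/run_1"}

# Moving or copying the data changes the path, but the identifier stays the same.
moved = dict(record, original_path="/archive/run_1")
```

The identifier remains usable after the move, whereas the stored path has to be updated by hand to stay valid.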

Extending data_type_identifier within dataset_description type

While scanning through the multiple YAML files used for storing data entries, we found a predictive type. Should we add it to data_type_identifier? Currently we have experimental and simulation.

https://imas-data-dictionary.readthedocs.io/en/latest/generated/identifier/data_type_identifier.html#identifier-utilities-data_type_identifier.xml

Addition of responsible person field in the dataset_description
In the dataset_description IDS there is a field provider within the structure ids_properties. Most of the time it is a Unix user name, from which it is difficult to identify the person.
I would recommend having a structure that specifies the name of the responsible person, the person who executed the simulation, and their email addresses and organization. This would help users contact the person who originally ran the simulation.
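
As an illustration only, such a structure could look like the sketch below. The class and field names (`Contact`, `Responsibility`, `responsible`, `executed_by`) are hypothetical, chosen for this example, and do not exist in the Data Dictionary; the example person is fictitious.

```python
from dataclasses import dataclass, asdict


@dataclass
class Contact:
    name: str
    email: str
    organization: str


@dataclass
class Responsibility:
    responsible: Contact  # person responsible for the dataset
    executed_by: Contact  # person who actually ran the simulation


# Fictitious example values.
jane = Contact("Jane Doe", "jane.doe@example.org", "Example Lab")
entry = Responsibility(responsible=jane, executed_by=jane)

# asdict() gives a nested dict that could be written next to the Unix user name.
as_metadata = asdict(entry)
```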

Thanks & regards,
Prasad

olivhoenen · 2025-05-06T08:40:58Z

olivhoenen
May 6, 2025
Maintainer

Note that the idea is to quickly evolve dataset_description into an active IDS, so that its content can serve as a schema for part of the metadata one would like to capture when recording/cataloguing a simulation/dataset in IMAS.

ps: note that this IDS was already heavily modified between DD3 and DD4

0 replies

paulotex · 2025-05-06T09:15:16Z

paulotex
May 6, 2025
Collaborator

If uri is filled automatically when dataset_description is created, I think it is useful to know the original path the dataset was created in, even if the dataset moves around. But if uri is just another regular field that is filled by the user as the data is created, I don't see much utility in it and, as Prasad suggests, a UUID would be more useful.

0 replies

olivhoenen · 2025-05-07T07:17:45Z

olivhoenen
May 7, 2025
Maintainer

1. At the moment no data is filled in automatically, apart from a few fields within ids_properties.
2. In any case, if this were filled in automatically, it would be overwritten whenever you copy the data with the tools that inject this information.
3. If you move the data to a different host (simdb remote, scp, ...), how relevant would a path be if it is not even linked to the host?
4. What would you use this information for?

As currently defined (current data entry URI), I see little to no interest in storing a path to self: if you can read this value, you must already know the URI/path of the data entry. Answering remark 4 above may help identify what is needed to replace this quantity.

0 replies

paulotex · 2025-05-07T07:45:55Z

paulotex
May 7, 2025
Collaborator

The main usefulness I see is to know the origin of the data as it gets copied around. But for that, the URI is really not adequate, since there is no specification of what a path means, and the path can change at any time, as a user decides to edit it. I think this is one of your points, Olivier, in your previous comment.

I come back to the UUID, which seems more useful. I can imagine how even experimental data will get a UUID assigned, so that it can be filled in here.

Of course, the UUID should be updated each time the user changes the data and saves it, and it is not clear how this will be implemented or enforced. It is possible to lose the one-to-one connection between the UUID and the data entry. Perhaps add an automated UUID deletion/creation process in the access layer? On the other hand, since we might be moving away from the need to have an access layer, we might have to add an independent tool that generates and writes a UUID into a data entry / dataset_description ...
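
One possible way for such an independent tool to keep the identifier tied to the data is to derive it from the content, so that any change yields a new identifier. This is only a sketch under assumptions: the function name `content_uuid` and the namespace are hypothetical, the data entry is assumed to be a directory of files, and the identifier is assumed to be stored outside the hashed files (otherwise writing it would change the hash).

```python
import hashlib
import uuid
from pathlib import Path

# Hypothetical namespace for this example; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def content_uuid(entry_dir: str) -> uuid.UUID:
    """Derive a name-based (version 5) UUID from the files under entry_dir."""
    root = Path(entry_dir)
    digest = hashlib.sha256()
    # Sort paths so the result does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return uuid.uuid5(NAMESPACE, digest.hexdigest())
```

With this approach a copy in another folder or on another host gets the same identifier, while an edited copy gets a different one. The cost is that the whole data entry has to be read to compute it.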

0 replies

imbeauf · 2025-05-15T13:44:53Z

imbeauf
May 15, 2025
Maintainer

In DDv3, we put the (user, machine, pulse, run) data in dataset_description so that codes could find this information in the IDS data. This has been turned into a URI in DDv4, but I agree it doesn't seem very useful in this form.
Note that the Unique IDentifier for the data can be stored in dataset_fair, since this UID may be minted after the creation of the data.
About dataset_description/type, I am not sure what "predictive" means. In a simulation there are always things that are predicted and others that are kept "as in the experiment", so it's very difficult to have generic qualitative types defined in this regard.
The definition of ids_properties/provider could be extended to include the email address and organization of the person, but then it's something that should be provided manually by the user. I guess in simulation datasets only the Unix user is indicated because it is the only information available to the simulation program. We can discuss modifying the node definition, then it's a matter of process for filling the node with the relevant information.

0 replies

olivhoenen · 2025-05-16T09:52:05Z

olivhoenen
May 16, 2025
Maintainer

I think this discussion jumped too quickly into details around the UUID or DOI notions.

From the exchanges so far in this thread, I identify a first action: removing the unclear or ill-purposed fields such as uri and type.

I believe the second step will be to better define the role of dataset_description inside the IDS portfolio. IMO it should be used to store a very high-level description (via a few metadata) of the dataset. Indeed, this IDS will never be expected to be used as an IM code coupling interface, and leans more towards humans and databases.

Considering that the rest of the structure is expected to capture some info about either the pulse on a given experiment or a simulation, I would find it natural to group these two concerns as structures.

Consequently, I propose to group all pulse-related data under a common struct pulse, and to add duration and placeholders for comments before and after the pulse (these are present in the simulation structure only). Finally, I propose to remove pulse_processing_time_begin, as processing could then be described in the simulation section:

├── machine
└── pulse/
    ├── comment_before
    ├── time_begin
    ├── time_begin_epoch
    ├── time_end_epoch
    ├── duration
    └── comment_after

Current structure (for reference)

├── machine
├── pulse
├── pulse_time_begin
├── pulse_time_begin_epoch
├── pulse_time_end_epoch
└── pulse_processing_time_begin
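
The regrouping above can be sketched as a simple mapping from the current flat names to the proposed nested ones. This is an illustration on plain dictionaries, not on real IDS objects; the function name `regroup_pulse_fields` is hypothetical. The pulse number is left out because the proposal does not say where it would live, and the new nodes (duration, comment_before, comment_after) have no current counterpart to copy from.

```python
# Current flat node name -> proposed name under the pulse structure.
RENAMES = {
    "pulse_time_begin": "time_begin",
    "pulse_time_begin_epoch": "time_begin_epoch",
    "pulse_time_end_epoch": "time_end_epoch",
}


def regroup_pulse_fields(flat: dict) -> dict:
    """Return the proposed nested layout for a dict using the current flat names."""
    nested = {"machine": flat.get("machine"), "pulse": {}}
    for old_name, new_name in RENAMES.items():
        if old_name in flat:
            nested["pulse"][new_name] = flat[old_name]
    # pulse_processing_time_begin is dropped on purpose, as proposed above.
    return nested
```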

5 replies

imbeauf May 20, 2025
Maintainer

Some comments on Olivier's previous comment:

1. We may get rid of URI, ok ...
2. Although it's difficult to categorize simulation types, I think it's important to keep "type" to distinguish between experimental and simulation data
3. Duration already exists in the Summary IDS: plasma_duration, I don't think it should be duplicated in dataset_description
4. I am not a fan of adding a pulse structure: it breaks backward compatibility for very little added value. Moreover, "pulse" is not equivalent to "experiment": a type=simulation dataset may also relate to an experiment pulse number. I feel rather that only the simulation structure is useful, for metadata that are specific to a simulation.
5. comment_before and comment_after are indeed not specific to a simulation and could be moved out of the simulation structure, with their definition generalized
6. pulse_processing_time_begin may indeed be replaced by simulation/time_begun if we update its definition to indicate that it represents the start date of the experimental data processing if type = experimental

olivhoenen May 20, 2025
Maintainer

Thanks for the comments @imbeauf. While discussing this once again, we are wondering whether it is clear why we have dataset_description in addition to summary, and whether it wouldn't be simpler and clearer to store the few remaining static quantities from dataset_description in the summary IDS (dataset_description could then simply disappear, which should not be much of a concern as it was having a full makeover anyway from DD3 to 4).

2. ok fine keeping type
3. good point I missed the plasma_duration field of the summary
4. fine, together with your comment on 5

What we are also missing is a well-defined place where the data provider can give a description of the pulse or simulation. This could be done in comment_before, but I feel that the name poorly reflects this intent (there is also ids_properties/comment, but here also it does not give any indication about what is expected).

Finally, the purpose of storing several metadata from the simulation structure is not clear:

time_current (this is not expected to be live data material, so what is expected there?)
time_ended (while time_begun could be used to trace back to when the simulation was running, I am not sure why its end time would provide more insight and what it would be used for)
time_restart (same comment as for time_ended)
workflow (may be redundant with code/parameters or with a generic description of the simulation somewhere else?)

By the way do you have an example of dataset_description used at WEST and what sort of information are you storing in it?

imbeauf May 22, 2025
Maintainer

The initial motivation for separating the Summary and Dataset_Description was that, to fill the Summary IDS, you need a specific process that extracts reduced information from the other IDSs and copies it into the Summary, while Dataset_description is more of a "manual", human-readable description of the dataset that you could fill without having the specific data reduction process for the summary data. But you are right, there is maybe no fundamental reason to distinguish the two, since both describe the content of the dataset in a reduced way.
The manual description of the pulse or simulation is intended to be in dataset_description/ids_properties/comment. We could clarify that in the documentation. If we merge Summary and Dataset_description, we may add a dedicated node.
For the simulation structure: I think it's mostly for JINTRAC that we have all these nodes; the JINTRAC developers could comment on how they use it. The time_current represents the current time in the simulation: either it's something that is regularly updated during the JINTRAC workflow, or it's the time at which the simulation stopped if it was interrupted.
I checked a WEST pulse from the last campaign and we are very bad with dataset_description: we write it but leave it completely empty ...

olivhoenen May 23, 2025
Maintainer

Thanks for the extra info @imbeauf. I would indeed favor a dedicated node for storing the description of the dataset, rather than documenting a specific case for the ids_properties/comment. I also note that dataset_description contains info that can be known only after the simulation so it may also be filled in as a post-process (e.g. simulation/comment_after and simulation/time_ended).

@fcasson @koechlf can you comment on the use of the dataset_description/simulation structure in JINTRAC, if any? (For some of the recent simulations pushed to simdb I could not find a dataset_description at all, or is there a confusion between the workflow IDS and the dataset_description one?)

Looking in the ITER scenario DB, ~800 / 1300 simulations have filled in the dataset_description (this number drops to 175 if I discard SOLPS); only METIS seems to fill many of the fields under the simulation structure (of which time_begin, time_end and time_step seem to be the only relevant info).

With all that in mind I'd like to propose to add the useful part of dataset_description into summary and to remove the former from the list of IDSs in DD4. Please comment further if you disagree.

fcasson May 23, 2025

IIRC we (JINTRAC) only fill dataset_description when an old (non IMAS) JINTRAC case is imported into IMAS, as somewhere to put the import provenance metadata (same metadata that went in the old yaml files of the legacy DB). This was useful because we could add it separately after the other simulation IDS were already produced. There was a use case for metadata that needs to change (or be written) after simulation is finished (but maybe simdb handles all that now). Question is - do you want to store the "description" and other high level metadata that SimDB has also in an IDS? If so, that is what we used the dataset_description for.

When we run JINTRAC directly with IMAS, dataset_description is not produced. Workflow IDS is a totally different use case (more about internal workflow info, not metadata)

> IMO it should be used to store a very high-level description (via a few metadata) about the dataset. Indeed this IDS will never be expected to be used as an IM code coupling interface, and be more leaning towards humans and databases.

I agree with this

DavidPCoster · 2025-05-26T09:00:54Z

DavidPCoster
May 26, 2025
Collaborator

We have had the discussion before: should the summary IDS contain information that is not stored anywhere else, or should it be based on data that is stored in other IDSes?
I raise this since it could affect moving fields from dataset_description to summary.

For me as a provider of summary data for AUG pulses, the important fields are

dataset_description.data_entry.user: dpc
dataset_description.data_entry.machine: ASDEX Upgrade
dataset_description.data_entry.pulse_type: pulse
dataset_description.data_entry.pulse: 41570
dataset_description.data_entry.run: 0
dataset_description.pulse_time_begin: 2022-07-27T16:10:00Z

Who to contact in case of a problem
Which machine produced the experimental data
That the data is experimental
Which shot number on that machine
Which version of the summary data this is
When was the pulse

As a provider of simulation results, the most important fields are

dataset_description.simulation.workflow
dataset_description.simulation.comment_before

and potentially comment_after

1 reply

olivhoenen May 28, 2025
Maintainer

Thanks @DavidPCoster, I think we can store all these in summary (in a DD4-compatible version, e.g. type pointing to a list of well-defined identifiers rather than a free string value). We concluded previously that there will never be a summary with all its content derived from other IDSs (but we can certainly continue to think about and implement functions that would help fill it from other IDSs).

DavidPCoster · 2025-05-26T09:05:47Z

DavidPCoster
May 26, 2025
Collaborator

I think parent_entry should also have a URI field, and -- if we ever get FAIR compliant -- a DOI field.

2 replies

olivhoenen May 28, 2025
Maintainer

parent_entry is already gone from dataset_description in DD4, so we won't push for having it in summary. Our idea for tracking parent datasets (e.g. inputs for a simulation/workflow) is to record or point to input URIs when registering the simulation in SimDB. This will make it possible to ensure that the input datasets are present on the simdb server (linked to an existing entry, or pushed together with the simulation results).

imbeauf May 28, 2025
Maintainer

I think "parent_entry" is a kind of provenance information and should be stored in the summary/ids_properties/provenance structure.
I think all data and metadata must have placeholders in IDSs, otherwise it creates new dependencies on external tools (why would SimDB be mandatory to use IMAS data?), needs for new APIs ...

olivhoenen · 2025-06-16T15:52:49Z

olivhoenen
Jun 16, 2025
Maintainer

To be continued as #83

0 replies

mkopsnc · 2025-07-11T08:10:25Z

mkopsnc
Jul 11, 2025

From the perspective of Catalog QT 2 application, it is not important whether information is stored inside summary or inside dataset_description. However, we need some info (currently taken from dataset_description) to properly handle the data import process:

data_entry/user                  Username {constant}                                                       STR_0D
data_entry/machine               Name of the experimental device to which this data is related {constant}  STR_0D
data_entry/pulse_type            Type of the data entry, e.g. "pulse", "simulation", ... {constant}        STR_0D
data_entry/pulse                 Pulse number {constant}                                                   INT_0D
data_entry/run                   Run number {constant}                                                     INT_0D
ids_properties/homogeneous_time  Must be set to 2 {constant}                                               INT_0D

We can, however, get this information from a different place (e.g. summary IDS).
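
Wherever these values end up being stored, the import-side check can stay the same. Below is a minimal sketch, assuming the metadata has already been read into a plain dictionary keyed by the node paths listed above; the function name `check_import_metadata` is hypothetical and is not taken from the Catalog QT code. The example values are the ones quoted earlier in this thread for an AUG pulse.

```python
# Node path -> expected Python type (STR_0D -> str, INT_0D -> int).
REQUIRED = {
    "data_entry/user": str,
    "data_entry/machine": str,
    "data_entry/pulse_type": str,
    "data_entry/pulse": int,
    "data_entry/run": int,
    "ids_properties/homogeneous_time": int,
}


def check_import_metadata(metadata: dict) -> list:
    """Return the node paths that are missing, mistyped, or have a wrong value."""
    problems = []
    for path, expected_type in REQUIRED.items():
        value = metadata.get(path)
        if not isinstance(value, expected_type):
            problems.append(path)
        elif path == "ids_properties/homogeneous_time" and value != 2:
            problems.append(path)  # must be set to 2
    return problems


example = {
    "data_entry/user": "dpc",
    "data_entry/machine": "ASDEX Upgrade",
    "data_entry/pulse_type": "pulse",
    "data_entry/pulse": 41570,
    "data_entry/run": 0,
    "ids_properties/homogeneous_time": 2,
}
```

Calling `check_import_metadata(example)` returns an empty list; removing or mistyping a node makes its path appear in the result.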

0 replies
