Bug in data writer #506

@mlangguth89

Description

What happened?

When running inference on a model trained with multiple input data streams, but requesting output for only a subset of those streams via the --analysis_streams_output argument, incorrect data may be written to disk, depending on the order of the input streams.
Specifically, if the first input stream is omitted from --analysis_streams_output, the output of that omitted stream may be written to the Zarr directory of the second stream.
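
One way such a mismatch could arise is if outputs are paired with the requested stream names by position rather than by name. This is only a guess at the failure mode, not something confirmed from the WeatherGenerator code, and all names below (`write_by_position`, `write_by_name`, the `"...-data"` stand-ins) are hypothetical:

```python
# Hypothetical illustration; none of these names come from the WeatherGenerator code base.
all_streams = ["ERA5", "NPPATMS", "SurfaceCombined"]  # stream order used during training
requested = ["NPPATMS", "SurfaceCombined"]            # value of --analysis_streams_output

# Stand-ins for the per-stream model outputs, in training order.
outputs = [f"{name}-data" for name in all_streams]


def write_by_position(requested, outputs):
    """Suspected buggy pattern: pair requested names with outputs by position."""
    return dict(zip(requested, outputs))


def write_by_name(all_streams, requested, outputs):
    """Safer pattern: look up each output by its stream name."""
    by_name = dict(zip(all_streams, outputs))
    return {name: by_name[name] for name in requested}


buggy = write_by_position(requested, outputs)
fixed = write_by_name(all_streams, requested, outputs)
print(buggy)  # NPPATMS is paired with the ERA5 stand-in
print(fixed)  # each requested stream is paired with its own stand-in
```

With positional pairing, the ERA5 output ends up under NPPATMS, which would be consistent with the symptom described above.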

What are the steps to reproduce the bug?

  1. Train a model on ERA5, NPPATMS, and SurfaceCombined data (other stream combinations are possible here):
    ../WeatherGenerator-private/hpc/launch-slurm.py --config config/mixed.yml --time 5
    
  2. Run inference on the trained model, but omit the ERA5 stream:
     uv run --offline inference --from_run_id xcl9xai1  --samples 2 --config ./config/mixed.yml --analysis_streams_output NPPATMS SurfaceCombined
    
  3. Read the target data from NPPATMS (which is supposed to have 22 channels) in an interactive Python shell:
>>> import dask.array as da
>>> arr = da.from_zarr("<PATH_TO_DATA>/validation_epoch00000_rank0000.zarr/0/NPPATMS/0/target/data")
>>> print(arr)
    dask.array<from-zarr, shape=(24801, 70), dtype=float32, chunksize=(16392, 70), chunktype=numpy.ndarray>

Thus, the array has 70 channels instead of the expected 22, which corresponds to the ERA5 data in this example.
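
The same check can be scripted for several streams at once. Below is a minimal sketch: `find_channel_mismatches` is a hypothetical helper, the observed shape is the one printed above, and the expected channel count of 22 for NPPATMS is taken from step 3. Expected counts for other streams would have to come from the stream configuration.

```python
def find_channel_mismatches(observed_shapes, expected_channels):
    """Return {stream: (observed, expected)} for streams whose last
    array dimension differs from the expected channel count."""
    mismatches = {}
    for stream, shape in observed_shapes.items():
        expected = expected_channels.get(stream)
        # Skip streams for which no expected channel count is known.
        if expected is not None and shape[-1] != expected:
            mismatches[stream] = (shape[-1], expected)
    return mismatches


# Shape as reported by dask for the NPPATMS target data in step 3.
observed = {"NPPATMS": (24801, 70)}
expected = {"NPPATMS": 22}

mismatches = find_channel_mismatches(observed, expected)
print(mismatches)  # {'NPPATMS': (70, 22)}
```

In practice the `observed` dictionary could be filled from the `.shape` attribute of each array returned by `da.from_zarr`.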

Version

develop

Platform (OS and architecture)

Linux

Relevant log output

See 'Steps to reproduce'-section

Accompanying data

No response

Organisation

JSC

Metadata

Assignees

Labels

bug (Something isn't working), evaluation (anything related to the model evaluation pipeline)

Type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
