add kwarg to handle invalid files in open_mfdataset #9955


Open · wants to merge 58 commits into main
Conversation

pratiman-91

Added a new argument to open_mfdataset to better handle invalid files.

errors : {'ignore', 'raise', 'warn'}, default 'raise'
        - If 'raise', an exception is raised for any invalid dataset.
        - If 'ignore', invalid datasets are skipped.
        - If 'warn', a warning is issued for each invalid dataset.


welcome bot commented Jan 16, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@max-sixty
Collaborator

I'm not the expert, but this looks reasonable! Any other thoughts?

Assuming no one thinks it's a bad idea, we would need tests.

Collaborator

@headtr1ck headtr1ck left a comment


I think it is a good idea.

But the way it is implemented here seems overly complicated and repetitive.
I would suggest inverting the logic: first build up the list inside a single try, then handle the three cases in the except block.

pratiman-91 and others added 2 commits January 17, 2025 10:53
Co-authored-by: Michael Niklas  <mick.niklas@gmail.com>
Collaborator

@headtr1ck headtr1ck left a comment


Almost there.

Also, we should add tests for this.

@pratiman-91
Author

@headtr1ck Thanks for the suggestions. I have added two tests (ignore and warn). While testing, I also found that the new argument broke combine="nested" due to invalid ids. I have now modified it to produce the correct ids, and it passes the tests. Please review the tests and the latest version.

@pratiman-91
Author

Hi @headtr1ck, I have been thinking about the handling of ids. The current version looks like patchwork (I am not happy with it). I think we can create the ids after removing all the invalid datasets from path1d within the combine="nested" block. Please let me know what you think.
Thanks!

@pratiman-91
Author

@max-sixty Can you please go through the PR? Thanks!

@max-sixty
Collaborator

I'm admittedly much less familiar with this section of the code, but nothing seems wrong!

I think we should bias towards merging, so if no one has concerns then I'd vote to merge.

Could we fix the errors in the docs?

@pratiman-91
Author

It seems like one test failed: test_sparse_dask_dataset_repr (xarray.tests.test_sparse.TestSparseDataArrayAndDataset). It is not related to this PR.

@kmuehlbauer kmuehlbauer added the plan to merge Final call for comments label Jul 3, 2025
@kmuehlbauer kmuehlbauer changed the title Open mfdataset enchancement add kwarg to handle invalid files in open_mfdataset Jul 3, 2025
@kmuehlbauer kmuehlbauer removed the request for review from headtr1ck July 3, 2025 06:12
@kmuehlbauer kmuehlbauer enabled auto-merge (squash) July 4, 2025 04:52
@kmuehlbauer
Contributor

@headtr1ck Would you mind having one last look here? I'm not able to merge this without your interaction with respect to a requested change. Thanks!

@kmuehlbauer kmuehlbauer requested a review from headtr1ck July 4, 2025 05:20
@kmuehlbauer
Contributor

I'll close and reopen, just to check how this affects the merge rules.

@kmuehlbauer kmuehlbauer closed this Jul 7, 2025
auto-merge was automatically disabled July 7, 2025 06:40

Pull request was closed

@kmuehlbauer kmuehlbauer reopened this Jul 7, 2025
@headtr1ck
Collaborator

@headtr1ck Would you mind having one last look here? I'm not able to merge this without your interaction with respect to a requested change. Thanks!

Sorry for the late reply; I'm currently on holiday.

I am not sure about this approach.

For 1D arrays it works fine (even though users may want the option to have NaNs in the positions of missing values, but due to xarray's indexing this should not be an issue).

But for ND nested lists, this PR basically only works if the user supplies an additional file in place of the broken file. This implies that the user already knows one file is broken, so I wonder whether this is useful at all?

I think the main use case is the following:
[["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.

But maybe this is supposed to be an additional PR?
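The NaN-fill idea could look roughly like this (a sketch only, not part of this PR; a real implementation would substitute an all-NaN xarray.Dataset with coords copied from a valid neighbour, rather than a plain placeholder):

```python
def fill_gaps(nested, is_broken, placeholder):
    # Instead of dropping a broken entry, replace it with a placeholder
    # (e.g. an all-NaN block shaped like a valid neighbour) so the
    # nested grid keeps its shape for combine="nested".
    return [
        [placeholder if is_broken(f) else f for f in row]
        for row in nested
    ]
```

With this shape-preserving approach, the [["1.nc", "2.nc"], ["broken.nc", "4.nc"]] case above would still combine cleanly.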

@kmuehlbauer
Contributor

kmuehlbauer commented Jul 7, 2025

Thanks @headtr1ck, I've had similar concerns here: #9955 (comment).

@pratiman-91's argument was that if the provided nested list leads to a consistent output (after removal of any breaking file), it should return without erroring out. It's hard to imagine a real-life use case where and how this could happen. @pratiman-91, can you please elaborate on your use case here?

I think the main use case is the following: [["1.nc", "2.nc"], ["broken.nc", "4.nc"]] Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.

But maybe this is supposed to be an additional PR?

I'd totally support this approach, although it seems no easy task. But it should go in a follow-up PR.

From my side, this PR is still a first way forward, giving the user a bit more freedom when opening/combining nested lists of files.

@pratiman-91
Author

Thanks, @headtr1ck and @kmuehlbauer.

I think the main use case is the following:
[["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.

In this case, two things will happen:

  1. If removing broken.nc results in a logically valid dataset, then we proceed with the rest.

  2. If it still doesn't form a valid dataset, an error should be raised.

Filling with NaNs introduces its own set of challenges. For instance, if the first file itself is broken, there's no reliable source for metadata needed to correctly generate NaNs. This approach becomes significantly more complex than simply skipping the broken file and would require a separate PR to handle properly.
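The two cases can be illustrated with a hypothetical helper (not the PR's code): drop the broken entries, then check whether what remains still forms a rectangular grid that combine="nested" could handle:

```python
def drop_broken(nested, is_broken):
    # Case 1: removal still yields a consistent grid -> proceed.
    # Case 2: removal leaves ragged rows -> raise.
    cleaned = [[f for f in row if not is_broken(f)] for row in nested]
    cleaned = [row for row in cleaned if row]  # drop fully emptied rows
    widths = {len(row) for row in cleaned}
    if len(widths) > 1:
        raise ValueError(
            "removing broken files left ragged rows; cannot combine"
        )
    return cleaned
```

For [["1.nc", "2.nc"], ["broken.nc", "4.nc"]] this raises (case 2), whereas a row consisting entirely of broken files can be dropped whole and still leave a valid grid (case 1).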

Collaborator

@headtr1ck headtr1ck left a comment


Some minor changes are still required.

Another question: what happens now if someone passes, e.g., a 2x2 list of files where one is broken?

Because as far as I can tell, with errors="ignore" this file will be silently removed, but later on the dataset cannot be constructed and will quite likely throw an error that confuses the user.

@pratiman-91
Author

@headtr1ck

Some minor changes are still required.

I have made changes based on your suggestions.

Another question: what happens now if someone passes a e.g. 2x2 list of files where one is broken?

Because as far as I can tell, if errors="ignore" this file will be silently removed but then later on the dataset cannot be constructed and quite likely will throw an error that will confuse the user.

I agree, that would be the case. An important assumption is that removing the files does not affect the overall validity of the datasets. I think it should be up to the user whether to use that option.

@pratiman-91 pratiman-91 requested a review from headtr1ck July 9, 2025 16:15
Successfully merging this pull request may close these issues.

better handling of invalid files in open_mfdataset