add kwarg to handle invalid files in open_mfdataset #9955


Open · wants to merge 58 commits into main
Conversation

pratiman-91

Added a new argument to open_mfdataset to better handle invalid files.

errors : {'ignore', 'raise', 'warn'}, default 'raise'
        - If 'raise', an exception is raised for any invalid dataset.
        - If 'ignore', invalid datasets are skipped.
        - If 'warn', a warning is issued for each invalid dataset.


welcome bot commented Jan 16, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@max-sixty
Collaborator

I'm not the expert, but this looks reasonable! Any other thoughts?

Assuming no one thinks it's a bad idea, we would need tests.

Collaborator

@headtr1ck headtr1ck left a comment


I think it is a good idea.

But the way it is implemented here seems overly complicated and repetitive.
I would suggest inverting the logic: first build up the list inside a single try, then handle the three cases in the except block.

pratiman-91 and others added 2 commits January 17, 2025 10:53
Co-authored-by: Michael Niklas  <mick.niklas@gmail.com>
Collaborator

@headtr1ck headtr1ck left a comment


Almost there.

Also, we should add tests for this.

@pratiman-91
Author

@headtr1ck Thanks for the suggestions. I have added two tests (ignore and warn). While testing, I also found that the new argument broke combine="nested" due to invalid ids. I have now modified it to produce the correct ids, and it passes the tests. Please review the tests and the latest version.

@pratiman-91
Author

Hi @headtr1ck, I have been thinking about the handling of ids. The current version looks like patchwork (I am not happy with it). I think we can create the ids after removing all the invalid datasets from path1d within the combine="nested" block. Please let me know what you think.
Thanks!

@pratiman-91
Author

@max-sixty Can you please go through the PR? Thanks!

@max-sixty
Collaborator

I'm admittedly much less familiar with this section of the code, but nothing seems wrong!

I think we should bias towards merging, so if no one has concerns then I'd vote to merge.

Could we fix the errors in the docs?

@pratiman-91
Author

It seems like one test failed: test_sparse_dask_dataset_repr (xarray.tests.test_sparse.TestSparseDataArrayAndDataset). It is not related to this PR.

@kmuehlbauer kmuehlbauer added the plan to merge Final call for comments label Jul 3, 2025
@kmuehlbauer kmuehlbauer changed the title Open mfdataset enchancement add kwarg to handle invalid files in open_mfdataset Jul 3, 2025
@kmuehlbauer kmuehlbauer removed the request for review from headtr1ck July 3, 2025 06:12
@kmuehlbauer kmuehlbauer enabled auto-merge (squash) July 4, 2025 04:52
@kmuehlbauer
Contributor

@headtr1ck Would you mind having one last look here? I'm not able to merge this without your interaction with respect to a requested change. Thanks!

@kmuehlbauer kmuehlbauer requested a review from headtr1ck July 4, 2025 05:20
@kmuehlbauer
Contributor

I'll close and reopen, just to check how this affects the merge rules.

@kmuehlbauer kmuehlbauer closed this Jul 7, 2025
auto-merge was automatically disabled July 7, 2025 06:40

Pull request was closed

@kmuehlbauer kmuehlbauer reopened this Jul 7, 2025
@headtr1ck
Collaborator

@headtr1ck Would you mind having one last look here? I'm not able to merge this without your interaction with respect to a requested change. Thanks!

Sorry for the late reply; I'm currently on holiday.

I am not sure about this approach.

For 1D arrays it works fine (even though users may want the option to have NaNs in the positions of missing values, but due to xarray's indexing this should not be an issue).

But for ND nested lists, this PR basically only works if the user supplies an additional file in place of the broken file. This implies that the user already knows one file is broken, so I wonder whether this is useful at all?

I think the main use case is the following:
[["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.

But maybe this is supposed to be an additional PR?
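The NaN-fill idea could look roughly like this (a sketch only, not part of this PR; a real implementation would substitute an all-NaN xarray.Dataset with coords copied from a valid neighbour, rather than a plain placeholder):

```python
def fill_gaps(nested, is_broken, placeholder):
    # Instead of dropping a broken entry, replace it with a placeholder
    # (e.g. an all-NaN block shaped like a valid neighbour) so the
    # nested grid keeps its shape for combine="nested".
    return [
        [placeholder if is_broken(f) else f for f in row]
        for row in nested
    ]
```

With this shape-preserving approach, the [["1.nc", "2.nc"], ["broken.nc", "4.nc"]] case above would still combine cleanly.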

@kmuehlbauer
Contributor

kmuehlbauer commented Jul 7, 2025

Thanks @headtr1ck, I've had similar concerns here: #9955 (comment).

@pratiman-91's argument was that if the provided nested list leads to a consistent output (after removal of any breaking file), it should return without erroring out. It's hard to imagine a real-life use case where and how this could happen. @pratiman-91, can you please elaborate on your use case here?

I think the main use case is the following: [["1.nc", "2.nc"], ["broken.nc", "4.nc"]] Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.

But maybe this is supposed to be an additional PR?

I'd totally support this approach, although it seems no easy task. But it should go in a follow-up PR.

From my side, this PR is still a first way forward, giving the user a bit more freedom when opening/combining nested lists of files.

@pratiman-91
Author

Thanks, @headtr1ck and @kmuehlbauer.

I think the main use case is the following:
[["1.nc", "2.nc"], ["broken.nc", "4.nc"]]
Now the question is what should actually happen here... I think the only way out is to fill in the gap with NaNs.

In this case, two things will happen:

  1. If removing broken.nc results in a logically valid dataset, then we proceed with the rest.

  2. If it still doesn't form a valid dataset, an error should be raised.

Filling with NaNs introduces its own set of challenges. For instance, if the first file itself is broken, there's no reliable source for metadata needed to correctly generate NaNs. This approach becomes significantly more complex than simply skipping the broken file and would require a separate PR to handle properly.
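The two cases can be illustrated with a hypothetical helper (not the PR's code): drop the broken entries, then check whether what remains still forms a rectangular grid that combine="nested" could handle:

```python
def drop_broken(nested, is_broken):
    # Case 1: removal still yields a consistent grid -> proceed.
    # Case 2: removal leaves ragged rows -> raise.
    cleaned = [[f for f in row if not is_broken(f)] for row in nested]
    cleaned = [row for row in cleaned if row]  # drop fully emptied rows
    widths = {len(row) for row in cleaned}
    if len(widths) > 1:
        raise ValueError(
            "removing broken files left ragged rows; cannot combine"
        )
    return cleaned
```

For [["1.nc", "2.nc"], ["broken.nc", "4.nc"]] this raises (case 2), whereas a row consisting entirely of broken files can be dropped whole and still leave a valid grid (case 1).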

Collaborator

@headtr1ck headtr1ck left a comment


Some minor changes are still required.

Another question: what happens now if someone passes, e.g., a 2x2 list of files where one is broken?

Because as far as I can tell, with errors="ignore" this file will be silently removed, but later on the dataset cannot be constructed and will quite likely throw an error that confuses the user.

@pratiman-91
Author

@headtr1ck

Some minor changes are still required.

I have made changes based on your suggestions.

Another question: what happens now if someone passes a e.g. 2x2 list of files where one is broken?

Because as far as I can tell, if errors="ignore" this file will be silently removed but then later on the dataset cannot be constructed and quite likely will throw an error that will confuse the user.

I agree, that would be the case. An important assumption is that removing the files does not affect the overall validity of the datasets. I think it should be up to the user whether to use that option.

@pratiman-91 pratiman-91 requested a review from headtr1ck July 9, 2025 16:15
Successfully merging this pull request may close these issues.

better handling of invalid files in open_mfdataset