Skip to content

Predictor Arrays in one Dataset per scenario #677

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

veni-vidi-vici-dormivi
Copy link
Collaborator

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi commented May 6, 2025

Okay, since this is maybe the last chance to change something big about the DataTree structure, here goes my idea:

Instead of collecting the different predictor variables (tas, hfds) in subtrees and those holding scenarios for each variables like this:

├─ tas
│  ├─ hist
│  ├─ scen1
│  └─ ...
├─ hfds
│  ├─ hist
│  ├─ scen1
│  └─ ...
└─ ...

I suggest putting the variables together into datasets and these into a Tree:

├─ hist
|        datavars: tass, hfds....
├─ scen1
|        datavars: tass, hfds....
└─ ...

This is more elegant since the different predictors actually live on the same coordinate system they do fit into a dataset together - so we would be using the data structures in their intended way. We same on memory because we don't have all the coordinated duplicated (which is the rationale behind the intended use).

However, I feel that the Code does get more complicated like this. So I need a second opinion on this.

  • Tests added
  • Fully documented, including CHANGELOG.rst

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi marked this pull request as draft May 6, 2025 13:38
@mathause
Copy link
Member

mathause commented May 8, 2025

Good point - the predictor datatree had to look different than the target which is maybe a bit confusing... How difficult/ annoying is it to add a new variable to the nodes?

This will have an influence on MESMER-X.


The third option is to have separate datatrees

tas:
├─ hist
├─ scen1
└─ ...

hfds:
├─ hist
├─ scen1
└─ ...

@veni-vidi-vici-dormivi
Copy link
Collaborator Author

How difficult/ annoying is it to add a new variable to the nodes?

Not so bad if it shouldn't be generalizable/automated (as in the tests) -- i.e. if you know which ones you want and can do it by names.

The much more annoying part is getting an isomorphic datatree with a subset of variables from one with all variables. Which we need if we eg load a datatree that has both tas and hfds and somewhere in the code only want one with tas.

This will have an influence on MESMER-X.

Yes.

The third option is to have separate datatrees

Yes. Also not a big fan of that

@mathause
Copy link
Member

mathause commented May 8, 2025

How difficult/ annoying is it to add a new variable to the nodes?

Not so bad if it shouldn't be generalizable/automated (as in the tests) -- i.e. if you know which ones you want and can do it by names.

The much more annoying part is getting an isomorphic datatree with a subset of variables from one with all variables. Which we need if we eg load a datatree that has both tas and hfds and somewhere in the code only want one with tas.

Yes this is a general problem - e.g. when we wrote a function for a DataArray we cannot easily generalize to Dataset or DataTree. I am not a fan for passing the variable name, but it may be unavoidable. (Related - there are some functions that check you pass a Dataset with only one variable - we'd need to adjust those...)

This will have an influence on MESMER-X.

Yes.

The third option is to have separate datatrees

Yes. Also not a big fan of that

Ok

Copy link

codecov bot commented May 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.21%. Comparing base (aa049b9) to head (d16411a).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #677      +/-   ##
==========================================
- Coverage   92.28%   92.21%   -0.07%     
==========================================
  Files          48       48              
  Lines        2785     2762      -23     
==========================================
- Hits         2570     2547      -23     
  Misses        215      215              
Flag Coverage Δ
unittests 92.21% <100.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mathause mathause mentioned this pull request Jun 7, 2025
3 tasks
exclude.add("tas2")
if use_hfds:
exclude.add("hfds")
local_variability_from_global_var = lr.predict(predictors, exclude=exclude)
Copy link
Member

@mathause mathause Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes me wonder if we should allow lr.predict(predictors, only="tas_resids").

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that would be pretty, but not urgent.

@veni-vidi-vici-dormivi
Copy link
Collaborator Author

is #666 done?

@mathause
Copy link
Member

#666 is done (I think) but I need to merge the other 2 PRs first. I'll give this another pass, and we can maybe merge all of them tomorrow.

@veni-vidi-vici-dormivi
Copy link
Collaborator Author

and we can maybe merge all of them tomorrow.

🎉 🎉

then we'll deal with mesmer-x and the tutorials next

Copy link
Member

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mathause
Copy link
Member

@veni-vidi-vici-dormivi can you merge main, re-create the test files and push - then we can merge

@mathause
Copy link
Member

Hmm strange that the power_transformer numbers failed. I think that should not have changed 🤔

@veni-vidi-vici-dormivi
Copy link
Collaborator Author

Hmm strange that the power_transformer numbers failed. I think that should not have changed 🤔

Yeah no, I just pushed new files where I left that file the same and it looks good. The thing is that this one fails for me locally. That already happened before this PR, but I don't know when. 🙈

@mathause
Copy link
Member

Hmm strange that the power_transformer numbers failed. I think that should not have changed 🤔

Yeah no, I just pushed new files where I left that file the same and it looks good. The thing is that this one fails for me locally. That already happened before this PR, but I don't know when. 🙈

Would be good to get it to pass for you and on GHA, but yes optimizations are annoying 😞

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi merged commit af526b1 into MESMER-group:main Jun 12, 2025
11 checks passed
@veni-vidi-vici-dormivi veni-vidi-vici-dormivi deleted the multiple_preds branch June 12, 2025 14:46
veni-vidi-vici-dormivi added a commit to veni-vidi-vici-dormivi/mesmer that referenced this pull request Jun 19, 2025
…-group#677)

* started draft to have the different predictors in one dataset per scenario

* use datatree.assign

* add datatree merge function

* merge datatrees for lr predictors in calibration test

* adjust broadcast_and_stack

* finish new datastructure implementation (fails)

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove cmip gen

* change merge method to use lambda

* select tas instead of dropping other vars from datatree

* edit docstring of broadcast_and_stack_scenarios

* adjust test of broadcast_and_stack_scenarios

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* merge docstring

* test merge function

* extend _skip_empty_nodes to non DataTree args in first position

* wrap LinearRegression.predict() for datatree

* adjust volc

* fix unit tests

* fix in _skip_empty_nodes for only coords datasets

* datatree wrapper for draw autoregression funcs

* remove DataTree in LinearRegression.fit()

* updated draw_realisations new (in progress)

* new data reading in draw realisations

* move tas**2 after volcanic forcing

* new emulation sets

* adjustments for other tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/integration/test_calibrate_mesmer_newcodepath.py

* Update tests/integration/test_draw_realisations_mesmer_x.py

* Update tests/integration/test_draw_realisations_mesmer_x.py

* fix merge error

* Update mesmer/core/datatree.py

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

* Update mesmer/core/datatree.py

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

* merge variable datasets in load

* move xarray type imports to the top

* implement suggestions in draw_realisations_mesmer_newcodepath

* precommit

* CHANGELOG

* adjust numbers in tests related to aod

* update testfiles

* new AR params test file for calibrate_mesmer_x with 2nd fit

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathias.hauser@env.ethz.ch>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants