Describe the bug
There is a niche but significant performance issue when using certain "non-standard" index types for the input data (X, y). The problem is most obvious when running an evaluation with the CRPS score on distributions that have no numerically exact implementation of the 'energy' method, since repeated sampling of the distribution and the calls to `to_numpy` inside `_get_bc_params_dict` add serious overhead. In the example below it causes a 4x slowdown!
Calling `to_numpy` on a tz-aware datetime index gave a HUGE slowdown (100,000x vs a non-tz-aware DatetimeIndex). This is a pandas -> numpy issue: numpy does not support timezone-aware datetimes (the same applies to other non-native numpy dtypes). I raised an issue with pandas pandas-dev/pandas#62309 (comment) but, as numpy is unlikely to add tz-aware datetimes, pandas cannot improve this; they might consider raising a warning. As mentioned, this issue is magnified in skpro in many contexts due to repeated sampling of the distribution.
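The boxing behaviour behind the slowdown can be seen directly from the dtypes, independent of skpro (a minimal sketch):

```python
import numpy as np
import pandas as pd

idx_naive = pd.date_range("2023-01-01", freq="h", periods=5)
idx_utc = idx_naive.tz_localize("UTC")

# naive index converts to a native numpy datetime64 array -- cheap
print(idx_naive.to_numpy().dtype)  # datetime64[ns]

# tz-aware index has no numpy-native dtype, so to_numpy boxes every
# element into a pd.Timestamp object -- this is the slow path
print(idx_utc.to_numpy().dtype)  # object

# .values drops the timezone (values are in UTC) but stays datetime64,
# avoiding the per-element boxing
print(idx_utc.values.dtype)  # datetime64[ns]
```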
Affected index types:
- tz-aware DatetimeIndex (4x slowdown)
- PeriodIndex (4x slowdown)
- IntervalIndex (12x slowdown)
- Datetime IntervalIndex (4x slowdown)
- MultiIndex with any level containing the above types (>4x slowdown depending on combo)
Not affected
- RangeIndex
- CategoricalIndex
- DatetimeIndex (not tz-aware)
- MultiIndex of the above types only
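Whether an index is affected comes down to whether its values have a numpy-native dtype; the boxing shows up in the dtype that `to_numpy` returns (a quick check, independent of skpro):

```python
import pandas as pd

n = 5
# affected: no numpy-native dtype, so to_numpy() boxes each element
affected = {
    "tz-aware DatetimeIndex": pd.date_range("2023-01-01", periods=n, freq="h", tz="UTC"),
    "PeriodIndex": pd.period_range("2023-01-01", periods=n, freq="D"),
    "IntervalIndex": pd.interval_range(start=0, periods=n),
}
for name, idx in affected.items():
    print(name, idx.to_numpy().dtype)  # object in all three cases

# not affected: numpy-native representations
print(pd.RangeIndex(n).to_numpy().dtype)                        # int64
print(pd.date_range("2023-01-01", periods=n).to_numpy().dtype)  # datetime64[ns]
```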
To Reproduce
In a Jupyter notebook, for cell timings:
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

from skpro.benchmarking.evaluate import evaluate
from skpro.metrics import CRPS
from skpro.regression.compose import TransformedTargetRegressor
from skpro.regression.residual import ResidualDouble

length = 8000
X = np.random.uniform(1, 3, length)
X_ = X + np.random.normal(0, 0.5, length)
y = (X_ ** 2) + X + 1
Xy = pd.DataFrame({"X": X, "y": y})

cv = KFold(n_splits=3)
model = TransformedTargetRegressor(
    ResidualDouble(
        LinearRegression(),
        RandomForestRegressor(),
    ),
    transformer=StandardScaler(),
    # res double failed without SS
    # also provides non-numerical energy method
)

Xy_dt = (
    Xy
    .assign(datetime=pd.date_range("2023-01-01", freq="h", periods=length))
    .set_index('datetime')
)
# non-native numpy index type
Xy_dt_utc = Xy_dt.tz_localize('UTC')
Xy_dt_uk = Xy_dt_utc.tz_convert('Europe/London')
```

```python
X = Xy[['X']]
y = Xy[['y']]
evaluate(model, cv, X, y, scoring=CRPS())  # 4.6 s
```

```python
X = Xy_dt_uk[['X']]
y = Xy_dt_uk[['y']]
evaluate(model, cv, X, y, scoring=CRPS())  # 19.2 s
```
I have an extended notebook covering all the other index types that I will share, but it is a bit long for here.
Expected behavior
To run a bit faster, or to raise a warning for non-standard index dtypes. There are several options I see that can improve the user experience here:
- Use `.values` instead of calling `to_numpy`. This makes tz-aware indices run at the same speed as non-tz-aware ones. The one con is that `.values` is not recommended by pandas and its behaviour could change in the future; however, this is very unlikely as `.values` is used extensively within the pandas community. For more info, see the release notes for `to_numpy`. Locally, this is what I have been using, for a 4x speed-up.
- Just raise a warning for non-standard indices. Users can then preprocess their index accordingly, e.g. `Xy.tz_localize(None)`.
- Consider a more significant change to `_get_bc_params_dict` and how it is called.
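For the warning option, the user-side preprocessing is a one-liner: dropping the timezone before calling evaluate restores the fast path (a sketch, assuming the timezone itself is not needed downstream):

```python
import pandas as pd

n = 5
df = pd.DataFrame(
    {"X": range(n)},
    index=pd.date_range("2023-01-01", periods=n, freq="h", tz="Europe/London"),
)

# drop the timezone so the index has a numpy-native dtype again
df_naive = df.tz_localize(None)

print(df.index.to_numpy().dtype)        # object (boxed Timestamps)
print(df_naive.index.to_numpy().dtype)  # datetime64[ns]
```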
Environment
This is a numpy dtype issue, so speeds shouldn't vary significantly across OS or package versions.
I am running this on macOS with skpro 2.9.2, pandas 2.3, numpy 1.26.4.
Additional context