
[BUG] Repeated calls to to_numpy in the Distribution _get_bc_params_dict method cause notable slow down #597

@joshdunnlime

Description

Describe the bug

There is a niche but significant performance issue when using some "non-standard" index types for input data (X, y). The problem is most obvious when running an evaluation with the CRPS score on distributions that have no exact implementation of the energy method, because repeated sampling of the distribution, and the accompanying calls to to_numpy inside _get_bc_params_dict, add serious overhead. In the example below it causes a 4x slowdown!

Calling to_numpy on a tz-aware DatetimeIndex gave a HUGE slowdown (100,000x vs a non-tz-aware DatetimeIndex). This is a pandas -> numpy issue: numpy does not support timezone-aware datetimes (the same applies to other non-native numpy dtypes). I raised an issue with pandas pandas-dev/pandas#62309 (comment) but, as numpy is unlikely to add tz-aware datetimes, pandas cannot improve this; they might consider raising a warning. As mentioned, this issue is magnified in skpro in many contexts due to repeated sampling of the distribution.
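To see the underlying pandas -> numpy cost in isolation, here is a minimal micro-benchmark (illustrative only, not skpro code) comparing to_numpy on a plain DatetimeIndex vs the same index localized to a timezone:

```python
import timeit

import pandas as pd

# Same index length as the repro below; one naive copy, one tz-aware.
idx_naive = pd.date_range("2023-01-01", freq="h", periods=8000)
idx_tz = idx_naive.tz_localize("UTC")

t_naive = timeit.timeit(idx_naive.to_numpy, number=100)
t_tz = timeit.timeit(idx_tz.to_numpy, number=100)

# The tz-aware conversion falls back to building an object array of
# Timestamp objects, which is orders of magnitude slower than the
# native datetime64[ns] copy the naive index gets.
print(f"naive: {t_naive:.5f}s  tz-aware: {t_tz:.5f}s  ratio: {t_tz / t_naive:.0f}x")
```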

Index types affected:

  • tz-aware DatetimeIndex (4x slowdown)
  • PeriodIndex (4x slowdown)
  • IntervalIndex (12x slowdown)
  • Datetime IntervalIndex (4x slowdown)
  • MultiIndex with any level containing the above types (>4x slowdown depending on combo)

Not affected

  • RangeIndex
  • CategoricalIndex
  • DatetimeIndex (not tz-aware)
  • MultiIndex of the above types only

To Reproduce

In a Jupyter notebook (for cell timings):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from skpro.benchmarking.evaluate import evaluate
from skpro.metrics import CRPS
from skpro.regression.compose import TransformedTargetRegressor
from skpro.regression.residual import ResidualDouble

length = 8000
X = np.random.uniform(1, 3, length)
X_ = X + np.random.normal(0, 0.5, length)
y = (X_ ** 2) + X + 1
Xy = pd.DataFrame({"X": X, "y": y})

cv = KFold(n_splits=3)

model = TransformedTargetRegressor(
    ResidualDouble(
        LinearRegression(),
        RandomForestRegressor(),
    ),
    transformer=StandardScaler(),
    # ResidualDouble failed without the StandardScaler;
    # this setup also yields a distribution without an exact energy method
)

Xy_dt = (
    Xy
    .assign(datetime=pd.date_range("2023-01-01", freq="h", periods=length))
    .set_index('datetime')
)

# non-native numpy index type
Xy_dt_utc = Xy_dt.tz_localize('UTC')
Xy_dt_uk = Xy_dt_utc.tz_convert('Europe/London')

X = Xy[['X']]
y = Xy[['y']]
evaluate(model, cv, X, y, scoring=CRPS())

4.6s

X = Xy_dt_uk[['X']]
y = Xy_dt_uk[['y']]
evaluate(model, cv, X, y, scoring=CRPS())

19.2s

I have an extended notebook covering all the other index types that I will share, but it is a bit long for here.

Expected behavior

Either run faster, or raise a warning for non-standard index dtypes. I see several options that could improve the user experience here:

  1. Use .values instead of calling to_numpy. This makes the tz-aware case run at the same speed as the non-tz-aware one. The one con is that .values is not recommended by pandas and its behaviour could change in the future. However, this is very unlikely, as .values is used extensively within the pandas community. For more info see the release notes for to_numpy. Locally, this is what I have been using, for a 4x speed-up.

  2. Just raise a warning for non-standard indices. Users can preprocess their index accordingly, e.g. Xy.tz_localize(None).

  3. Consider a more significant change to _get_bc_params_dict and how it is called.
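For illustration, a minimal sketch (index values are arbitrary) of the dtype difference behind options 1 and 2 on a tz-aware index:

```python
import pandas as pd

idx = pd.date_range("2023-01-01", periods=4, freq="h", tz="Europe/London")

# Option 1: .to_numpy() on a tz-aware index yields an object array of
# Timestamp objects (the slow path), while .values drops the timezone
# and returns a native datetime64[ns] array (the fast path).
arr_slow = idx.to_numpy()
arr_fast = idx.values
print(arr_slow.dtype)  # object
print(arr_fast.dtype)  # datetime64[ns]

# Option 2: users can strip the timezone up front themselves, after
# which to_numpy also takes the fast native path.
arr_naive = idx.tz_localize(None).to_numpy()
print(arr_naive.dtype)  # datetime64[ns]
```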

Environment
This is a numpy dtype issue, so speeds shouldn't vary significantly across OS or package versions.

I am running this on macOS with skpro 2.9.2, pandas 2.3, and numpy 1.26.4.

Additional context
