Describe the bug
There is a niche but significant performance issue when using certain "non-standard" index types for the input data (X, y). The problem is most obvious when running an evaluation with the CRPS score on distributions that have no numerically exact implementation of the 'energy' method, since repeated sampling of the distribution and the calls to `to_numpy` inside `_get_bc_params_dict` add serious overhead. In the example below it causes a 4x slowdown!
Calling `to_numpy` on a tz-aware datetime index gave a HUGE slowdown (100,000x vs a non-tz-aware DatetimeIndex). This is a pandas -> numpy issue: numpy does not support timezone-aware datetimes (the same applies to other non-native numpy dtypes). I raised an issue with pandas pandas-dev/pandas#62309 (comment) but, as numpy is unlikely to add tz-aware datetimes, pandas cannot improve this; they might consider raising a warning. As mentioned, this issue is magnified in skpro in many contexts due to repeated sampling of the distribution.
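The boxing behaviour behind the slowdown can be seen directly from the dtypes, independent of skpro (a minimal sketch):

```python
import numpy as np
import pandas as pd

idx_naive = pd.date_range("2023-01-01", freq="h", periods=5)
idx_utc = idx_naive.tz_localize("UTC")

# naive index converts to a native numpy datetime64 array -- cheap
print(idx_naive.to_numpy().dtype)  # datetime64[ns]

# tz-aware index has no numpy-native dtype, so to_numpy boxes every
# element into a pd.Timestamp object -- this is the slow path
print(idx_utc.to_numpy().dtype)  # object

# .values drops the timezone (values are in UTC) but stays datetime64,
# avoiding the per-element boxing
print(idx_utc.values.dtype)  # datetime64[ns]
```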
Affected index types:
- tz-aware DatetimeIndex (4x slowdown)
- PeriodIndex (4x slowdown)
- IntervalIndex (12x slowdown)
- Datetime IntervalIndex (4x slowdown)
- MultiIndex with any level containing the above types (>4x slowdown depending on combo)
Not affected
- RangeIndex
- CategoricalIndex
- DatetimeIndex (not tz-aware)
- MultiIndex of the above types only
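Whether an index is affected comes down to whether its values have a numpy-native dtype; the boxing shows up in the dtype that `to_numpy` returns (a quick check, independent of skpro):

```python
import pandas as pd

n = 5
# affected: no numpy-native dtype, so to_numpy() boxes each element
affected = {
    "tz-aware DatetimeIndex": pd.date_range("2023-01-01", periods=n, freq="h", tz="UTC"),
    "PeriodIndex": pd.period_range("2023-01-01", periods=n, freq="D"),
    "IntervalIndex": pd.interval_range(start=0, periods=n),
}
for name, idx in affected.items():
    print(name, idx.to_numpy().dtype)  # object in all three cases

# not affected: numpy-native representations
print(pd.RangeIndex(n).to_numpy().dtype)                        # int64
print(pd.date_range("2023-01-01", periods=n).to_numpy().dtype)  # datetime64[ns]
```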
To Reproduce
In a Jupyter notebook, for cell timings:
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

from skpro.benchmarking.evaluate import evaluate
from skpro.metrics import CRPS
from skpro.regression.compose import TransformedTargetRegressor
from skpro.regression.residual import ResidualDouble

length = 8000
X = np.random.uniform(1, 3, length)
X_ = X + np.random.normal(0, 0.5, length)
y = (X_ ** 2) + X + 1
Xy = pd.DataFrame({"X": X, "y": y})

cv = KFold(n_splits=3)
model = TransformedTargetRegressor(
    ResidualDouble(
        LinearRegression(),
        RandomForestRegressor(),
    ),
    transformer=StandardScaler(),
    # res double failed without SS
    # also provides non-numerical energy method
)

Xy_dt = (
    Xy
    .assign(datetime=pd.date_range("2023-01-01", freq="h", periods=length))
    .set_index('datetime')
)
# non-native numpy index type
Xy_dt_utc = Xy_dt.tz_localize('UTC')
Xy_dt_uk = Xy_dt_utc.tz_convert('Europe/London')
```

```python
X = Xy[['X']]
y = Xy[['y']]
evaluate(model, cv, X, y, scoring=CRPS())  # 4.6 s
```

```python
X = Xy_dt_uk[['X']]
y = Xy_dt_uk[['y']]
evaluate(model, cv, X, y, scoring=CRPS())  # 19.2 s
```
I have an extended notebook covering all the other index types that I will share, but it is a bit long for here.
Expected behavior
To run a bit faster, or to raise a warning for non-standard index dtypes. There are several options I see that can improve the user experience here:
- Use `.values` instead of calling `to_numpy`. This makes tz-aware indices run at the same speed as non-tz-aware ones. The one con is that `.values` is not recommended by pandas and its behaviour could change in the future; however, this is very unlikely as `.values` is used extensively within the pandas community. For more info, see the release notes for `to_numpy`. Locally, this is what I have been using, for a 4x speed-up.
- Just raise a warning for non-standard indices. Users can then preprocess their index accordingly, e.g. `Xy.tz_localize(None)`.
- Consider a more significant change to `_get_bc_params_dict` and how it is called.
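For the warning option, the user-side preprocessing is a one-liner: dropping the timezone before calling evaluate restores the fast path (a sketch, assuming the timezone itself is not needed downstream):

```python
import pandas as pd

n = 5
df = pd.DataFrame(
    {"X": range(n)},
    index=pd.date_range("2023-01-01", periods=n, freq="h", tz="Europe/London"),
)

# drop the timezone so the index has a numpy-native dtype again
df_naive = df.tz_localize(None)

print(df.index.to_numpy().dtype)        # object (boxed Timestamps)
print(df_naive.index.to_numpy().dtype)  # datetime64[ns]
```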
Environment
This is a numpy dtype issue, so speeds shouldn't vary significantly across OS or package versions.
I am running this on macOS with skpro 2.9.2, pandas 2.3, numpy 1.26.4.
Additional context