Description
What happened + What you expected to happen
Hi,
I am a new user of MLForecast and I am highly impressed with its capabilities! However, I discovered a problematic behaviour when it comes to direct forecasting - while it is not technically a bug, it can lead to results that are hard to justify.
Currently, the user can pass exogenous variables via the X_df argument to the predict method, both for recursive forecasts (one model for all horizons) and direct ones (each horizon uses a separate model). However, in the direct approach not all relevant entries in X_df seem to be used - even if we set the horizon to 50, the predict method will only use the X_df values corresponding to horizon == 1. That is problematic on many levels, e.g. if we want to use seasonality variables based on Fourier terms (the value of sin / cos at horizon 50 should be computed from the corresponding date, not from the date observed at horizon 1).
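To make the Fourier point concrete, here is a minimal sketch (plain numpy, independent of MLForecast) of the raw sin / cos terms for a weekly series with season_length=52:
import numpy as np

season_length = 52
for step in (1, 50):
    # position of the forecast date within the yearly cycle
    angle = 2 * np.pi * step / season_length
    print(f"horizon {step}: sin={np.sin(angle):.3f}, cos={np.cos(angle):.3f}")
# The two horizons sit at very different points of the cycle, so reusing the
# horizon-1 sin / cos values at horizon 50 destroys the seasonality signal.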
I attach the script showing the problem - the forecasts do not change when replacing meaningful values in X_df with zeros.
I realize you may not agree this is a bug (as the TimeSeries.predict code includes a comment mentioning that behaviour). If so, please change the Issue tag.
Versions / Dependencies
mlforecast==1.0.2
statsforecast==2.0.1
lightgbm==4.6.0
Reproduction script
import random
import lightgbm as lgb
from mlforecast import MLForecast
from mlforecast.utils import generate_series
from utilsforecast.feature_engineering import fourier
# Forecast Horizon
H = 3
# Adding a static feature, as I don't know how to fit the model with dynamic features only (here sin & cos)
df = generate_series(3, freq="W", min_length=104, n_static_features=1)
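# fourier returns the history augmented with sin/cos columns plus a future frame; only the history is kept here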
df, _ = fourier(df, freq="W", season_length=52, k=1, h=H)
# Train and Test Split
ids = df["unique_id"].unique()
random.seed(0)
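# choices samples with replacement, so sample_ids may contain duplicates; isin below handles that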
sample_ids = random.choices(ids, k=4)
sample_df = df[df["unique_id"].isin(sample_ids)]
test = sample_df.groupby("unique_id").tail(H)
train = sample_df.drop(test.index)
test_X_df = test.copy().drop(columns=["static_0", "y"])
# Replacing all entries of sin & cos in X_df with 0 where horizon > 1.
test_X_df_zeros = test_X_df.copy()
test_X_df_zeros["is_first_ds"] = test_X_df_zeros.groupby(["unique_id"])["ds"].transform(lambda x: x == x.min())
test_X_df_zeros.loc[~test_X_df_zeros["is_first_ds"], ["sin1_52", "cos1_52"]] = 0
test_X_df_zeros = test_X_df_zeros.drop(columns="is_first_ds")
fcst = MLForecast(
models=lgb.LGBMRegressor(n_jobs=1, random_state=0, verbosity=-1),
freq="W",
lags=[1],
date_features=["month"],
num_threads=2,
)
# Direct forecasts, as max_horizon is supplied to the fit method
individual_fcst = fcst.fit(train, static_features=["static_0"], max_horizon=H)
individual_preds = individual_fcst.predict(h=H, X_df=test_X_df)
individual_preds_zeros = individual_fcst.predict(h=H, X_df=test_X_df_zeros)
# Forecasts on original test data and the one with zeros
# Results are the same!
individual_preds_all = individual_preds_zeros.merge(
individual_preds, how="outer", on=["unique_id", "ds"], suffixes=("_zero", "_base")
)
print(individual_preds_all)
print(any((individual_preds_all["LGBMRegressor_zero"] - individual_preds_all["LGBMRegressor_base"]) != 0))
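For a stricter check than the final print, the equality can also be asserted directly (a small optional addition to the script above, pandas only):
import pandas as pd

# Raises if any forecast differs between the zeroed and the original X_df
pd.testing.assert_series_equal(
    individual_preds_all["LGBMRegressor_zero"],
    individual_preds_all["LGBMRegressor_base"],
    check_names=False,
)
# If the assertion passes, only the horizon == 1 rows of X_df
# influenced the direct forecasts.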
Issue Severity
Significant, stopping us from using Nixtla.