[core] Direct Forecasts - Incorrect Treatment of Exogenous Variables #496

@syrop87

Description

What happened + What you expected to happen

Hi,

I am a new user of MLForecast and I am highly impressed with its capabilities! However, I discovered problematic behaviour in direct forecasting - while it is not technically a bug, it can lead to results that are hard to justify.

Currently, the user can pass exogenous variables via the X_df argument to the predict method, both for recursive forecasts (one model for all horizons) and direct ones (a separate model per horizon). However, in the direct approach not all relevant entries in X_df are used - even with a horizon of 50, predict uses only the X_df values corresponding to horizon == 1. That is problematic on many levels, e.g. for seasonality variables built from Fourier terms, where the value of sin / cos at horizon 50 should be computed from the corresponding date, not from the date observed at horizon 1 (see the sketch below).
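To make the Fourier point concrete, here is a minimal standalone sketch (the step range and season length are made up for illustration and do not come from MLForecast):

import numpy as np

# First-order sine term of a 52-week season, evaluated per step ahead.
steps = np.arange(1, 51)
sin1_52 = np.sin(2 * np.pi * steps / 52)

# The horizon-50 value differs strongly from the horizon-1 value,
# so reusing the horizon-1 row for every horizon discards the seasonal signal.
print(sin1_52[0])   # ~0.12  (horizon 1)
print(sin1_52[-1])  # ~-0.24 (horizon 50)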

I attach a script showing the problem below - the forecasts do not change when the meaningful values in X_df are replaced with zeros.

I realize you may not agree this is a bug (the TimeSeries.predict code includes a comment mentioning this behaviour). If so, please change the issue tag.

Versions / Dependencies

mlforecast==1.0.2
statsforecast==2.0.1
lightgbm==4.6.0

Reproduction script

import random
import lightgbm as lgb
from mlforecast import MLForecast
from mlforecast.utils import generate_series
from utilsforecast.feature_engineering import fourier

# Forecast Horizon
H = 3

# Adding a static feature, as I don't know how to fit the model with dynamic features only (here sin & cos)
df = generate_series(3, freq="W", min_length=104, n_static_features=1)
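# fourier appends the sin/cos seasonal columns to the history and also returns
# a frame with the H future feature rows (discarded here via `_`).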
df, _ = fourier(df, freq="W", season_length=52, k=1, h=H)

# Train and Test Split
ids = df["unique_id"].unique()
random.seed(0)
sample_ids = random.choices(ids, k=4)
sample_df = df[df["unique_id"].isin(sample_ids)]
test = sample_df.groupby("unique_id").tail(H)
train = sample_df.drop(test.index)

test_X_df = test.copy().drop(columns=["static_0", "y"])

# Replacing all entries of sin & cos in X_df with 0 where horizon > 1.
test_X_df_zeros = test_X_df.copy()
test_X_df_zeros["is_first_ds"] = test_X_df_zeros.groupby(["unique_id"])["ds"].transform(lambda x: x == x.min())
test_X_df_zeros.loc[~test_X_df_zeros["is_first_ds"], ["sin1_52", "cos1_52"]] = 0
test_X_df_zeros = test_X_df_zeros.drop(columns="is_first_ds")

fcst = MLForecast(
    models=lgb.LGBMRegressor(n_jobs=1, random_state=0, verbosity=-1),
    freq="W",
    lags=[1],
    date_features=["month"],
    num_threads=2,
)

# Direct forecasts as max_horizon is supplied to fit method
individual_fcst = fcst.fit(train, static_features=["static_0"], max_horizon=H)
individual_preds = individual_fcst.predict(h=H, X_df=test_X_df)
individual_preds_zeros = individual_fcst.predict(h=H, X_df=test_X_df_zeros)

# Compare forecasts on the original test data and on the zeroed copy.
# The results are identical!
individual_preds_all = individual_preds_zeros.merge(
    individual_preds, how="outer", on=["unique_id", "ds"], suffixes=("_zero", "_base")
)
print(individual_preds_all)
print(any((individual_preds_all["LGBMRegressor_zero"] - individual_preds_all["LGBMRegressor_base"]) != 0))  # prints False: the zeroed values had no effect
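For contrast, a hedged sketch of the recursive case, reusing the objects above (not verified here - I expect the zeroed X_df to change the forecasts, since recursive predict consumes every row of X_df):

# Recursive forecasts: no max_horizon, one model rolled forward over all horizons.
recursive_fcst = fcst.fit(train, static_features=["static_0"])
recursive_preds = recursive_fcst.predict(h=H, X_df=test_X_df)
recursive_preds_zeros = recursive_fcst.predict(h=H, X_df=test_X_df_zeros)

# Expected to print True: zeroing sin/cos at horizons > 1 should change these forecasts.
print(any((recursive_preds_zeros["LGBMRegressor"] - recursive_preds["LGBMRegressor"]) != 0))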

Issue Severity

Significant, stopping us from using Nixtla.
