Bug description
The function save_time_based_splits in data_utils.py does not work correctly in CPU mode. In particular, the helper _save_time_based_splits_cpu assumes that the RAPIDS libraries are available, and the Dask DataFrame import appears to be incorrect.
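To illustrate the kind of guard a CPU path would need, here is a minimal sketch that chooses a dataframe backend without touching RAPIDS when cpu=True. The helper name select_dataframe_backend is hypothetical and is not part of Transformers4Rec; it only uses importlib from the standard library.

```python
import importlib.util


def select_dataframe_backend(cpu: bool) -> str:
    """Return the name of the dataframe module to use (hypothetical helper).

    With cpu=True the RAPIDS package is never looked up, so the CPU path
    also works on machines without a GPU stack installed.
    """
    if cpu:
        return "pandas"
    # Only probe for cudf when the GPU path is explicitly requested.
    if importlib.util.find_spec("cudf") is None:
        raise ImportError("cudf is required when cpu=False")
    return "cudf"


backend_name = select_dataframe_backend(cpu=True)
```

The point of the sketch is only that the cpu flag should be checked before any GPU-only import is attempted.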
Steps/Code to reproduce bug
Run the code from the getting-started example with the cpu option set to True (https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/getting-started-session-based/01-ETL-with-NVTabular.ipynb):
sessions_gdf = df.read_parquet(BASE_PATH / "processed_nvt/part_0.parquet")
from transformers4rec.utils.data_utils import save_time_based_splits
save_time_based_splits(
data=nvt.Dataset(sessions_gdf),
output_dir=BASE_PATH / f"session_by_day",
partition_col="day-first",
timestamp_col="session_id",
cpu=True
)
Expected behavior
No exception is thrown and the data is split.
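For reference, a rough sketch of a CPU-only time-based split using just pandas and NumPy. This is not the library's implementation: the function name split_by_time_cpu, the disjoint train/valid/test assignment, the default sizes, and the in-memory return value (instead of writing Parquet files) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_by_time_cpu(df, partition_col, timestamp_col,
                      val_size=0.1, test_size=0.1, seed=0):
    """Hypothetical CPU-only split: one train/valid/test triple per partition.

    Returns {partition_value: {"train": df, "valid": df, "test": df}}.
    """
    rng = np.random.default_rng(seed)
    splits = {}
    for key, part in df.groupby(partition_col):
        # Order rows in time within the partition before splitting.
        part = part.sort_values(timestamp_col)
        # One uniform draw per row decides which split the row goes to.
        draws = rng.random(len(part))
        is_valid = draws < val_size
        is_test = draws >= 1.0 - test_size
        splits[key] = {
            "train": part[~is_valid & ~is_test],
            "valid": part[is_valid],
            "test": part[is_test],
        }
    return splits


# Tiny made-up frame reusing the column names from the call above.
sessions = pd.DataFrame({
    "session_id": [3, 1, 2, 6, 4, 5],
    "day-first": [1, 1, 1, 2, 2, 2],
})
splits = split_by_time_cpu(sessions, "day-first", "session_id")
```

Nothing here needs cudf, cupy or dask_cudf, which is the behaviour expected from cpu=True.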
Environment details
- Transformers4Rec version: 23.12.0
- Platform: Ubuntu 20.04.3 LTS
- Python version: 3.8.10
- Huggingface Transformers version: 4.30.2
- PyTorch version (GPU?): 2.4.1
- Tensorflow version (GPU?): 2.7.0