[BUG] save_time_based_splits function does not support CPU mode well #789

@lmatejka

Bug description

The save_time_based_splits function in data_utils.py does not handle CPU mode correctly. In particular, _save_time_based_splits_cpu assumes that the RAPIDS libraries are available, and the Dask DataFrame import appears to be incorrect.

Steps/Code to reproduce bug

Run the code from the example notebook with the cpu option set to True (https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/getting-started-session-based/01-ETL-with-NVTabular.ipynb):

sessions_gdf = df.read_parquet(BASE_PATH / "processed_nvt/part_0.parquet")
from transformers4rec.utils.data_utils import save_time_based_splits

save_time_based_splits(
    data=nvt.Dataset(sessions_gdf),
    output_dir=BASE_PATH / "session_by_day",
    partition_col="day-first",
    timestamp_col="session_id",
    cpu=True,
)

Expected behavior

No exception is thrown and the data is split.
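
For illustration, here is a minimal sketch of the kind of CPU-only behavior I would expect, written with plain pandas and no RAPIDS dependency. It is not the library's implementation: the helper name split_by_partition, the in-memory return value (instead of writing parquet files), and the random 50/50 valid/test policy are all assumptions made for this example.

```python
import pandas as pd


def split_by_partition(df, partition_col, timestamp_col, seed=0):
    """Hypothetical CPU-only helper: split a pandas DataFrame per partition value.

    Returns {partition_value: {"train": ..., "valid": ..., "test": ...}}.
    Assumes the DataFrame index is unique.
    """
    splits = {}
    for value, part in df.groupby(partition_col, sort=True):
        # Order each partition by the timestamp column.
        part = part.sort_values(timestamp_col)
        # Assumed policy: the whole partition is the training set, and its
        # rows are divided at random into two disjoint evaluation halves.
        valid = part.sample(frac=0.5, random_state=seed)
        test = part.drop(valid.index)
        splits[value] = {"train": part, "valid": valid, "test": test}
    return splits


# Toy data; the column names mirror the call in this report.
sessions = pd.DataFrame(
    {
        "session_id": [1, 2, 3, 4, 5, 6],
        "day-first": [1, 1, 1, 1, 2, 2],
    }
)
result = split_by_partition(sessions, "day-first", "session_id")
```

Only pandas is needed here, so the same call pattern should work on a machine without a GPU.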

Environment details

  • Transformers4Rec version: 23.12.0
  • Platform: Ubuntu 20.04.3 LTS
  • Python version: 3.8.10
  • Huggingface Transformers version: 4.30.2
  • PyTorch version (GPU?): 2.4.1
  • Tensorflow version (GPU?): 2.7.0

Additional context
