
Documentation: train + RayDMatrix vs XGBoostTrainer + Dataset #309

@andreostrovsky

Description

Could you please provide some clarification on the differences between `xgboost_ray.train` + `xgboost_ray.RayDMatrix` and `ray.train.xgboost.XGBoostTrainer` + `ray.data.Dataset`, and on how to choose between them?

My use case is running Ray Tune on Azure Databricks, which runs on Spark. According to the Databricks docs, one creates a Ray cluster using the Ray on Spark API and creates a Ray Dataset from Parquet files.

Below are the questions I would like clarification on. Any help you could provide would be greatly appreciated.

Data

According to the README.md, one can create a `RayDMatrix` from either Parquet files or a Ray Dataset:

xgboost_ray/README.md, lines 450 to 465 at commit e904925:

### Data sources
The following data sources can be used with a `RayDMatrix` object.
| Type | Centralized loading | Distributed loading |
|------------------------------------------------------------------|---------------------|---------------------|
| Numpy array | Yes | No |
| Pandas dataframe | Yes | No |
| Single CSV | Yes | No |
| Multi CSV | Yes | Yes |
| Single Parquet | Yes | No |
| Multi Parquet | Yes | Yes |
| [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html) | Yes | Yes |
| [Petastorm](https://github.com/uber/petastorm) | Yes | Yes |
| [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes | Yes |
| [Modin dataframe](https://modin.readthedocs.io/en/latest/) | Yes | Yes |

So if using `xgboost_ray`, should I:

  • create a Ray Dataset from the Parquet files, then create a `RayDMatrix` from that Dataset, or
  • create the `RayDMatrix` directly from the Parquet files?

Training

Should I use Ray Tune with `XGBoostTrainer` or with `xgboost_ray.train`, running on this Ray on Spark cluster?

I also intend to implement CV with early stopping. Since tune-sklearn is now deprecated, I understand that I'll need to implement this myself. As explained in ray-project/ray#21848 (comment), this can be done with `ray.tune.stopper.TrialPlateauStopper`. But according to #301, one can also use XGBoost's native `xgb.callback.EarlyStopping`. Which approach would you recommend? Can `TrialPlateauStopper` be used with `xgboost_ray`?

Thank you very much for any help you can offer.
