Description
Could you please provide some clarification on the differences between, and how to choose between, using `xgboost_ray.train` + `xgboost_ray.RayDMatrix` versus `ray.train.xgboost.XGBoostTrainer` + `ray.data.Dataset`?
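To make the comparison concrete, here is a rough sketch of what I understand the two approaches to look like. The file path and the `target` label column are placeholders, the parameter values are arbitrary, and the exact import locations may differ across Ray versions:

```python
# Approach 1: xgboost_ray's drop-in replacement for xgb.train
from xgboost_ray import RayDMatrix, RayParams, train as xgbray_train

dtrain = RayDMatrix("data/train.parquet", label="target")  # placeholder path and label
booster = xgbray_train(
    {"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain,
    ray_params=RayParams(num_actors=4, cpus_per_actor=2),
)

# Approach 2: Ray Train's XGBoostTrainer consuming a ray.data.Dataset
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

train_ds = ray.data.read_parquet("data/train.parquet")  # placeholder path
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=4),
    label_column="target",
    params={"objective": "binary:logistic", "eval_metric": "logloss"},
    datasets={"train": train_ds},
)
result = trainer.fit()
```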
My use case is running Ray Tune on Azure Databricks, which operates on Spark. According to the Databricks docs, one creates a Ray Cluster using the Ray on Spark API, and creates a Ray Dataset from Parquet files.
Below are the questions I would like clarification on. Any help you could provide would be greatly appreciated.
Data
According to the README.md, one can create a `RayDMatrix` from either Parquet files or a Ray Dataset:
README.md lines 450 to 465 in e904925:

### Data sources

The following data sources can be used with a `RayDMatrix` object.

| Type | Centralized loading | Distributed loading |
|------------------------------------------------------------------|---------------------|---------------------|
| Numpy array | Yes | No |
| Pandas dataframe | Yes | No |
| Single CSV | Yes | No |
| Multi CSV | Yes | Yes |
| Single Parquet | Yes | No |
| Multi Parquet | Yes | Yes |
| [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html) | Yes | Yes |
| [Petastorm](https://github.com/uber/petastorm) | Yes | Yes |
| [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes | Yes |
| [Modin dataframe](https://modin.readthedocs.io/en/latest/) | Yes | Yes |
So if using `xgboost_ray`, should I (see the sketch after this list):
- Create a Ray `Dataset` from Parquet files, then create a `RayDMatrix` from that `Dataset`, or
- Create the `RayDMatrix` directly from Parquet files?
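Here is roughly what I mean by the two options; the paths and the `target` label column are placeholders:

```python
import ray
from xgboost_ray import RayDMatrix

# Option 1: build a Ray Dataset from Parquet first, then wrap it in a RayDMatrix
train_ds = ray.data.read_parquet("data/train")  # placeholder directory of Parquet files
dtrain_from_ds = RayDMatrix(train_ds, label="target")

# Option 2: point the RayDMatrix at the Parquet files directly
dtrain_from_files = RayDMatrix(
    ["data/train/part-0.parquet", "data/train/part-1.parquet"],  # placeholder paths
    label="target",
)
```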
Training
Should I use Ray Tune with `XGBoostTrainer` or with `xgboost_ray.train`, running on this Ray on Spark cluster?
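For context, this is roughly how I would expect to wire each option into Tune. The search space, metric name, paths, and resource settings are placeholders, and I may well be missing the intended pattern:

```python
import ray
from ray import tune
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer
from xgboost_ray import RayDMatrix, RayParams, train as xgbray_train

# (a) Tune over a function trainable that calls xgboost_ray.train
ray_params = RayParams(num_actors=4, cpus_per_actor=1)

def train_model(config):
    dtrain = RayDMatrix("data/train.parquet", label="target")  # placeholder
    xgbray_train(
        {"objective": "binary:logistic", "eval_metric": "logloss", **config},
        dtrain,
        ray_params=ray_params,
    )

analysis = tune.run(
    train_model,
    config={"max_depth": tune.randint(2, 8)},
    num_samples=8,
    resources_per_trial=ray_params.get_tune_resources(),
)

# (b) Tune over the XGBoostTrainer via the Tuner API
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=4),
    label_column="target",
    params={"objective": "binary:logistic", "eval_metric": "logloss"},
    datasets={"train": ray.data.read_parquet("data/train.parquet")},  # placeholder
)
tuner = tune.Tuner(
    trainer,
    param_space={"params": {"max_depth": tune.randint(2, 8)}},
    tune_config=tune.TuneConfig(num_samples=8, metric="train-logloss", mode="min"),
)
results = tuner.fit()
```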
I also intend to implement CV with early stopping. Since `tune-sklearn` is now deprecated, I understand that I'll need to implement this myself. As explained in ray-project/ray#21848 (comment), this can be done with `ray.tune.stopper.TrialPlateauStopper`. But according to #301 we can also use XGBoost's native `xgb.callback.EarlyStopping`. Which approach would you recommend? Can `TrialPlateauStopper` be used with `xgboost_ray`?
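To show what I'm considering, here are the two variants side by side. The paths, metric name, thresholds, and search space are placeholders, and I'm not sure either is the recommended pattern:

```python
import xgboost as xgb
from ray import tune
from ray.tune.stopper import TrialPlateauStopper
from xgboost_ray import RayDMatrix, RayParams, train as xgbray_train

ray_params = RayParams(num_actors=4, cpus_per_actor=1)

def train_model(config):
    dtrain = RayDMatrix("data/train.parquet", label="target")  # placeholder
    dvalid = RayDMatrix("data/valid.parquet", label="target")  # placeholder
    xgbray_train(
        {"objective": "binary:logistic", "eval_metric": "logloss", **config},
        dtrain,
        evals=[(dvalid, "valid")],
        num_boost_round=500,
        # Variant 1: XGBoost-native early stopping inside each trial
        callbacks=[xgb.callback.EarlyStopping(rounds=10)],
        ray_params=ray_params,
    )

# Variant 2: plateau-based stopping at the Tune level
analysis = tune.run(
    train_model,
    config={"max_depth": tune.randint(2, 8)},
    num_samples=8,
    stop=TrialPlateauStopper(metric="valid-logloss", std=1e-3, num_results=8),
    resources_per_trial=ray_params.get_tune_resources(),
)
```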
Thank you very much for any help you can offer.