
Documentation: train + RayDMatrix vs XGBoostTrainer + Dataset #309

@andreostrovsky

Description

Could you please provide some clarification on the differences between `xgboost_ray.train` + `xgboost_ray.RayDMatrix` and `ray.train.xgboost.XGBoostTrainer` + `ray.data.Dataset`, and on how to choose between them?

My use case is running Ray Tune on Azure Databricks, which runs on Spark. According to the Databricks docs, one creates a Ray cluster using the Ray on Spark API and creates a Ray Dataset from Parquet files.

Below are the questions I would like clarification on. Any help you could provide would be greatly appreciated.

Data

According to the README.md, one can create a `RayDMatrix` from either Parquet files or a Ray Dataset:

xgboost_ray/README.md, lines 450 to 465 at commit e904925:

### Data sources
The following data sources can be used with a `RayDMatrix` object.
| Type | Centralized loading | Distributed loading |
|------------------------------------------------------------------|---------------------|---------------------|
| Numpy array | Yes | No |
| Pandas dataframe | Yes | No |
| Single CSV | Yes | No |
| Multi CSV | Yes | Yes |
| Single Parquet | Yes | No |
| Multi Parquet | Yes | Yes |
| [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html) | Yes | Yes |
| [Petastorm](https://github.com/uber/petastorm) | Yes | Yes |
| [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes | Yes |
| [Modin dataframe](https://modin.readthedocs.io/en/latest/) | Yes | Yes |

So if using `xgboost_ray`, should I:

  • create a Ray Dataset from the Parquet files, then create a `RayDMatrix` from that Dataset, or
  • create the `RayDMatrix` directly from the Parquet files?

Training

Should I use Ray Tune with `XGBoostTrainer` or with `xgboost_ray.train`, running on this Ray on Spark cluster?

I also intend to implement CV with early stopping. Since tune-sklearn is now deprecated, I understand that I'll need to implement this myself. As explained in ray-project/ray#21848 (comment), this can be done with `ray.tune.stopper.TrialPlateauStopper`. But according to #301, one can also use XGBoost's native `xgb.callback.EarlyStopping`. Which approach would you recommend? Can `TrialPlateauStopper` be used with `xgboost_ray`?

Thank you very much for any help you can offer.
