
Commit 4919aa4

[train] add XGBoostTrainer user guide (#52355)
This PR adds a comprehensive guide for distributed XGBoost training with Ray Train. The guide follows the same structure as our other framework-specific guides. The PR also adds a deprecation notice to the older XGBoost/LightGBM notebook and includes the guide in the main Train documentation navigation. In a follow-up PR, we will create a similar guide for LightGBM.

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
1 parent d74d17c commit 4919aa4

4 files changed: +474, -0 lines
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# flake8: noqa
# isort: skip_file

# __xgboost_start__
import pandas as pd
import xgboost

# 1. Load your data as an `xgboost.DMatrix`.
train_df = pd.read_csv("s3://ray-example-data/iris/train/1.csv")
eval_df = pd.read_csv("s3://ray-example-data/iris/val/1.csv")

train_X = train_df.drop("target", axis=1)
train_y = train_df["target"]
eval_X = eval_df.drop("target", axis=1)
eval_y = eval_df["target"]

dtrain = xgboost.DMatrix(train_X, label=train_y)
deval = xgboost.DMatrix(eval_X, label=eval_y)

# 2. Define your xgboost model training parameters.
params = {
    "tree_method": "approx",
    "objective": "reg:squarederror",
    "eta": 1e-4,
    "subsample": 0.5,
    "max_depth": 2,
}

# 3. Do non-distributed training.
bst = xgboost.train(
    params,
    dtrain=dtrain,
    evals=[(deval, "validation")],
    num_boost_round=10,
)
# __xgboost_end__
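
A brief usage sketch (not part of the committed file), assuming the `bst` booster and `deval` DMatrix defined above, to show how the trained model can be used for inference:

# Hypothetical usage: score the validation DMatrix with the trained booster.
preds = bst.predict(deval)
print(preds[:5])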

# __xgboost_ray_start__
import xgboost

import ray.train
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

# 1. Load your data as a Ray Data Dataset.
train_dataset = ray.data.read_csv("s3://anonymous@ray-example-data/iris/train")
eval_dataset = ray.data.read_csv("s3://anonymous@ray-example-data/iris/val")


def train_func():
    # 2. Load your data shard as an `xgboost.DMatrix`.

    # Get dataset shards for this worker
    train_shard = ray.train.get_dataset_shard("train")
    eval_shard = ray.train.get_dataset_shard("eval")

    # Convert shards to pandas DataFrames
    train_df = train_shard.materialize().to_pandas()
    eval_df = eval_shard.materialize().to_pandas()

    train_X = train_df.drop("target", axis=1)
    train_y = train_df["target"]
    eval_X = eval_df.drop("target", axis=1)
    eval_y = eval_df["target"]

    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # 3. Define your xgboost model training parameters.
    params = {
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eta": 1e-4,
        "subsample": 0.5,
        "max_depth": 2,
    }

    # 4. Do distributed data-parallel training.
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for your workers to communicate with each other.
    bst = xgboost.train(
        params,
        dtrain=dtrain,
        evals=[(deval, "validation")],
        num_boost_round=10,
        # Optional: Use the `RayTrainReportCallback` to save and report checkpoints.
        callbacks=[RayTrainReportCallback()],
    )


# 5. Configure scaling and resource requirements.
scaling_config = ray.train.ScalingConfig(num_workers=2, resources_per_worker={"CPU": 2})

# 6. Launch distributed training job.
trainer = XGBoostTrainer(
    train_func,
    scaling_config=scaling_config,
    datasets={"train": train_dataset, "eval": eval_dataset},
    # If running in a multi-node cluster, this is where you
    # should configure the run's persistent storage that is accessible
    # across all worker nodes.
    # run_config=ray.train.RunConfig(storage_path="s3://..."),
)
result = trainer.fit()

# 7. Load the trained model
import os

with result.checkpoint.as_directory() as checkpoint_dir:
    model_path = os.path.join(checkpoint_dir, RayTrainReportCallback.CHECKPOINT_NAME)
    model = xgboost.Booster()
    model.load_model(model_path)
# __xgboost_ray_end__
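
For reference, a minimal inference sketch (not from the commit), assuming the `model` Booster loaded above and the same validation CSV used in the first example:

import pandas as pd

# Hypothetical usage: score the validation split with the checkpointed model.
eval_df = pd.read_csv("s3://ray-example-data/iris/val/1.csv")
deval = xgboost.DMatrix(eval_df.drop("target", axis=1), label=eval_df["target"])
predictions = model.predict(deval)
print(predictions[:5])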

doc/source/train/examples/xgboost/distributed-xgboost-lightgbm.ipynb

Lines changed: 3 additions & 0 deletions
@@ -14,6 +14,9 @@
 "\n",
 "(train-gbdt-guide)=\n",
 "\n",
+"> **Note**: The API shown in this notebook is now deprecated. Please refer to the updated API in [Getting Started with Distributed Training using XGBoost](../../getting-started-xgboost.rst) instead.\n",
+"\n",
+"\n",
 "In this tutorial, you'll discover how to scale out data preprocessing, training, and inference with XGBoost and LightGBM on Ray.\n",
 "\n",
 "To run this tutorial, we need to install the following dependencies:\n",
