[Spark 3.5.0] Rabit Tracker Connection Failure During Distributed XGBoost Training #11380
We have been running tests with Spark 3.5 but haven't observed a similar error yet. The errors come from
Or, can you test in your environment using the latest xgboost4j 3.0 from https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/list.html?prefix=release/ml/dmlc/xgboost4j-spark_2.12/3.0.0/
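For reference, a minimal sketch of pulling that build, assuming an sbt project on Scala 2.12; the repository base path is inferred from the list.html link above, so double-check the coordinates before relying on it:

```scala
// build.sbt -- minimal sketch, assuming an sbt build with Scala 2.12; adjust if
// you manage the jars another way (e.g. Maven or --jars on spark-submit).
ThisBuild / scalaVersion := "2.12.18"

// Release repository behind the list.html link above (base path is an assumption).
resolvers += "XGBoost Release Repo" at
  "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/"

// Resolves to xgboost4j-spark_2.12:3.0.0 given the Scala version above.
libraryDependencies += "ml.dmlc" %% "xgboost4j-spark" % "3.0.0"
```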
While debugging a Spark-XGBoost pipeline, I encountered a version-specific warning that never appeared in previous environments. Warning log:
Resolution: Disabling dynamic resource allocation resolved this conflict:
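For concreteness, a minimal sketch of the setting I used (the executor count is illustrative, not my actual value; the same configs can be passed as --conf flags to spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Disable dynamic executor allocation so the executor set stays fixed for the
// lifetime of the distributed XGBoost training stage.
val spark = SparkSession.builder()
  .appName("xgboost-training")
  .config("spark.dynamicAllocation.enabled", "false")
  // With dynamic allocation off, pin the executor count explicitly
  // (value is illustrative; size it for your cluster).
  .config("spark.executor.instances", "8")
  .getOrCreate()
```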
Subsequently, a stricter data validation error emerged during training. Error log:
Resolution: Bypassing the validation check enabled successful training. By the way, I can confirm that my training set uses dense vectors (DenseVector) and contains no NaN values, yet allow_non_zero_for_missing must still be set.
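A minimal sketch of how I pass that workaround, assuming the classic xgboost4j-spark 2.1.x constructor that accepts a parameter map; the objective, num_round, num_workers, and column names below are illustrative placeholders rather than my real job settings:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Parameter map sketch; only allow_non_zero_for_missing is the point here,
// the other entries are placeholders.
val xgbParams: Map[String, Any] = Map(
  "objective"                  -> "binary:logistic",
  "num_round"                  -> 100,
  "num_workers"                -> 8,
  // Workaround: relax the sparse-missing-value check, even though the
  // training data uses DenseVector features with no NaN values.
  "allow_non_zero_for_missing" -> true
)

val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
```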
@wbo4958 could you please help take a look when you are available?
Hi @lcx517, it looks like you're not using the latest XGBoost, since allow_non_zero_for_missing has been completely removed.
Understood. I'm using the highest available version, 2.1.4, under Java 1.8. Since my cluster environment cannot support JRE 54+, I'm unable to validate the allow_non_zero_for_missing parameter in version 3.x.
Looks like XGBoost 3.x can still run with Java 8.
I'm sorry, I remembered incorrectly. It's other third-party libraries in version 3.x that are based on Java 11.
"ERROR DataBatch: java.lang.IllegalArgumentException: requirement failed: indices and values must have the same number of elements", looks like something wrong about converting the data from spark into xgboost databatch. Could you help share the dataset you are using or a synthetic dataset so we can triage it? Thx very much @lcx517 |
Thank you very much |
Environment Details
Background
Our pipeline ran successfully with Spark 3.1.1 + XGBoost 1.1.1 in production. After upgrading to Spark 3.5.0, we tested multiple XGBoost versions (2.1.0-2.1.4) and consistently encountered the same Rabit tracker connection error during distributed training.
Error Description
Failure occurs when initializing distributed training:
The full stack trace shows that the error originates from RabitTracker.stop() after the connection is rejected.
Reproduction Steps
spark-submit --master yarn --deploy-mode cluster ...
Attempted Fixes
✅ Verified compatibility between Spark 3.5.0 and XGBoost 2.1.x
✅ Tested all minor versions of XGBoost 2.1.x series
❌ Adjusting tracker ports (tracker_conf) had no effect
❌ Increasing timeout (timeout parameter) failed
Key Questions
This template focuses on critical version conflicts and provides actionable context for maintainers.
Attaching the log: