Hi,

First of all, I'd like to thank you all for developing such a useful package. I am running into memory issues while training a regression model with xgboost4j-spark on Databricks. I'm using Databricks Runtime 5.5 ML, which ships xgboost4j-spark 0.90, Spark 2.4.3, and Scala 2.11. In short, I read the features and responses from a file in LIBSVM format, and training crashes because it uses too much memory on my cluster. I contacted Databricks support, but they couldn't figure it out either, so I'm asking here in case you have any idea how to solve this issue.
Our feature matrix is wide (more than 70,000 columns) with about 390K rows, yet in sparse LIBSVM format it takes up only about 3.5 GB on disk. For context, we've been training with the native R xgboost package for a while; with the same feature matrix and hyperparameters, memory usage hovers around 30 GB during training, and training finishes in a reasonable amount of time. With xgboost4j-spark, however, total cluster memory usage shoots up to about 400 GB (the cluster scales up to 8 worker nodes) right before it crashes.
I have some example Scala code below that reproduces the memory errors that we've been seeing. If it helps you debug, I am happy to send you some randomized data as well.
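A minimal sketch of what we run looks like this (the paths, hyperparameter values, and number of rounds here are placeholders, not our exact settings):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// The LIBSVM reader produces a "label" column and a sparse "features" column.
val train = spark.read.format("libsvm").load("/dbfs/path/to/train.libsvm")

// Placeholder hyperparameters; our real values differ.
val params = Map(
  "objective" -> "reg:squarederror",
  "max_depth" -> 6,
  "eta" -> 0.1,
  "num_round" -> 100,
  "num_workers" -> 8
)

val regressor = new XGBoostRegressor(params)
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = regressor.fit(train)
```

Error messages are below: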
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:120)
at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-5894:1)
at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-5894:52)
at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw.<init>(command-5894:54)
at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw.<init>(command-5894:56)
at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw.<init>(command-5894:58)
at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw.<init>(command-5894:60)
at line2dd35d331b27446594a0cf2e67789c5238.$read.<init>(command-5894:62)
at line2dd35d331b27446594a0cf2e67789c5238.$read$.<init>(command-5894:66)
at line2dd35d331b27446594a0cf2e67789c5238.$read$.<clinit>(command-5894)
at line2dd35d331b27446594a0cf2e67789c5238.$eval$.$print$lzycompute(<notebook>:7)
at line2dd35d331b27446594a0cf2e67789c5238.$eval$.$print(<notebook>:6)
at line2dd35d331b27446594a0cf2e67789c5238.$eval.$print(<notebook>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:197)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:679)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:632)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:197)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:368)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:345)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:345)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
We saw a seemingly related issue (#5040), and we did try repartitioning our training and validation sets to match the number of workers (as sketched below), but that didn't help. It's puzzling to us that, with the same training data and hyperparameters, xgboost4j-spark seems to use far more memory than the native R package. Any help you can provide would be very much appreciated!
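For reference, the repartitioning attempt looked roughly like this (paths are placeholders; numWorkers matches the 8 workers in our cluster):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val numWorkers = 8 // same as the number of worker nodes

// Both sets repartitioned to one partition per worker before training.
val train = spark.read.format("libsvm").load("/dbfs/path/to/train.libsvm").repartition(numWorkers)
val valid = spark.read.format("libsvm").load("/dbfs/path/to/valid.libsvm").repartition(numWorkers)
```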