
[jvm-packages] Training crashes with a very wide feature matrix input read from libsvm #5041

Open
kuanghan opened this issue Nov 15, 2019 · 0 comments


kuanghan commented Nov 15, 2019

Hi,

First of all, I'd like to thank you all for developing such a useful package. I am running into memory issues while training a regression model with xgboost4j-spark on Databricks. I'm using Databricks Runtime 5.5 ML, which ships xgboost4j-spark 0.90, Spark 2.4.3, and Scala 2.11. In short: I read the features and labels from a file in LIBSVM format, and training crashes because it uses far more memory than my cluster has. I contacted Databricks support, but they couldn't figure it out either, so I'm asking here in case you have an idea of how to solve this.

Our feature matrix is very wide (more than 70,000 columns) with about 390K rows, yet in sparse LIBSVM format it takes up only about 3.5 GB on disk. We've been training with the native R xgboost package for a while, and with the same feature matrix and hyperparameters, memory usage hovers around 30 GB during training, which finishes in a reasonable amount of time. With xgboost4j-spark, however, total cluster memory usage shoots up to about 400 GB (the cluster scales up to 8 worker nodes) right before it crashes.
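
For context, here is my back-of-the-envelope math, under the assumption (and it is only my guess about where the memory goes) that the sparse vectors get densified into 4-byte floats somewhere in the pipeline:

// Rough size of a single fully dense copy of our matrix, assuming
// one 4-byte float per cell (an assumption, not a measurement).
val rows = 390000L
val cols = 70000L
val denseGB = rows * cols * 4L / 1e9
println(f"dense footprint: $denseGB%.0f GB") // ~109 GB per copy

A few such copies spread across the cluster would be enough to reach the ~400 GB we observe, whereas the sparse representation is only a few GB.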

I have some example Scala code below that reproduces the memory errors that we've been seeing. If it helps you debug, I am happy to send you some randomized data as well.

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

// Read the sparse training and validation sets (LIBSVM format),
// repartitioned to match num_workers below.
val train = spark.read.format("libsvm")
  .load("dbfs:/tmp/train_randomized.libsvm")
  .select("features", "label")
  .repartition(48)

val valid = spark.read.format("libsvm")
  .load("dbfs:/tmp/valid_randomized.libsvm")
  .select("features", "label")
  .repartition(48)

val xgbParam = Map("eta" -> 0.0314,
                   "missing" -> 0,
                   "max_depth" -> 16,
                   "min_child_weight" -> 8.532,
                   "subsample" -> 0.667,
                   "colsample_bytree" -> 0.855,
                   "lambda" -> 0,
                   "alpha" -> 0,
                   "scale_pos_weight" -> 1,
                   "num_workers" -> 48,
                   "num_round" -> 5,
                   "eval_metric" -> "mae",
                   "objective" -> "reg:tweedie",
                   "num_early_stopping_rounds" -> 25,
                   "maximize_evaluation_metrics" -> false,
                   "use_external_memory" -> true,
                   "eval_sets" -> Map("eval_set" -> valid))

val xgbRegressor = new XGBoostRegressor(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("label")

// Training crashes here with the stack trace below.
val xgbRegressionModel = xgbRegressor.fit(train)

The error messages (stack trace) are below:

at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:48)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:120)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-5894:1)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-5894:52)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw.<init>(command-5894:54)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw.<init>(command-5894:56)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw.<init>(command-5894:58)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw.<init>(command-5894:60)
	at line2dd35d331b27446594a0cf2e67789c5238.$read.<init>(command-5894:62)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$.<init>(command-5894:66)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$.<clinit>(command-5894)
	at line2dd35d331b27446594a0cf2e67789c5238.$eval$.$print$lzycompute(<notebook>:7)
	at line2dd35d331b27446594a0cf2e67789c5238.$eval$.$print(<notebook>:6)
	at line2dd35d331b27446594a0cf2e67789c5238.$eval.$print(<notebook>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
	at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
	at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:679)
	at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:632)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:368)
	at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:345)
	at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48)
	at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48)
	at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:345)
	at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
	at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
	at scala.util.Try$.apply(Try.scala:192)
	at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
	at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
	at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
	at java.lang.Thread.run(Thread.java:748)

We saw the seemingly related issue #5040, and we did try repartitioning our training and validation sets to match the number of workers, but that didn't help. It's puzzling to us that, given the same training data and hyperparameters, xgboost4j-spark seems to use far more memory than the native R package. Any help you can provide would be very much appreciated!
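
In case dimensionality inference matters here, the read can also pin the feature count up front (numFeatures is a documented option of Spark's LIBSVM data source; 70000 below stands in for our real column count), which at least avoids the extra pass Spark makes over the file to infer the width:

// Same read as above, but with the width pinned so Spark skips the
// inference pass over the file (70000 is a stand-in for our real width).
val train = spark.read.format("libsvm")
  .option("numFeatures", "70000")
  .load("dbfs:/tmp/train_randomized.libsvm")
  .select("features", "label")
  .repartition(48)

I'm happy to rerun with this or any other variant you suggest.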
