
[jvm-packages] Training crashes with a very wide feature matrix input read from libsvm #5041

Open
kuanghan opened this issue Nov 15, 2019 · 0 comments


kuanghan commented Nov 15, 2019

Hi,

First of all, I'd like to thank you all for developing such a useful package. I am running into memory issues while training a regression model with xgboost4j-spark on Databricks. I'm using Databricks Runtime 5.5 ML, which ships xgboost4j-spark 0.90, Spark 2.4.3, and Scala 2.11. In short: I read the features and labels from a file in LIBSVM format, and training crashes because it uses far more memory than my cluster has. I contacted Databricks support, but they couldn't figure it out either, so I'm asking here in case you have an idea of how to solve this.

Our feature matrix is very wide (more than 70,000 columns) with about 390K rows, yet in sparse LIBSVM format it takes up only about 3.5 GB on disk. We've been training with the native R xgboost package for a while, and with the same feature matrix and hyperparameters, memory usage hovers around 30 GB during training, which finishes in a reasonable amount of time. With xgboost4j-spark, however, total cluster memory usage shoots up to about 400 GB (the cluster scales up to 8 worker nodes) right before it crashes.
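
For context, here is my back-of-the-envelope math, under the assumption (and it is only my guess about where the memory goes) that the sparse vectors get densified into 4-byte floats somewhere in the pipeline:

// Rough size of a single fully dense copy of our matrix, assuming
// one 4-byte float per cell (an assumption, not a measurement).
val rows = 390000L
val cols = 70000L
val denseGB = rows * cols * 4L / 1e9
println(f"dense footprint: $denseGB%.0f GB") // ~109 GB per copy

A few such copies spread across the cluster would be enough to reach the ~400 GB we observe, whereas the sparse representation is only a few GB.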

I have some example Scala code below that reproduces the memory errors that we've been seeing. If it helps you debug, I am happy to send you some randomized data as well.

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

// Read the sparse training and validation sets (LIBSVM format),
// repartitioned to match num_workers below.
val train = spark.read.format("libsvm")
  .load("dbfs:/tmp/train_randomized.libsvm")
  .select("features", "label")
  .repartition(48)

val valid = spark.read.format("libsvm")
  .load("dbfs:/tmp/valid_randomized.libsvm")
  .select("features", "label")
  .repartition(48)

val xgbParam = Map("eta" -> 0.0314,
                   "missing" -> 0,
                   "max_depth" -> 16,
                   "min_child_weight" -> 8.532,
                   "subsample" -> 0.667,
                   "colsample_bytree" -> 0.855,
                   "lambda" -> 0,
                   "alpha" -> 0,
                   "scale_pos_weight" -> 1,
                   "num_workers" -> 48,
                   "num_round" -> 5,
                   "eval_metric" -> "mae",
                   "objective" -> "reg:tweedie",
                   "num_early_stopping_rounds" -> 25,
                   "maximize_evaluation_metrics" -> false,
                   "use_external_memory" -> true,
                   "eval_sets" -> Map("eval_set" -> valid))

val xgbRegressor = new XGBoostRegressor(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("label")

// Training crashes here with the stack trace below.
val xgbRegressionModel = xgbRegressor.fit(train)

The error messages (stack trace) are below:

at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:48)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:120)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-5894:1)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-5894:52)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw$$iw.<init>(command-5894:54)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw$$iw.<init>(command-5894:56)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw$$iw.<init>(command-5894:58)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$$iw.<init>(command-5894:60)
	at line2dd35d331b27446594a0cf2e67789c5238.$read.<init>(command-5894:62)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$.<init>(command-5894:66)
	at line2dd35d331b27446594a0cf2e67789c5238.$read$.<clinit>(command-5894)
	at line2dd35d331b27446594a0cf2e67789c5238.$eval$.$print$lzycompute(<notebook>:7)
	at line2dd35d331b27446594a0cf2e67789c5238.$eval$.$print(<notebook>:6)
	at line2dd35d331b27446594a0cf2e67789c5238.$eval.$print(<notebook>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
	at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
	at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:679)
	at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:632)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:197)
	at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:368)
	at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:345)
	at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48)
	at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48)
	at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:345)
	at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
	at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
	at scala.util.Try$.apply(Try.scala:192)
	at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
	at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
	at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
	at java.lang.Thread.run(Thread.java:748)

We saw the seemingly related issue #5040, and we did try repartitioning our training and validation sets to match the number of workers, but that didn't help. It's puzzling to us that, given the same training data and hyperparameters, xgboost4j-spark seems to use far more memory than the native R package. Any help you can provide would be very much appreciated!
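
In case dimensionality inference matters here, the read can also pin the feature count up front (numFeatures is a documented option of Spark's LIBSVM data source; 70000 below stands in for our real column count), which at least avoids the extra pass Spark makes over the file to infer the width:

// Same read as above, but with the width pinned so Spark skips the
// inference pass over the file (70000 is a stand-in for our real width).
val train = spark.read.format("libsvm")
  .option("numFeatures", "70000")
  .load("dbfs:/tmp/train_randomized.libsvm")
  .select("features", "label")
  .repartition(48)

I'm happy to rerun with this or any other variant you suggest.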
