@Nikhilpa1, you passed: |
I'm trying to run a Spark job on GPU with the RAPIDS Shuffle Manager, using:
UCX 1.14
Spark 3.2.0
rapids-4-spark_2.12-22.10.0
The job is submitted with the following configuration:
$SPARK_HOME/bin/spark-submit \
--master spark://${MASTER_HOST}:7077 \
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark320.RapidsShuffleManager \
--conf spark.rapids.sql.concurrentGpuTasks=1 \
--conf spark.driver.memory=64G \
--conf spark.executor.memory=64G \
--conf spark.executor.cores=1 \
--conf spark.executor.instances=1 \
--conf spark.task.cpus=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.25 \
--conf spark.rapids.memory.pinnedPool.size=1G \
--conf spark.sql.files.maxPartitionBytes=128m \
--conf spark.rapids.shuffle.mode=UCX \
--conf spark.shuffle.service.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executorEnv.LD_LIBRARY_PATH=/home/kanaka.3/others/ucx/ucx-1.14-ins/lib:/home/kanaka.3/others/knem/knem-1.1.4-ins/lib \
--conf spark.driverEnv.LD_LIBRARY_PATH=/home/kanaka.3/others/ucx/ucx-1.14-ins/lib::/home/kanaka.3/others/knem/knem-1.1.4-ins/lib \
--conf spark.executorEnv.UCX_ERROR_SIGNALS= \
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
--conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
--conf spark.executorEnv.UCX_TLS=rc_x,cuda_copy,cuda_ipc \
--conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
${SPARK_JOBS}/myjob.py
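The two LD_LIBRARY_PATH values in the command above are not identical: the driver one contains `::`, which yields an empty entry (the dynamic linker treats an empty entry as the current working directory). This may well be unrelated to the error, but it is cheap to rule out. Below is a minimal Python sketch for sanity-checking such a value on a given host; the helper name `check_library_path` is hypothetical and not part of Spark, RAPIDS, or UCX.

```python
import os


def check_library_path(value):
    """Inspect a colon-separated library path.

    Returns a tuple (empty_count, missing_dirs):
      empty_count  - number of empty entries (e.g. produced by '::')
      missing_dirs - non-empty entries that are not directories on this host
    """
    entries = value.split(":")
    empty_count = sum(1 for entry in entries if entry == "")
    missing_dirs = [entry for entry in entries
                    if entry and not os.path.isdir(entry)]
    return empty_count, missing_dirs


if __name__ == "__main__":
    # Driver-side value copied from the spark-submit command above.
    driver_path = ("/home/kanaka.3/others/ucx/ucx-1.14-ins/lib::"
                   "/home/kanaka.3/others/knem/knem-1.1.4-ins/lib")
    empty, missing = check_library_path(driver_path)
    print("empty entries:", empty)
    print("directories missing on this host:", missing)
```

Running it on each node with the value that node actually receives shows whether both the UCX and knem library directories exist there.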
The executor log shows the following connection error:
0/stderr:75:23/04/17 12:06:13 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@100838be, peerExecutorId=1) started
0/stderr:76:23/04/17 12:06:13 ERROR UCX: UcpListener detected an error for executorId 1: UCXError(-6,Destination is unreachable)
0/stderr:77:23/04/17 12:06:13 WARN UCX: Removing endpoint UcpEndpoint(id=47937671962800, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,socketAddress=/10.1.1.3:4195,) for 1
0/stderr:78:23/04/17 12:06:13 WARN UCX: Removed stale client connection for 1
0/stderr:79:23/04/17 12:06:13 ERROR UCX: Error while closing ep. Ignoring.
0/stderr:80:org.openucx.jucx.UcxException: Destination is unreachable
0/stderr:81: at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
0/stderr:82: at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:376)
0/stderr:83: at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:900)
0/stderr:84: at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:184)
0/stderr:85: at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:178)
0/stderr:86: at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
0/stderr:87: at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
0/stderr:88: at com.nvidia.spark.rapids.shuffle.ucx.UCX.withResource(UCX.scala:69)
0/stderr:89: at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:178)
0/stderr:90: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
0/stderr:91: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
0/stderr:92: at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:345)
0/stderr:93: at java.lang.Thread.run(Thread.java:748)
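To narrow down where "Destination is unreachable" comes from, it can help to pull the peer address out of the log and test it from the other executor's host. The following is a minimal Python sketch under stated assumptions: the regular expression is written against the `socketAddress=/host:port` form seen in the log above, and `extract_peers` and `tcp_reachable` are hypothetical helper names. A successful plain TCP connect only shows that the address and port are routable and listening; it does not prove that UCX can establish an endpoint with the configured transports.

```python
import re
import socket

# Matches the 'socketAddress=/10.1.1.3:4195' form seen in the executor log.
PEER_RE = re.compile(r"socketAddress=/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def extract_peers(log_text):
    """Return the unique (host, port) pairs found in log_text, in order."""
    peers = []
    for host, port in PEER_RE.findall(log_text):
        peer = (host, int(port))
        if peer not in peers:
            peers.append(peer)
    return peers


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage, run on the host of the executor that logged the error
# while the job is still running (the listener port changes per run):
#
#     with open("stderr") as f:
#         for host, port in extract_peers(f.read()):
#             print(host, port, tcp_reachable(host, port))
```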
I have already validated UCX communication between the nodes with ucx_perftest:
UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,cuda_copy,cuda_ipc ./ucx_perftest -t tag_bw -m cuda -n 4000 -w 500 -c 0 -s 134217728 -p 13381