Problems with parallel training on multiple GPUs #4693

HRHTer · 2025-04-01T17:33:44Z

I have used two GPUs for parallel training. However, utilization on both GPUs is below 50 percent, video memory usage is only a few hundred MiB, and each step takes more than 10 seconds. When running on a single card, GPU usage can exceed 70%, video memory usage can exceed 20,000 MiB, and each step takes only 8 seconds. The log below was produced by "CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train --mpi-log=workers input.json":

2025-04-02 00:13:18.826635: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-04-02 00:13:18.863257: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-04-02 00:13:18.863282: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-02 00:13:18.863310: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-02 00:13:18.870034: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-04-02 00:13:19.668900: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1,0]:2025-04-02 00:13:21.224831: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
[1,1]:2025-04-02 00:13:21.224831: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
[1,0]:2025-04-02 00:13:21.260859: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[1,0]:2025-04-02 00:13:21.260883: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[1,1]:2025-04-02 00:13:21.260858: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[1,1]:2025-04-02 00:13:21.260883: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[1,0]:2025-04-02 00:13:21.260914: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1,1]:2025-04-02 00:13:21.260914: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1,0]:2025-04-02 00:13:21.268204: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[1,0]:To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2025-04-02 00:13:21.268202: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[1,1]:To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2025-04-02 00:13:21.985333: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1,1]:2025-04-02 00:13:21.985417: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1,1]:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[1,0]:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[1,0]:[2025-04-02 00:13:26,369] DEEPMD rank:0 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,0]:[2025-04-02 00:13:26,458] DEEPMD rank:0 INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[1,1]:[2025-04-02 00:13:26,498] DEEPMD rank:1 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:[2025-04-02 00:13:26,589] DEEPMD rank:1 INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[1,0]:2025-04-02 00:13:26.595129: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
[1,0]:[2025-04-02 00:13:26,716] DEEPMD rank:0 INFO Adjust batch size from 1024 to 2048
[1,0]:[2025-04-02 00:13:26,727] DEEPMD rank:0 INFO Adjust batch size from 2048 to 4096
[1,1]:2025-04-02 00:13:26.740940: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
[1,0]:[2025-04-02 00:13:26,751] DEEPMD rank:0 INFO Adjust batch size from 4096 to 8192
[1,0]:[2025-04-02 00:13:26,791] DEEPMD rank:0 INFO Adjust batch size from 8192 to 16384
[1,0]:[2025-04-02 00:13:26,984] DEEPMD rank:0 INFO Adjust batch size from 16384 to 32768
[1,1]:[2025-04-02 00:13:27,063] DEEPMD rank:1 INFO Adjust batch size from 1024 to 2048
[1,1]:[2025-04-02 00:13:27,074] DEEPMD rank:1 INFO Adjust batch size from 2048 to 4096
[1,1]:[2025-04-02 00:13:27,262] DEEPMD rank:1 INFO Adjust batch size from 4096 to 8192
[1,1]:[2025-04-02 00:13:27,560] DEEPMD rank:1 INFO Adjust batch size from 8192 to 16384
[1,0]:[2025-04-02 00:13:27,561] DEEPMD rank:0 INFO Adjust batch size from 32768 to 65536
[1,1]:[2025-04-02 00:13:27,771] DEEPMD rank:1 INFO Adjust batch size from 16384 to 32768
[1,0]:[2025-04-02 00:13:28,765] DEEPMD rank:0 INFO Adjust batch size from 65536 to 131072
[1,1]:[2025-04-02 00:13:28,765] DEEPMD rank:1 INFO Adjust batch size from 32768 to 65536
[1,1]:[2025-04-02 00:13:29,967] DEEPMD rank:1 INFO Adjust batch size from 65536 to 131072
[1,0]:[2025-04-02 00:13:31,172] DEEPMD rank:0 INFO Adjust batch size from 131072 to 262144
[1,1]:[2025-04-02 00:13:32,668] DEEPMD rank:1 INFO Adjust batch size from 131072 to 262144
[1,0]:[2025-04-02 00:13:35,272] DEEPMD rank:0 INFO training data with min nbor dist: 1.3186613041262718
[1,0]:[2025-04-02 00:13:35,273] DEEPMD rank:0 INFO training data with max nbor size: [59 60]
[1,1]:[2025-04-02 00:13:36,181] DEEPMD rank:1 INFO training data with min nbor dist: 1.3186613041262718
[1,1]:[2025-04-02 00:13:36,182] DEEPMD rank:1 INFO training data with max nbor size: [59 60]
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO  _____               _____   __  __  _____           _     _  _
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO  _____               _____   __  __  _____           _     _  _
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO Please read and cite:
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO Please read and cite:
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO See https://deepmd.rtfd.io/credits/ for details.
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO -----------------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO installed to: /root/miniconda3/envs/deepmd/lib/python3.11/site-packages/deepmd
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO See https://deepmd.rtfd.io/credits/ for details.
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO source:
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO -----------------------------------------------------------------------------------------------
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO source branch:
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO source commit:
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO installed to: /root/miniconda3/envs/deepmd/lib/python3.11/site-packages/deepmd
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO source commit at:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO use float prec: double
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build variant: cuda
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source branch:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO Backend: TensorFlow
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source commit:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO TF ver: v2.14.0-10-g99d80a9e254
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source commit at:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build with TF ver: 2.14.1
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO use float prec: double
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build with TF inc: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build variant: cuda
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO Backend: TensorFlow
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build with TF lib: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO TF ver: v2.14.0-10-g99d80a9e254
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO world size: 2
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build with TF ver: 2.14.1
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO node list: autodl-container-85bd408359-2d489d0e
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build with TF inc: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO running on: autodl-container-85bd408359-2d489d0e
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO computing device: gpu:1
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build with TF lib: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO CUDA_VISIBLE_DEVICES: 0,1
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO Count of visible GPUs: 2
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO world size: 2
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO num_intra_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO node list: autodl-container-85bd408359-2d489d0e
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO num_inter_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO running on: autodl-container-85bd408359-2d489d0e
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO -----------------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO computing device: gpu:0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO CUDA_VISIBLE_DEVICES: 0,1
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO Count of visible GPUs: 2
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO num_intra_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO num_inter_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO -----------------------------------------------------------------------------------------------
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO ---Summary of DataSystem: training -----------------------------------------------
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO ---Summary of DataSystem: training -----------------------------------------------
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO found 1 system(s):
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO found 1 system(s):
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO ../00.data/training_data 72 1 7200 1.000e+00 T
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO ../00.data/training_data 72 1 7200 1.000e+00 T
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO ---Summary of DataSystem: validation -----------------------------------------------
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO ---Summary of DataSystem: validation -----------------------------------------------
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO found 1 system(s):
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO found 1 system(s):
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO ../00.data/validation_data 72 1 1800 1.000e+00 T
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO ../00.data/validation_data 72 1 1800 1.000e+00 T
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO training without frame parameter
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO training without frame parameter
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO data stating... (this step may take long time)
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO data stating... (this step may take long time)
[1,1]:[2025-04-02 00:13:36,349] DEEPMD rank:1 INFO built lr
[1,0]:[2025-04-02 00:13:36,357] DEEPMD rank:0 INFO built lr
[1,0]:[2025-04-02 00:13:37,086] DEEPMD rank:0 INFO built network
[1,0]:[2025-04-02 00:13:37,086] DEEPMD rank:0 INFO Scale learning rate by coef: 2.000000
[1,1]:[2025-04-02 00:13:37,093] DEEPMD rank:1 INFO built network
[1,1]:[2025-04-02 00:13:37,093] DEEPMD rank:1 INFO Scale learning rate by coef: 2.000000
[1,0]:[2025-04-02 00:13:38,400] DEEPMD rank:0 INFO built training
[1,0]:2025-04-02 00:13:38.570861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:52:00.0, compute capability: 8.0
[1,1]:[2025-04-02 00:13:38,600] DEEPMD rank:1 INFO built training
[1,0]:[2025-04-02 00:13:38,617] DEEPMD rank:0 INFO initialize model from scratch
[1,1]:2025-04-02 00:13:38.800087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:d1:00.0, compute capability: 8.0
[1,0]:[2025-04-02 00:13:38,947] DEEPMD rank:0 INFO broadcast global variables to other tasks
[1,1]:[2025-04-02 00:13:39,122] DEEPMD rank:1 INFO receive global variables from task#0
[1,1]:[2025-04-02 00:13:39,816] DEEPMD rank:1 INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[1,0]:[2025-04-02 00:13:39,852] DEEPMD rank:0 INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[1,0]:[2025-04-02 00:13:40,551] DEEPMD rank:0 INFO batch 0: trn: rmse = 3.77e+01, rmse_e = 5.07e-01, rmse_f = 1.19e+00, lr = 1.00e-03
[1,0]:[2025-04-02 00:13:40,552] DEEPMD rank:0 INFO batch 0: val: rmse = 3.42e+01, rmse_e = 4.89e-01, rmse_f = 1.08e+00
[1,1]:2025-04-02 00:13:40.916067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:d1:00.0, compute capability: 8.0
[1,0]:2025-04-02 00:13:41.548210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:52:00.0, compute capability: 8.0
[1,1]:[2025-04-02 00:13:52,929] DEEPMD rank:1 INFO batch 1000: total wall time = 13.11 s
[1,0]:[2025-04-02 00:13:52,936] DEEPMD rank:0 INFO batch 1000: trn: rmse = 3.55e+00, rmse_e = 4.55e-02, rmse_f = 1.12e-01, lr = 1.00e-03
[1,0]:[2025-04-02 00:13:52,936] DEEPMD rank:0 INFO batch 1000: val: rmse = 1.72e+00, rmse_e = 4.61e-02, rmse_f = 5.45e-02
[1,0]:[2025-04-02 00:13:52,936] DEEPMD rank:0 INFO batch 1000: total wall time = 13.08 s
[1,1]:[2025-04-02 00:14:04,382] DEEPMD rank:1 INFO batch 2000: total wall time = 11.45 s
[1,0]:[2025-04-02 00:14:04,389] DEEPMD rank:0 INFO batch 2000: trn: rmse = 1.98e+00, rmse_e = 7.28e-02, rmse_f = 6.25e-02, lr = 1.00e-03
[1,0]:[2025-04-02 00:14:04,389] DEEPMD rank:0 INFO batch 2000: val: rmse = 1.84e+00, rmse_e = 7.44e-02, rmse_f = 5.80e-02
[1,0]:[2025-04-02 00:14:04,389] DEEPMD rank:0 INFO batch 2000: total wall time = 11.45 s
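
To put the wall times in the log in perspective, here is a rough throughput comparison. It is only a sketch under two assumptions that the post does not confirm: that the single-card figure of 8 seconds refers to the same 1000-batch reporting interval as the "total wall time" lines, and that each Horovod worker processes its own batch at every step (which would be consistent with "Scale learning rate by coef: 2.000000" in the log). The function name `frames_per_second` is made up for illustration.

```python
def frames_per_second(workers: int, batch_per_worker: int,
                      steps: int, wall_time_s: float) -> float:
    """Frames processed per second over one reporting interval.

    Assumes every worker consumes its own batch at each step
    (data-parallel training); this is an assumption, not something
    stated in the post.
    """
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return workers * batch_per_worker * steps / wall_time_s


# Numbers taken from the post: bch_sz = 1, 1000 batches per interval,
# 8 s on one card (assumed to be per interval) and 11.45 s on two cards.
single_gpu = frames_per_second(workers=1, batch_per_worker=1,
                               steps=1000, wall_time_s=8.0)
two_gpus = frames_per_second(workers=2, batch_per_worker=1,
                             steps=1000, wall_time_s=11.45)

print(f"single GPU: {single_gpu:.1f} frames/s")
print(f"two GPUs:   {two_gpus:.1f} frames/s")
print(f"ratio:      {two_gpus / single_gpu:.2f}x")
```

Under those assumptions the two-GPU run processes roughly 175 frames/s against 125 frames/s on one card, about a 1.4x gain in frames per second even though each interval takes longer, and still well short of 2x.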

njzjz (Maintainer) · 2025-06-05T17:41:53Z

[1,1]:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[1,0]:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.

Since you didn't set the thread counts, a possible reason is that the two processes are competing for CPU resources.

See #1284 (comment)
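
A minimal sketch of how the thread counts might be pinned per process. The three environment variable names come from the warning quoted above; the even split of cores across ranks, the default of 2 inter-op threads, and the helper names `thread_env` and `launch_line` are assumptions for illustration, not a recommendation taken from DeePMD-kit. It also assumes that horovodrun forwards the caller's environment to the workers.

```python
import os


def thread_env(total_cores: int, n_ranks: int, inter_op: int = 2) -> dict[str, str]:
    """Split the available CPU cores evenly across ranks (hypothetical heuristic)."""
    if total_cores < 1 or n_ranks < 1:
        raise ValueError("total_cores and n_ranks must be at least 1")
    per_rank = max(1, total_cores // n_ranks)
    return {
        "OMP_NUM_THREADS": str(per_rank),
        "DP_INTRA_OP_PARALLELISM_THREADS": str(per_rank),
        # Never ask for more inter-op threads than cores given to the rank.
        "DP_INTER_OP_PARALLELISM_THREADS": str(min(inter_op, per_rank)),
    }


def launch_line(env: dict[str, str], n_ranks: int, gpus: str = "0,1") -> str:
    """Build a shell command line that sets the variables for this one launch."""
    exports = " ".join(f"{key}={value}" for key, value in env.items())
    return (
        f"CUDA_VISIBLE_DEVICES={gpus} {exports} "
        f"horovodrun -np {n_ranks} dp train --mpi-log=workers input.json"
    )


# Print a suggested command for two ranks on the current machine.
cores = os.cpu_count() or 1
print(launch_line(thread_env(cores, n_ranks=2), n_ranks=2))
```

After relaunching, num_intra_threads and num_inter_threads in the startup summary should no longer read 0 if the variables were picked up.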
