Replies: 1 comment
Since you didn't control the number of threads, a likely reason is that the two processes are competing for CPU resources. See #1284 (comment).
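A minimal sketch of what controlling the threads could look like, using the three environment variables named in the DeePMD-kit startup message in the question's log (OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, DP_INTER_OP_PARALLELISM_THREADS). The helper `threads_per_worker`, the even split of cores between workers, and the value 1 for the inter-op setting are assumptions chosen as a starting point, not values recommended in this thread. It also assumes the exported variables are visible to the workers that horovodrun starts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: divide the available cores evenly between workers,
# never returning less than 1 thread.
threads_per_worker() {
    local cores=$1 workers=$2
    local n=$(( cores / workers ))
    if (( n < 1 )); then n=1; fi
    echo "$n"
}

NP=2                          # number of workers, one per GPU
CORES=${CORES:-$(nproc)}      # logical cores; set CORES to override
THREADS=$(threads_per_worker "$CORES" "$NP")

# Variable names are taken from the DeePMD-kit log message;
# the values are an assumed starting point and should be tuned.
export OMP_NUM_THREADS=$THREADS
export DP_INTRA_OP_PARALLELISM_THREADS=$THREADS
export DP_INTER_OP_PARALLELISM_THREADS=1

echo "threads per worker: $THREADS"

# Launch command from the question, left commented out in this sketch:
# CUDA_VISIBLE_DEVICES=0,1 horovodrun -np "$NP" dp train --mpi-log=workers input.json
```

If the settings take effect, the num_intra_threads and num_inter_threads lines in the startup summary should no longer report 0.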
I used two GPUs for parallel training with "CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train --mpi-log=workers input.json".
However, utilization on both GPUs stays below 50 percent, each GPU uses only a few hundred MiB of video memory, and each reporting interval (1000 batches in the log below) takes more than 10 seconds. When a single card is running, GPU utilization can exceed 70%, video memory usage can exceed 20,000 MiB, and each interval takes only 8 seconds.
The log is shown below:
2025-04-02 00:13:18.826635: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-02 00:13:18.863257: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-04-02 00:13:18.863282: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-02 00:13:18.863310: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-02 00:13:18.870034: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-04-02 00:13:19.668900: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1,0]:2025-04-02 00:13:21.224831: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1,1]:2025-04-02 00:13:21.224831: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1,0]:2025-04-02 00:13:21.260859: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[1,0]:2025-04-02 00:13:21.260883: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[1,1]:2025-04-02 00:13:21.260858: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[1,1]:2025-04-02 00:13:21.260883: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[1,0]:2025-04-02 00:13:21.260914: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1,1]:2025-04-02 00:13:21.260914: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[1,0]:2025-04-02 00:13:21.268204: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[1,0]:To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2025-04-02 00:13:21.268202: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[1,1]:To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2025-04-02 00:13:21.985333: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1,1]:2025-04-02 00:13:21.985417: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[1,1]:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[1,0]:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[1,0]:[2025-04-02 00:13:26,369] DEEPMD rank:0 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,0]:[2025-04-02 00:13:26,458] DEEPMD rank:0 INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[1,1]:[2025-04-02 00:13:26,498] DEEPMD rank:1 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:[2025-04-02 00:13:26,589] DEEPMD rank:1 INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[1,0]:2025-04-02 00:13:26.595129: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
[1,0]:[2025-04-02 00:13:26,716] DEEPMD rank:0 INFO Adjust batch size from 1024 to 2048
[1,0]:[2025-04-02 00:13:26,727] DEEPMD rank:0 INFO Adjust batch size from 2048 to 4096
[1,1]:2025-04-02 00:13:26.740940: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
[1,0]:[2025-04-02 00:13:26,751] DEEPMD rank:0 INFO Adjust batch size from 4096 to 8192
[1,0]:[2025-04-02 00:13:26,791] DEEPMD rank:0 INFO Adjust batch size from 8192 to 16384
[1,0]:[2025-04-02 00:13:26,984] DEEPMD rank:0 INFO Adjust batch size from 16384 to 32768
[1,1]:[2025-04-02 00:13:27,063] DEEPMD rank:1 INFO Adjust batch size from 1024 to 2048
[1,1]:[2025-04-02 00:13:27,074] DEEPMD rank:1 INFO Adjust batch size from 2048 to 4096
[1,1]:[2025-04-02 00:13:27,262] DEEPMD rank:1 INFO Adjust batch size from 4096 to 8192
[1,1]:[2025-04-02 00:13:27,560] DEEPMD rank:1 INFO Adjust batch size from 8192 to 16384
[1,0]:[2025-04-02 00:13:27,561] DEEPMD rank:0 INFO Adjust batch size from 32768 to 65536
[1,1]:[2025-04-02 00:13:27,771] DEEPMD rank:1 INFO Adjust batch size from 16384 to 32768
[1,0]:[2025-04-02 00:13:28,765] DEEPMD rank:0 INFO Adjust batch size from 65536 to 131072
[1,1]:[2025-04-02 00:13:28,765] DEEPMD rank:1 INFO Adjust batch size from 32768 to 65536
[1,1]:[2025-04-02 00:13:29,967] DEEPMD rank:1 INFO Adjust batch size from 65536 to 131072
[1,0]:[2025-04-02 00:13:31,172] DEEPMD rank:0 INFO Adjust batch size from 131072 to 262144
[1,1]:[2025-04-02 00:13:32,668] DEEPMD rank:1 INFO Adjust batch size from 131072 to 262144
[1,0]:[2025-04-02 00:13:35,272] DEEPMD rank:0 INFO training data with min nbor dist: 1.3186613041262718
[1,0]:[2025-04-02 00:13:35,273] DEEPMD rank:0 INFO training data with max nbor size: [59 60]
[1,1]:[2025-04-02 00:13:36,181] DEEPMD rank:1 INFO training data with min nbor dist: 1.3186613041262718
[1,1]:[2025-04-02 00:13:36,182] DEEPMD rank:1 INFO training data with max nbor size: [59 60]
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO  _____               _____   __  __  _____           _     _  _
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO  _____               _____   __  __  _____           _     _  _
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO Please read and cite:
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO Please read and cite:
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO See https://deepmd.rtfd.io/credits/ for details.
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO -----------------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO installed to: /root/miniconda3/envs/deepmd/lib/python3.11/site-packages/deepmd
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO See https://deepmd.rtfd.io/credits/ for details.
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO source:
[1,0]:[2025-04-02 00:13:36,185] DEEPMD rank:0 INFO -----------------------------------------------------------------------------------------------
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO source branch:
[1,1]:[2025-04-02 00:13:36,185] DEEPMD rank:1 INFO source commit:
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO installed to: /root/miniconda3/envs/deepmd/lib/python3.11/site-packages/deepmd
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO source commit at:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO use float prec: double
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build variant: cuda
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source branch:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO Backend: TensorFlow
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source commit:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO TF ver: v2.14.0-10-g99d80a9e254
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO source commit at:
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build with TF ver: 2.14.1
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO use float prec: double
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build with TF inc: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build variant: cuda
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO Backend: TensorFlow
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO build with TF lib: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO TF ver: v2.14.0-10-g99d80a9e254
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO world size: 2
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build with TF ver: 2.14.1
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO node list: autodl-container-85bd408359-2d489d0e
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build with TF inc: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO running on: autodl-container-85bd408359-2d489d0e
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow/include/
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO computing device: gpu:1
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO build with TF lib: /tmp/build-env-3kfdbo79/lib/python3.11/site-packages/tensorflow
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO CUDA_VISIBLE_DEVICES: 0,1
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO Count of visible GPUs: 2
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO world size: 2
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO num_intra_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO node list: autodl-container-85bd408359-2d489d0e
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO num_inter_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO running on: autodl-container-85bd408359-2d489d0e
[1,1]:[2025-04-02 00:13:36,186] DEEPMD rank:1 INFO -----------------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO computing device: gpu:0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO CUDA_VISIBLE_DEVICES: 0,1
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO Count of visible GPUs: 2
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO num_intra_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO num_inter_threads: 0
[1,0]:[2025-04-02 00:13:36,186] DEEPMD rank:0 INFO -----------------------------------------------------------------------------------------------
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO ---Summary of DataSystem: training -----------------------------------------------
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO ---Summary of DataSystem: training -----------------------------------------------
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO found 1 system(s):
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO found 1 system(s):
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO ../00.data/training_data 72 1 7200 1.000e+00 T
[1,1]:[2025-04-02 00:13:36,235] DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO ../00.data/training_data 72 1 7200 1.000e+00 T
[1,0]:[2025-04-02 00:13:36,235] DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO ---Summary of DataSystem: validation -----------------------------------------------
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO ---Summary of DataSystem: validation -----------------------------------------------
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO found 1 system(s):
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO found 1 system(s):
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO ../00.data/validation_data 72 1 1800 1.000e+00 T
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO ../00.data/validation_data 72 1 1800 1.000e+00 T
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO training without frame parameter
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO training without frame parameter
[1,0]:[2025-04-02 00:13:36,243] DEEPMD rank:0 INFO data stating... (this step may take long time)
[1,1]:[2025-04-02 00:13:36,243] DEEPMD rank:1 INFO data stating... (this step may take long time)
[1,1]:[2025-04-02 00:13:36,349] DEEPMD rank:1 INFO built lr
[1,0]:[2025-04-02 00:13:36,357] DEEPMD rank:0 INFO built lr
[1,0]:[2025-04-02 00:13:37,086] DEEPMD rank:0 INFO built network
[1,0]:[2025-04-02 00:13:37,086] DEEPMD rank:0 INFO Scale learning rate by coef: 2.000000
[1,1]:[2025-04-02 00:13:37,093] DEEPMD rank:1 INFO built network
[1,1]:[2025-04-02 00:13:37,093] DEEPMD rank:1 INFO Scale learning rate by coef: 2.000000
[1,0]:[2025-04-02 00:13:38,400] DEEPMD rank:0 INFO built training
[1,0]:2025-04-02 00:13:38.570861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:52:00.0, compute capability: 8.0
[1,1]:[2025-04-02 00:13:38,600] DEEPMD rank:1 INFO built training
[1,0]:[2025-04-02 00:13:38,617] DEEPMD rank:0 INFO initialize model from scratch
[1,1]:2025-04-02 00:13:38.800087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:d1:00.0, compute capability: 8.0
[1,0]:[2025-04-02 00:13:38,947] DEEPMD rank:0 INFO broadcast global variables to other tasks
[1,1]:[2025-04-02 00:13:39,122] DEEPMD rank:1 INFO receive global variables from task#0
[1,1]:[2025-04-02 00:13:39,816] DEEPMD rank:1 INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[1,0]:[2025-04-02 00:13:39,852] DEEPMD rank:0 INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[1,0]:[2025-04-02 00:13:40,551] DEEPMD rank:0 INFO batch 0: trn: rmse = 3.77e+01, rmse_e = 5.07e-01, rmse_f = 1.19e+00, lr = 1.00e-03
[1,0]:[2025-04-02 00:13:40,552] DEEPMD rank:0 INFO batch 0: val: rmse = 3.42e+01, rmse_e = 4.89e-01, rmse_f = 1.08e+00
[1,1]:2025-04-02 00:13:40.916067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:d1:00.0, compute capability: 8.0
[1,0]:2025-04-02 00:13:41.548210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:52:00.0, compute capability: 8.0
[1,1]:[2025-04-02 00:13:52,929] DEEPMD rank:1 INFO batch 1000: total wall time = 13.11 s
[1,0]:[2025-04-02 00:13:52,936] DEEPMD rank:0 INFO batch 1000: trn: rmse = 3.55e+00, rmse_e = 4.55e-02, rmse_f = 1.12e-01, lr = 1.00e-03
[1,0]:[2025-04-02 00:13:52,936] DEEPMD rank:0 INFO batch 1000: val: rmse = 1.72e+00, rmse_e = 4.61e-02, rmse_f = 5.45e-02
[1,0]:[2025-04-02 00:13:52,936] DEEPMD rank:0 INFO batch 1000: total wall time = 13.08 s
[1,1]:[2025-04-02 00:14:04,382] DEEPMD rank:1 INFO batch 2000: total wall time = 11.45 s
[1,0]:[2025-04-02 00:14:04,389] DEEPMD rank:0 INFO batch 2000: trn: rmse = 1.98e+00, rmse_e = 7.28e-02, rmse_f = 6.25e-02, lr = 1.00e-03
[1,0]:[2025-04-02 00:14:04,389] DEEPMD rank:0 INFO batch 2000: val: rmse = 1.84e+00, rmse_e = 7.44e-02, rmse_f = 5.80e-02
[1,0]:[2025-04-02 00:14:04,389] DEEPMD rank:0 INFO batch 2000: total wall time = 11.45 s
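To compare runs, it may help to convert the "total wall time" lines into a per-batch figure. The sketch below assumes the output above was saved to a file (`train.log` is a hypothetical name) and that each wall-time line covers 1000 batches, as the batch numbers in this log suggest. The function name `summarize_wall_time` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the wall time per reporting interval and the derived time per batch
# for every "total wall time" line in a DeePMD-kit training log.
#   $1: path to the log file
#   $2: batches per reporting interval (assumed to be 1000 if omitted)
summarize_wall_time() {
    awk -v interval="${2:-1000}" '
        /total wall time/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^rank:/) rank = $i      # e.g. "rank:0"
                if ($i == "=")     secs = $(i + 1) # number following "="
            }
            printf "%s %s s per %d batches = %.2f ms/batch\n",
                   rank, secs, interval, secs / interval * 1000
        }' "$1"
}

# Example call (train.log is a hypothetical file name):
# summarize_wall_time train.log 1000
```

The first interval includes start-up overhead, so later intervals are likely a fairer basis for comparison with the single-GPU run.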