Replies: 14 comments · 24 replies
-
I see you compiled both serial and MPI versions. I don't know which one you used, but the plugin must be compiled with the same MPI as LAMMPS (or without MPI, for the serial build), so a single plugin will not work for both versions.
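For reference, a quick way to check which MPI (if any) a given binary or plugin was linked against is to inspect its dynamic dependencies — a minimal sketch, assuming the usual file names (`lmp`, `libdeepmd_lmp.so`) and an install prefix in `$deepmd_root`:

```bash
# An MPI-enabled build will list libmpi* among its dependencies;
# a serial build prints nothing here.
ldd ./lmp | grep -i mpi
ldd "$deepmd_root/lib/deepmd_lmp/libdeepmd_lmp.so" | grep -i mpi
```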
-
Okay, then I will focus on the serial build and skip that last MPI step. The compiler says:
So it is compiled with the serial C++ compiler. I load
Still, I get the same error:
For reference, the following modules are active:
-
Maybe they were built with different `_GLIBCXX_USE_CXX11_ABI` settings. You can use
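The concrete suggestion was cut off in the page rendering. As a hedged sketch of one way to compare the two ABI settings — `tf.sysconfig.CXX11_ABI_FLAG` is TensorFlow's documented way to query the flag its binaries were built with; the rest is generic:

```bash
# ABI flag the TensorFlow wheel was compiled with (prints 0 or 1)
python -c "import tensorflow as tf; print(tf.sysconfig.CXX11_ABI_FLAG)"
# Default ABI of the local g++ (1 = the new CXX11 ABI)
echo '#include <string>' | g++ -x c++ -E -dM - | grep _GLIBCXX_USE_CXX11_ABI
```

If the two disagree, the C++ build has to be configured to match TensorFlow's value.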
-
Thank you, but I tried it out (using your fork), and the exact same error reappeared. I have attached the logs from my compilation; maybe they help diagnose the problem. cmake_deepmd_err.log
-
Thanks, I tried again, and now we get a slightly more informative error:
cmake_deepmd_err.log
-
I'm so sorry for the long silence. I think the new version of DeePMD-kit resolved this specific issue, as I can now load the plugin without problems. TensorFlow with ROCm:
DeePMD-kit:
Horovod:
LAMMPS with DeePMD-kit:
Create environment
Tests
Model training
Running LAMMPS
Then I get:
I am unsure to what extent this is a Slurm issue or a TensorFlow issue. I have tried TensorFlow 2.11 and 2.9, but the error remained. Any tips?
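Since the error text did not survive the page export: in case it turns out to be a launch/GPU-binding problem rather than a TensorFlow bug, this is the general shape of a LUMI-style GPU batch job — partition name, task counts, and input file are assumptions, not a verified fix:

```bash
#!/bin/bash
#SBATCH --partition=standard-g    # LUMI's GPU partition (assumed)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8       # one MPI task per GCD (assumed layout)
#SBATCH --gpus-per-node=8

srun lmp -in in.lammps
```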
-
Thank you, that worked! I am doing some performance testing, trying to replicate Table III for se_e2_a_64c, which on LUMI's MI250X GPU should be comparable to the MI250 GPU used in the paper. For 1 GPU, which I assume is how the benchmark was done, I get 7.1 microseconds/step/atom, about four times slower than the 1.74 reported in the paper for the compressed version. Am I doing something sub-optimally when running LAMMPS? The output for 64 GPUs is provided here: output. I am also a little worried about the leveling-off of performance. Is this system too small to scale linearly with the number of GPUs?
For a single GPU:
For 8 GPUs:
For 32 GPUs:
For 64 GPUs:
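One note on the comparison: the 1.74 µs/step/atom figure in the paper is for the compressed model, so if the benchmark here ran an uncompressed graph, part of the factor-of-four gap may simply be compression. A hedged sketch — file names are placeholders, and older DeePMD-kit versions additionally require the training script via `-t`:

```bash
# Tabulate the embedding net of a frozen model for faster inference.
dp compress -i frozen_model.pb -o frozen_model_compressed.pb
```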
Btw, for people who are interested in how I got it installed: TensorFlow with ROCm:
DeePMD-kit:
Horovod:
LAMMPS with DeePMD-kit:
Create environment
-
Try adjusting
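The variable name was lost in the page rendering. As a hedged guess at the kind of knobs meant here, DeePMD-kit's documentation describes controlling parallelism through these environment variables (the values below are examples, not recommendations):

```bash
export OMP_NUM_THREADS=7                      # OpenMP threads per task
export TF_INTRA_OP_PARALLELISM_THREADS=7      # threads within one TF op
export TF_INTER_OP_PARALLELISM_THREADS=1      # concurrently running TF ops
```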
-
Hi! This is great information for building DeePMD + LAMMPS on an AMD GPU system. However, I ran into a problem when compiling the DeePMD-kit libraries using
which results in this error when linking:
Any ideas on how to solve this?
-
Ok, I solved this problem. It was caused by a conflict between the ROCm libtinfo and the libtinfo from the conda env. But now I have encountered a new problem related to the 'GeluCustom' op. I tried this:
/Daniel
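For anyone hitting the same libtinfo clash, a hedged sketch of how to confirm and work around it — the conda layout is assumed, and hiding the library is a blunt instrument:

```bash
# Check whether the conda env ships its own libtinfo ...
ls "$CONDA_PREFIX"/lib/libtinfo*
# ... and, if so, hide it so the ROCm/system copy wins at link time.
mv "$CONDA_PREFIX/lib/libtinfo.so.6" "$CONDA_PREFIX/lib/libtinfo.so.6.bak"
```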
-
Hi @njzjz I removed the lines in the gelu_multi_device.cc file as you suggested. However, after recompiling, it still doesn't work :( Now, when I try to run LAMMPS with a DeePMD potential that I previously trained (using the CUDA version on another system), I get the following error
I also tried to train a new potential using the ROCm version, but I'm running into this error when using
and this error when using
Any ideas how to solve this?
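One thing that may be worth trying for the pre-trained model: DeePMD-kit ships `dp convert-from` for upgrading models saved by older releases, which may also resolve mismatched custom-op names such as `GeluCustom`. A hedged sketch — the source version `1.2` is a placeholder for whatever version actually trained the model:

```bash
dp convert-from 1.2 -i old_model.pb -o converted_model.pb
```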
-
Dear @njzjz, I'm having trouble following the instructions given by @sigbjobo to properly install DeePMD-kit on our HPE Cray EX (similar to LUMI). I have not yet completed the full set of steps, but I have followed these so far:
So far so good. The steps above do not seem to give any problems. But problems appear in the DeePMD-kit installation:
A first problem with this is that the compilation of the
For some reason, somewhere in the process a semicolon (`;`) ends up in the compilation command. So, the first question here is:
Anyway, I separately and manually tried to execute the compilation command (obviously removing the nasty semicolon by hand), and the compiler complained anyway about the
(Anyway, question 1 still applies, even if unsetting HIP_HIPCC_FLAGS bypasses the problem.) But there is still a second problem in the compilation command, which is the use of
So, the second question here is:
Thanks. So far these are the questions. I will try to move forward and will come back if any problems appear that I cannot solve. Regards,
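On the semicolon question: CMake stores lists as semicolon-separated strings, so if `HIP_HIPCC_FLAGS` reaches the compiler invocation as an unexpanded CMake list, a literal `;` can leak into the command line. A sketch of the bypass already mentioned above, plus the quoted-string alternative (the flag values are placeholders):

```bash
# Bypass: clear the variable before configuring.
unset HIP_HIPCC_FLAGS
cmake ..   # the usual DeePMD-kit options go here

# Alternative: pass the flags as one quoted, space-separated string, e.g.
#   cmake -DHIP_HIPCC_FLAGS="-fPIC -O2" ..
```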
-
Hi, after LUMI was updated, the old installation stopped working. The problem is that the installation points to /opt/rocm, as sketched in this issue, which has changed from ROCm 5 to ROCm 6. @njzjz, do you know how to ensure that DeePMD points to the correct version of ROCm? I am okay with a workaround like the one here.
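As a hedged workaround sketch until a proper fix lands: re-point the build environment at the versioned ROCm directory instead of the `/opt/rocm` symlink and rebuild. The version number below is an assumption — use whatever the updated LUMI actually ships:

```bash
export ROCM_PATH=/opt/rocm-6.0.3               # adjust to the installed version
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
# then re-run the DeePMD-kit cmake configure + build with this environment
```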
-
For anyone wondering, this is a working installation for LUMI after the update to ROCm 6:
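(The command block itself did not survive the page export. Purely as a hedged reconstruction of the step sequence named throughout this thread — every module name, package, and flag below is an assumption, not the poster's exact recipe:)

```bash
# 1. Environment: LUMI's ROCm module plus a Python venv (names assumed)
module load rocm
python -m venv deepmd-env && source deepmd-env/bin/activate
# 2. TensorFlow with ROCm support
pip install tensorflow-rocm
# 3. DeePMD-kit Python interface
pip install deepmd-kit
# 4. C++ libraries + LAMMPS plugin, built against ROCm
cd deepmd-kit/source && mkdir -p build && cd build
cmake -DUSE_ROCM_TOOLKIT=TRUE -DCMAKE_INSTALL_PREFIX="$deepmd_root" ..
make -j8 && make install
```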
-
I am trying to install DeePMD on LUMI, which is an AMD-based system. I have managed to install the DeePMD-kit Python interface, but fail when installing LAMMPS. I am making this post in the hope of arriving at a fully working installation on LUMI.
TensorFlow with ROCm:
DeePMD-kit:
Horovod:
DeePMD-kit libraries:
However, when I try to use the plugin mode, I get:
Any help is much appreciated!
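For reference, plugin mode is typically wired up as below; the install prefix and plugin directory are assumptions about this particular build:

```bash
# LAMMPS (built with PKG_PLUGIN) scans LAMMPS_PLUGIN_PATH for plugins.
export LAMMPS_PLUGIN_PATH=$deepmd_root/lib/deepmd_lmp
lmp -in in.lammps
# or load it explicitly inside the LAMMPS input script:
#   plugin load libdeepmd_lmp.so
```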