DeePMD + LAMMPS on LUMI #2523

sigbjobo · 2023-05-12T15:08:31Z

sigbjobo
May 12, 2023

I am trying to install DeePMD on LUMI, which is an AMD-based system. I have managed to install the DeePMD-kit python interface, but fail on the installation of LAMMPS. I making this post in the hope of getting a fully working installation on LUMI.

Tensorflow with ROCM:

#Install rccl
module load LUMI/22.08 partition/G
module load EasyBuild-user
eb rccl-2.12.7-cpeGNU-22.08.eb -r

#Make a virtual environment
module reset
module load LUMI/22.08 partition/G
module swap PrgEnv-cray PrgEnv-gnu
module load cray-python
module load rocm
module load rccl
python3 -m venv deepmd-env # needs to be after loading cray-python
source deepmd-env/bin/activate
LDFLAGS='-L/opt/rocm/lib -lamdhip64' pip install tensorflow-rocm

DeePMD-kit:

git clone --recursive https://github.com/deepmodeling/deepmd-kit.git deepmd-kit 
cd deepmd-kit

export DP_VARIANT=rocm
export HIP_HIPCC_FLAGS="--amdgpu-target=gfx90a" # For gfx90a
pip install .

Horovod:

# Install mpi4py first
LDFLAGS='-L/opt/rocm/lib -lamdhip64' MPICC=cc pip install mpi4py

# Install horovod, loading the same modules as before

export HOROVOD_WITHOUT_MXNET=1
export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_GPU=ROCM
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_WITHOUT_GLOO=1
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_ROCM_PATH=$ROCM_PATH
export HOROVOD_RCCL_HOME=$EBROOTRCCL
export RCCL_INCLUDE_DIRS=$HOROVOD_RCCL_HOME/include
export HOROVOD_RCCL_LIB=$HOROVOD_RCCL_HOME/lib
export HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all"
export HCC_AMDGPU_TARGET=gfx90a
CXX=CC pip install --verbose horovod==0.26.1

DeePMD-kit libraries:

#LAMMPS
wget https://github.com/lammps/lammps/archive/stable_23Jun2022_update2.tar.gz
rm -rf lammps-stable_23Jun2022_update2
tar xf stable_23Jun2022_update2.tar.gz
cd lammps-stable_23Jun2022_update2/
chmod -R ugo+rwx .
export LAMMPS_ROOT=$PWD
cd ../

#DeePMD-kit
cd deepmd-kit
export deepmd_root=$PWD/install
export deepmd_source=$PWD/source

cd $deepmd_source
# Version extraction was problematic, so I set manually
find . -type f | xargs -l sed -i  's#$(./lmp_version.sh)#20220623#g'

rm -rf build $deepmd_root
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$deepmd_root -DUSE_ROCM_TOOLKIT=TRUE -DUSE_TF_PYTHON_LIBS=TRUE -DLAMMPS_SOURCE_ROOT=$LAMMPS_ROOT -DLAMMPS_VERSION_NUMBER=20220623 -DCMAKE_BUILD_TYPE=Release ..


make -j 32
make install

cd $deepmd_root
PATH=$deepmd_root/bin:$PATH
PATH=$deepmd_root/lib:$PATH
export PATH

cd $LAMMPS_ROOT
rm -rf build
mkdir build
cd build
cmake -D PKG_PLUGIN=ON -D PKG_KSPACE=ON -D LAMMPS_INSTALL_RPATH=ON -DBUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_FULL_LIBDIR=${deepmd_root}/lib   -DDOWNLOAD_PLUMED=NO  -D BUILD_MPI=no   ../cmake
make -j16
make install
cp lmp lmp_serial

cmake -D PKG_PLUGIN=ON -D PKG_KSPACE=ON -D LAMMPS_INSTALL_RPATH=ON -DBUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_FULL_LIBDIR=${deepmd_root}/lib   -DDOWNLOAD_PLUMED=NO  -D BUILD_MPI=yes   ../cmake
make -j16
make install
cp lmp lmp_mpi

However, when I try to use the plugin mode I get:

Open of file libdeepmd_lmp.so failed: XXXXXl/deepmd-kit/install/lib/libdeepmd_lmp.so: undefined symbol: _ZN9LAMMPS_NS5utils11getsyserrorEv
ERROR: Unrecognized pair style 'deepmd' (src/force.cpp:279)

Is there something I am doing wrong? 2) Is it not possible to use python -DUSE_TF_PYTHON_LIBS=TRUE and then compile lammps?

Any help is much appreciated!

njzjz · 2023-05-12T19:07:33Z

njzjz
May 12, 2023
Maintainer

I see you compiled both serial and MPI versions. I don't know what you used, but the plugin should compile with the same MPI or without MPI, so it will not work for both versions.

0 replies

sigbjobo · 2023-05-14T15:50:55Z

sigbjobo
May 14, 2023
Author

Okay, then I focus on the serial and skip that last MPI step.

The compiler says:

-- C++ Compiler:     /opt/cray/pe/gcc/11.2.0/bin/c++
      Type:          GNU
      Version:       11.2.0
      C++ Flags:     -O2 -g -DNDEBUG
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_GZIP;FFT_KISS;LMP_PLUGIN
-- <<< Linker flags: >>>

So it is compiled with serial c++.

I load

export LAMMPS_PLUGIN_PATH=$deepmd_root/lib/deepmd_lmp:${LAMMPS_PLUGIN_PATH}
export LD_LIBRARY_PATH=$(python -c 'import tensorflow; print(tensorflow.sysconfig.get_lib())'):${LD_LIBRARY_PATH}

Still, I get same error:

Open of file libdeepmd_lmp.so failed: XXXX/boresigb/software/deepmd_install/deepmd-kit/install/lib/libdeepmd_lmp.so: undefined symbol: _ZN9LAMMPS_NS5utils11getsyserrorEv

For reference, the following modules are active:

 1) ModuleLabel/label (S)   6) craype-accel-amd-gfx90a               (H)  11) perftools-base/22.06.0  16) cray-libsci/22.08.1.1
  2) lumi-tools/23.04  (S)   7) libfabric/1.15.2.0                         12) gcc/11.2.0              17) cray-dsmml/0.2.2
  3) init-lumi/0.2     (S)   8) craype-network-ofi                    (H)  13) cray-python/3.9.12.1    18) cpeGNU/22.08
  4) LUMI/22.08        (S)   9) xpmem/2.5.2-2.4_3.20__gd0f7936.shasta      14) craype/2.7.17           19) rocm/5.2.3               (H)
  5) craype-x86-trento (H)  10) partition/G                           (S)  15) cray-mpich/8.1.18       20) rccl/2.12.7-cpeGNU-22.08

0 replies

njzjz · 2023-05-14T19:21:41Z

njzjz
May 14, 2023
Maintainer

Maybe they have different _GLIBCXX_USE_CXX11_ABI macros. You can use make VERBOSE=1 to check detailed flags.

1 reply

njzjz May 14, 2023
Maintainer

I submit #2527, which will use the default _GLIBCXX_USE_CXX11_ABI for the LAMMPS plugin instead of that from the TensorFlow library.

sigbjobo · 2023-05-15T04:54:00Z

sigbjobo
May 15, 2023
Author

Thank you, but I tried it out (using your fork), and the exact same error reappeared. I have added the logs from my compilations. Maybe this can help diagnose the problem.

cmake_deepmd_err.log
cmake_deepmd.log
cmake_lmp_err.log
cmake_lmp.log
make_deepmd_err.log
make_deepmd_install_err.log
make_deepmd_install.log
make_deepmd.log
make_lmp_err.log
make_lmp_install_err.log
make_lmp_install.log
make_lmp.log

1 reply

njzjz May 15, 2023
Maintainer

I found this command:

cd /project/XXXXX/software/deepmd_install/deepmd-kit/source/build/lmp/plugin && /opt/cray/pe/craype/2.7.17/bin/CC -DLAMMPS_VERSION_NUMBER=20220623 -DLMPPLUGIN -DTENSORFLOW_USE_ROCM -D_GLIBCXX_USE_CXX11_ABI=0 -D__HIP_PLATFORM_HCC__ -Ddeepmd_lmp_EXPORTS -I/project/XXXXX/software/deepmd_install/deepmd-kit/source/build/lmp/plugin -I/project/XXXXX/software/deepmd_install/deepmd-kit/source/lmp/plugin/.. -I/project/XXXXX/software/deepmd_install/lammps-stable_23Jun2022_update2/src/PLUGIN -I/project/XXXXX/software/deepmd_install/lammps-stable_23Jun2022_update2/src/KSPACE -I/project/XXXXX/software/deepmd_install/lammps-stable_23Jun2022_update2/src -I/project/XXXXX/software/deepmd_install/lammps-stable_23Jun2022_update2/src/EXTRA-FIX -I/project/XXXXX/software/deepmd_install/deepmd-kit/source/api_c/include -fopenmp -O3 -DNDEBUG -fPIC -std=gnu++17 -Winvalid-pch -include /project/XXXXX/software/deepmd_install/deepmd-kit/source/build/lmp/plugin/CMakeFiles/deepmd_lmp.dir/cmake_pch.hxx -MD -MT lmp/plugin/CMakeFiles/deepmd_lmp.dir/__/fix_dplr.cpp.o -MF CMakeFiles/deepmd_lmp.dir/__/fix_dplr.cpp.o.d -o CMakeFiles/deepmd_lmp.dir/__/fix_dplr.cpp.o -c /project/XXXXX/software/deepmd_install/deepmd-kit/source/lmp/fix_dplr.cpp

It used -D_GLIBCXX_USE_CXX11_ABI=0 as your TensorFlow is also built with this flag. Usually, the default value is 1. You need to pass -D_GLIBCXX_USE_CXX11_ABI=0 to lammps (try cmake -D CMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0").

sigbjobo · 2023-05-15T05:49:10Z

sigbjobo
May 15, 2023
Author

Thanks, I tried again, and now we get a slightly more informative error:

Open of file libdeepmd_lmp.so failed: /project/XXXX/software/deepmd_install/deepmd-kit/install/lib/libdeepmd_lmp.so: undefined symbol: _ZN9LAMMPS_NS5Error4_allERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiN3fmt6v8_lmp17basic_string_viewIcEENSA_17basic_format_argsINSA_20basic_format_contextINSA_8appenderEcEEEE

cmake_deepmd_err.log
cmake_deepmd.log
cmake_lmp_err.log
cmake_lmp.log
make_deepmd_err.log
make_deepmd_install_err.log
make_deepmd_install.log
make_deepmd.log
make_lmp_err.log
make_lmp_install_err.log
make_lmp_install.log
make_lmp.log

1 reply

njzjz May 15, 2023
Maintainer

I think there are still two problems that may matter:
(1) deepmd uses -std=c++17 but lammps uses -std=c++11
(2) deepmd finds MPI but lammps does not compile with MPI, as shown in cmake_deepmd.log

-- Found MPI_C: /opt/cray/pe/craype/2.7.17/bin/cc (found version "3.1") 
-- Found MPI_CXX: /opt/cray/pe/craype/2.7.17/bin/CC (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")

sigbjobo · 2023-05-25T05:38:12Z

sigbjobo
May 25, 2023
Author

I'm so sorry for the long silence. I think the new version of the DeePMD-kit resolved this specific issue, as I can load the plugin now without problems.

Tensorflow with ROCM:

#Install rccl
module load LUMI/22.08 partition/G
module load EasyBuild-user
eb rccl-2.12.7-cpeGNU-22.08.eb -r

#Make a virtual environment
module reset
module load LUMI/22.08 partition/G
module swap PrgEnv-cray PrgEnv-gnu
module load cray-python
module load rocm
module load rccl
python3 -m venv deepmd-env # needs to be after loading cray-python
source deepmd-env/bin/activate
LDFLAGS='-L/opt/rocm/lib -lamdhip64' pip install tensorflow-rocm==2.11.0.*

DeePMD-kit:

git clone --recursive https://github.com/deepmodeling/deepmd-kit.git deepmd-kit 
cd deepmd-kit

export DP_VARIANT=rocm
export HIP_HIPCC_FLAGS="--amdgpu-target=gfx90a" 
pip install .

Horovod:

# Install mpi4py first
LDFLAGS='-L/opt/rocm/lib -lamdhip64' MPICC=cc pip install mpi4py

# Install horovod, loading the same modules as before

export HOROVOD_WITHOUT_MXNET=1
export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_GPU=ROCM
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_WITHOUT_GLOO=1
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_ROCM_PATH=$ROCM_PATH
export HOROVOD_RCCL_HOME=$EBROOTRCCL
export RCCL_INCLUDE_DIRS=$HOROVOD_RCCL_HOME/include
export HOROVOD_RCCL_LIB=$HOROVOD_RCCL_HOME/lib
export HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all"
export HCC_AMDGPU_TARGET=gfx90a
CXX=CC pip install --verbose horovod==0.26.1

LAMMPS with DeePMD-kit:

#LAMMPS
wget https://github.com/lammps/lammps/archive/stable_23Jun2022_update2.tar.gz
rm -rf lammps-stable_23Jun2022_update2
tar xf stable_23Jun2022_update2.tar.gz
cd lammps-stable_23Jun2022_update2/
chmod -R ugo+rwx .
export LAMMPS_ROOT=$PWD
cd ../

#DeePMD-kit
cd deepmd-kit
export deepmd_root=$PWD/install
export deepmd_source=$PWD/source

cd $deepmd_source
# Version extraction was problematic, so I set manually
find . -type f | xargs -l sed -i  's#$(./lmp_version.sh)#20220623#g'

rm -rf build $deepmd_root
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$deepmd_root -DUSE_ROCM_TOOLKIT=TRUE -DUSE_TF_PYTHON_LIBS=TRUE -DLAMMPS_SOURCE_ROOT=$LAMMPS_ROOT -DLAMMPS_VERSION_NUMBER=20220623 -DCMAKE_BUILD_TYPE=Release ..


make -j 32
make install

cd $deepmd_root
PATH=$deepmd_root/bin:$PATH
PATH=$deepmd_root/lib:$PATH
export PATH

cd $LAMMPS_ROOT
rm -rf build
mkdir build
cd build
CXX=CC cmake -D PKG_PLUGIN=ON -D PKG_KSPACE=ON -D LAMMPS_INSTALL_RPATH=ON -DBUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_FULL_LIBDIR=${deepmd_root}/lib   -DDOWNLOAD_PLUMED=NO  -D BUILD_MPI=no   ../cmake
make -j16
make install
cp lmp lmp_serial

CXX=CC cmake -D PKG_PLUGIN=ON -D PKG_KSPACE=ON -D LAMMPS_INSTALL_RPATH=ON -DBUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_FULL_LIBDIR=${deepmd_root}/lib   -DDOWNLOAD_PLUMED=NO  -D BUILD_MPI=yes   ../cmake
make -j16
make install
cp lmp lmp_mpi

Create environment

cd $LAMMPS_ROOT/../

echo """module reset
module load LUMI/22.08 partition/G
module swap PrgEnv-cray PrgEnv-gnu
module load cray-python
module load rocm
module load rccl
source $PWD/deepmd-env/bin/activate
export LAMMPS_PLUGIN_PATH=$PWD/lib/deepmd_lmp:\${LAMMPS_PLUGIN_PATH}
PATH=$PWD/deepmd-kit/install/bin/:"$PWD"/deepmd-kit/install//lib:\$PATH
export LD_LIBRARY_PATH=\$(python -c 'import tensorflow; print(tensorflow.sysconfig.get_lib())'):\${LD_LIBRARY_PATH}
"""  > env.sh

Tests

Model training

cd $LAMMPS_ROOT/../
salloc --nodes=1 --account=project_XXXX  --partition=dev-g  --time=00:30:00  --gpus-per-node=1
source env.sh
cd deepmd-kit/examples/water/se_e2_a
sed -i 's/1000000/1000/g' input.json
srun dp train input.json
srun dp freeze
srun dp compress

Running Lammps

cd ../water/lmp/ 
cp ../se_e2_a/frozen*.pb .
srun lmp -in in.plugin.lammps

Then I get:

2023-05-25 08:25:35.326952: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 58046 MB memory:  -> device: 0, name: , pci bus id: 0000:c1:00.0
2023-05-25 08:25:35.563763: F tensorflow/core/framework/op.cc:215] Non-OK-status: RegisterAlreadyLocked(deferred_[i]) status: ALREADY_EXISTS: Op with name Gelu
srun: error: nid007567: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=3581028.9

To what extent this is an issue related to Slurm or TensorFlow, I am unsure. I have tried TensorFlow 2.11 and 2.9, but the error remained. Any tips?

1 reply

njzjz May 25, 2023
Maintainer

See #2223

sigbjobo · 2023-06-27T09:22:52Z

sigbjobo
Jun 27, 2023
Author

Thank you, that worked!

I am doing some performance testing, trying to replicate table III for se_e2_a_64c, which for the MI250X GPU of LUMI should be similar to the MI250 GPU of the paper. For 1 GPU, which I assume is how the benchmark is done, I get 7.1 microseconds/step/atom, which is about four times slower than 1.74 for the compressed version reported in the paper. Am I doing something sub-optimally when running lammps? The output for 64 GPUs is provided here: output. I am also a little worried about the leveling of performance. Is this system too small for linear scaling with the number of GPUs?

### (Assuming deepmd has been loaded)

mkdir -p test
cd test
### Download model
wget https://github.com/deepmodeling-activity/deepmd-kit-v2-paper/raw/main/models/03/frozen_model.pb
dp convert-from -i frozen_model.pb -o graph.pb
dp compress -i graph.pb -o graph_compressed.pb
mv graph_compressed.pb frozen_model.pb

### Download lammps input
wget https://github.com/deepmodeling-activity/deepmd-kit-v2-paper/raw/main/models/water.in
wget https://github.com/deepmodeling-activity/deepmd-kit-v2-paper/raw/main/models/water.lmp
sed -i 's#../water.lmp#water.lmp#g' water.in

### Run lammps
salloc --nodes=1 --account=project_XXXXXXXX  --partition=dev-g  --time=00:30:00  --gpus-per-node=8 --ntasks-per-node=8

### Load DEEPMD first

### LUMI specific: https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/lumig-job/
cat << EOF > select_gpu
#!/bin/bash

export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./select_gpu

export OMP_NUM_THREADS=1
export TF_INTRA_OP_PARALLELISM_THREADS=1
export TF_INTER_OP_PARALLELISM_THREADS=1
export MPICH_GPU_SUPPORT_ENABLED=1

srun -n1 ./select_gpu lmp -in water.in

For a single GPU:

Performance: 0.495 ns/day, 48.514 hours/ns, 11.451 timesteps/s
112.7% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 42.919     | 42.919     | 42.919     |   0.0 | 98.49
Neigh   | 0.57837    | 0.57837    | 0.57837    |   0.0 |  1.33
Comm    | 0.030451   | 0.030451   | 0.030451   |   0.0 |  0.07
Output  | 0.0016457  | 0.0016457  | 0.0016457  |   0.0 |  0.00
Modify  | 0.039294   | 0.039294   | 0.039294   |   0.0 |  0.09
Other   |            | 0.006824   |            |       |  0.02

For 8 GPUs:

Performance: 2.670 ns/day, 8.988 hours/ns, 61.813 timesteps/s
121.9% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 7.4369     | 7.793      | 7.895      |   5.1 | 96.53
Neigh   | 0.069205   | 0.069658   | 0.070071   |   0.1 |  0.86
Comm    | 0.088347   | 0.19084    | 0.55132    |  32.8 |  2.36
Output  | 0.00098157 | 0.0010792  | 0.0012341  |   0.3 |  0.01
Modify  | 0.0051108  | 0.008179   | 0.0090884  |   1.3 |  0.10
Other   |            | 0.009965   |            |       |  0.12

For 32:

Performance: 5.659 ns/day, 4.241 hours/ns, 130.999 timesteps/s
132.6% CPU use with 32 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 3.5418     | 3.6271     | 3.6996     |   2.3 | 95.22
Neigh   | 0.017185   | 0.017556   | 0.018064   |   0.2 |  0.46
Comm    | 0.082004   | 0.15449    | 0.2407     |  11.0 |  4.06
Output  | 0.0037054  | 0.003862   | 0.0039689  |   0.1 |  0.10
Modify  | 0.0015383  | 0.0020008  | 0.0024014  |   0.5 |  0.05
Other   |            | 0.004181   |            |       |  0.11

For 64 GPUs:

Performance: 7.658 ns/day, 3.134 hours/ns, 177.266 timesteps/s
134.6% CPU use with 64 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 2.6008     | 2.6449     | 2.7271     |   1.9 | 93.96
Neigh   | 0.0086954  | 0.0089974  | 0.0098914  |   0.2 |  0.32
Comm    | 0.070264   | 0.15235    | 0.19663    |   8.0 |  5.41
Output  | 0.0041094  | 0.0044192  | 0.0046018  |   0.2 |  0.16
Modify  | 0.00092054 | 0.0011745  | 0.0014495  |   0.3 |  0.04
Other   |            | 0.003101   |            |       |  0.11

Btw, for people who are interested in how I got it installed:

Tensorflow with ROCM:

#Install rccl
module load LUMI/22.08 partition/G
module load EasyBuild-user
eb rccl-2.12.7-cpeGNU-22.08.eb -r

#Make a virtual environment
module reset
module load LUMI/22.08 partition/G
module swap PrgEnv-cray PrgEnv-gnu
module load cray-python
module load rocm
module load rccl
python3 -m venv deepmd-env # needs to be after loading cray-python
source deepmd-env/bin/activate
LDFLAGS='-L/opt/rocm/lib -lamdhip64' pip install tensorflow-rocm==2.8.*

DeePMD-kit:

git clone --recursive https://github.com/deepmodeling/deepmd-kit.git deepmd-kit 
cd deepmd-kit
sed -i '/gelu_multi_device.cc/d' source/op/CMakeLists.txt
export DP_VARIANT=rocm
export HIP_HIPCC_FLAGS="--amdgpu-target=gfx90a" 
pip install .

Horovod:

# Install mpi4py first
LDFLAGS='-L/opt/rocm/lib -lamdhip64' MPICC=cc pip install mpi4py

# Install horovod, loading the same modules as before

export HOROVOD_WITHOUT_MXNET=1
export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_GPU=ROCM
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_WITHOUT_GLOO=1
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_ROCM_PATH=$ROCM_PATH
export HOROVOD_RCCL_HOME=$EBROOTRCCL
export RCCL_INCLUDE_DIRS=$HOROVOD_RCCL_HOME/include
export HOROVOD_RCCL_LIB=$HOROVOD_RCCL_HOME/lib
export HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all"
export HCC_AMDGPU_TARGET=gfx90a
CXX=CC pip install --verbose horovod==0.26.1

LAMMPS with DeePMD-kit:

#LAMMPS
wget https://github.com/lammps/lammps/archive/stable_23Jun2022_update2.tar.gz
rm -rf lammps-stable_23Jun2022_update2
tar xf stable_23Jun2022_update2.tar.gz
cd lammps-stable_23Jun2022_update2/
chmod -R ugo+rwx .
export LAMMPS_ROOT=$PWD
cd ../

#DeePMD-kit
cd deepmd-kit
export deepmd_root=$PWD/install
export deepmd_source=$PWD/source

cd $deepmd_source
# Version extraction was problematic, so I set manually
find . -type f | xargs -l sed -i  's#$(./lmp_version.sh)#20220623#g'

rm -rf build $deepmd_root
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$deepmd_root -DUSE_ROCM_TOOLKIT=TRUE -DUSE_TF_PYTHON_LIBS=TRUE -DLAMMPS_SOURCE_ROOT=$LAMMPS_ROOT -DLAMMPS_VERSION_NUMBER=20220623 -DCMAKE_BUILD_TYPE=Release -DHIP_HIPCC_FLAGS="--amdgpu-target=gfx90a" -DDP_VARIANT=rocm ..


make -j 32
make install

cd $deepmd_root
PATH=$deepmd_root/bin:$PATH
PATH=$deepmd_root/lib:$PATH
export PATH

cd $LAMMPS_ROOT
rm -rf build
mkdir build
cd build
CXX=CC cmake -D PKG_PLUGIN=ON -D PKG_KSPACE=ON -D LAMMPS_INSTALL_RPATH=ON -DBUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_FULL_LIBDIR=${deepmd_root}/lib   -DDOWNLOAD_PLUMED=NO  -D BUILD_MPI=no   ../cmake
make -j16
make install
cp lmp lmp_serial

CXX=CC cmake -D PKG_PLUGIN=ON -D PKG_KSPACE=ON -D LAMMPS_INSTALL_RPATH=ON -DBUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=${deepmd_root} -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_FULL_LIBDIR=${deepmd_root}/lib   -DDOWNLOAD_PLUMED=NO  -D BUILD_MPI=yes   ../cmake
make -j16
make install
cp lmp lmp_mpi

Create environment

cd $LAMMPS_ROOT/../

echo """module reset
module load LUMI/22.08 partition/G
module swap PrgEnv-cray PrgEnv-gnu
module load cray-python
module load rocm
module load rccl
source $PWD/deepmd-env/bin/activate
export LAMMPS_PLUGIN_PATH=$PWD/lib/deepmd_lmp:\${LAMMPS_PLUGIN_PATH}
PATH=$PWD/deepmd-kit/install/bin/:"$PWD"/deepmd-kit/install//lib:\$PATH
export LD_LIBRARY_PATH=\$(python -c 'import tensorflow; print(tensorflow.sysconfig.get_lib())'):\${LD_LIBRARY_PATH}
"""  > env.sh

0 replies

njzjz · 2023-06-27T09:35:18Z

njzjz
Jun 27, 2023
Maintainer

Try adjusting replicate 4 4 4 to a larger number to see if the performance improves.

1 reply

sigbjobo Jun 27, 2023
Author

Sorry, I had a typo and used the uncompressed version. Now it indeed is 1.81 microseconds/step/atom.

Dankomaister · 2023-06-29T03:57:51Z

Dankomaister
Jun 29, 2023

Hi!

This is great information for building DeePMD + LAMMPS on an AMD GPU system.
I tried to follow this to build DeePMD + LAMMPS on Dardel (a Swedish HPC system similar to LUMI)

However ran in to a problem when compiling the DeePMD-kit libraries using

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$deepmd_root -DUSE_ROCM_TOOLKIT=TRUE -DUSE_TF_PYTHON_LIBS=TRUE -DLAMMPS_SOURCE_ROOT=$LAMMPS_ROOT -DLAMMPS_VERSION_NUMBER=20220623 -DCMAKE_BUILD_TYPE=Release -DHIP_HIPCC_FLAGS="--amdgpu-target=gfx90a" -DDP_VARIANT=rocm ..
make -j
make install

Which results in this error on linking

[ 96%] Built target deepmd_c
[ 97%] Linking CXX shared module libdeepmd_lmp_low.so
[ 98%] Linking CXX shared module libdeepmd_lmp.so
[ 98%] Linking CXX shared library libdeepmd_gromacs.so
[ 98%] Built target deepmd_lmp_low
[ 98%] Built target deepmd_lmp
[ 98%] Built target deepmd_gromacs
[ 99%] Linking CXX executable dp_ipi
[100%] Linking CXX executable dp_ipi_low
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `set_curterm@NCURSES6_TINFO_5.0.19991023'
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `del_curterm@NCURSES6_TINFO_5.0.19991023'
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `setupterm@NCURSES6_TINFO_5.0.19991023'
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `tigetnum@NCURSES6_TINFO_5.0.19991023'
collect2: error: ld returned 1 exit status
make[2]: *** [ipi/CMakeFiles/dp_ipi.dir/build.make:121: ipi/dp_ipi] Error 1
make[1]: *** [CMakeFiles/Makefile2:519: ipi/CMakeFiles/dp_ipi.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `set_curterm@NCURSES6_TINFO_5.0.19991023'
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `del_curterm@NCURSES6_TINFO_5.0.19991023'
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `setupterm@NCURSES6_TINFO_5.0.19991023'
/usr/bin/ld: /opt/rocm-5.0.2/lib64/libamd_comgr.so: undefined reference to `tigetnum@NCURSES6_TINFO_5.0.19991023'
collect2: error: ld returned 1 exit status
make[2]: *** [ipi/CMakeFiles/dp_ipi_low.dir/build.make:121: ipi/dp_ipi_low] Error 1
make[1]: *** [CMakeFiles/Makefile2:574: ipi/CMakeFiles/dp_ipi_low.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Any ideas on how to solve this?

2 replies

sigbjobo Jun 29, 2023
Author

I have never seen this error. But is the exact same gpu as on lumi, so should in principle work. Did the first steps sucseed?

njzjz Jun 29, 2023
Maintainer

Please see ROCm/ROCm#2121. It seems a dependency is missing when installing rocm.

Dankomaister · 2023-08-14T08:53:27Z

Dankomaister
Aug 14, 2023

Ok, I solved this problem.

It is related to a conflict between the ROCM libtinfo and the libtinfo from the conda env.
Removing libtinfo from conda env solved the problem!

But now I encountered a new problem with related to the 'GeluCustom' op.
I want to use a DeePMD potential that I have trained previously (https://doi.org/10.21203/rs.3.rs-3197610/v1)
Here I used the 'GeluCustom' op as it is significantly faster which is crucial for these simulations.
So how can I use a potential trained with the 'GeluCustom' op in the ROCM version of DeePMD?

I tried this:
sed -i '/gelu_multi_device.cc/d' source/op/CMakeLists.txt
but then deepmd complains that there is no "GeluCustom" op.
Compiling with gelu_multi_device.cc gives the
Non-OK-status: RegisterAlreadyLocked(deferred_[i]) status: ALREADY_EXISTS: Op with name Gelu
error.

/Daniel

5 replies

Dankomaister Aug 22, 2023

@njzjz perhaps this should be considered as a bug?
seems like it is not possible to use Gelu or GeluCustom at all for the ROCM version.

/Daniel

njzjz Aug 22, 2023
Maintainer

This is not a bug. The official TensorFlow does not have a Gelu Op. The AMD's branch adds it, which causes conflicts. There is no way to distinguish the official one and the AMD's customized one, so the official one is assumed to be installed.
The workaround is to comment the definition of Gelu but leave GeluCustom available.

Dankomaister Aug 22, 2023

@njzjz okay! how does one "comment the definition of Gelu but leave GeluCustom available" ?
I have tried many things but nothing seems to work :(
Have also tried to retrain my potential using the ROCM version with both gelu and gelu_tf activation functions in DeePMD but neither worked :(

@sigbjobo have you gotten gelu to work with your version on LUMI?

/Daniel

sigbjobo Aug 22, 2023
Author

Hi, I as said above in this thread, everything works on lumi with the latest guide that I provided in this thread.

njzjz Aug 22, 2023
Maintainer

Remove these lines:

deepmd-kit/source/op/gelu_multi_device.cc

Lines 5 to 21 in 53a1078

    
           REGISTER_OP("Gelu") 
        
               .Attr("T: {float, double} = DT_DOUBLE") 
        
               .Input("x: T") 
        
               .Output("output: T"); 
        
           REGISTER_OP("GeluGrad") 
        
               .Attr("T: {float, double} = DT_DOUBLE") 
        
               .Input("dy: T") 
        
               .Input("x: T") 
        
               .Output("output: T"); 
        
           REGISTER_OP("GeluGradGrad") 
        
               .Attr("T: {float, double} = DT_DOUBLE") 
        
               .Input("dy: T") 
        
               .Input("dy_: T") 
        
               .Input("x: T") 
        
               .Output("output: T");

deepmd-kit/source/op/gelu_multi_device.cc

Lines 174 to 182 in 53a1078

    
           REGISTER_KERNEL_BUILDER(                                                  \ 
        
               Name("Gelu").Device(DEVICE_CPU).TypeConstraint<T>("T"),               \ 
        
               GeluOp<CPUDevice, T>);                                                \ 
        
           REGISTER_KERNEL_BUILDER(                                                  \ 
        
               Name("GeluGrad").Device(DEVICE_CPU).TypeConstraint<T>("T"),           \ 
        
               GeluGradOp<CPUDevice, T>);                                            \ 
        
           REGISTER_KERNEL_BUILDER(                                                  \ 
        
               Name("GeluGradGrad").Device(DEVICE_CPU).TypeConstraint<T>("T"),       \ 
        
               GeluGradGradOp<CPUDevice, T>);                                        \

deepmd-kit/source/op/gelu_multi_device.cc

Lines 196 to 205 in 53a1078

    
           #define REGISTER_GPU(T)                                                     \ 
        
             REGISTER_KERNEL_BUILDER(                                                  \ 
        
                 Name("Gelu").Device(DEVICE_GPU).TypeConstraint<T>("T"),               \ 
        
                 GeluOp<GPUDevice, T>);                                                \ 
        
             REGISTER_KERNEL_BUILDER(                                                  \ 
        
                 Name("GeluGrad").Device(DEVICE_GPU).TypeConstraint<T>("T"),           \ 
        
                 GeluGradOp<GPUDevice, T>);                                            \ 
        
             REGISTER_KERNEL_BUILDER(                                                  \ 
        
                 Name("GeluGradGrad").Device(DEVICE_GPU).TypeConstraint<T>("T"),       \ 
        
                 GeluGradGradOp<GPUDevice, T>);                                        \

Dankomaister · 2023-09-06T04:48:52Z

Dankomaister
Sep 6, 2023

Hi @njzjz

I removed the lines in the gelu_multi_device.cc file as you suggested. However after recompiling it still doesn't work :(

Now when I try and run LAMMPS with my DeePMD potential which I previously trained (using the CUDA version on another system) I get the following error

INVALID_ARGUMENT: No OpKernel was registered to support Op 'GeluCustom' used by {{node type_embed_net/GeluCustom}} with these attrs: [T=DT_FLOAT]
Registered devices: [CPU, GPU]
Registered kernels:
  <no registered kernels>

	 [[type_embed_net/GeluCustom]]
INVALID_ARGUMENT: No OpKernel was registered to support Op 'GeluCustom' used by {node type_embed_net/GeluCustom} with these attrs: [T=DT_FLOAT]
Registered devices: [CPU, GPU]
Registered kernels:
  <no registered kernels>

	 [[type_embed_net/GeluCustom]]

I also tried to train a new potential using the ROCM version but I'm running into this error when using "activation_function": "gelu"

    File "<string>", line 1261, in gelu_custom
Node: 'type_embed_net/GeluCustom'
No OpKernel was registered to support Op 'GeluCustom' used by {{node type_embed_net/GeluCustom}} with these attrs: [T=DT_FLOAT]
Registered devices: [CPU, GPU]
Registered kernels:
  <no registered kernels>

         [[type_embed_net/GeluCustom]]

and this error when using "activation_function": "gelu_tf"

Traceback (most recent call last):
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/tensorflow/python/ops/gradients_util.py", line 611, in _GradientsHelper
    grad_fn = ops.get_gradient_function(op)
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2838, in get_gradient_function
    return gradient_registry.lookup(op_type)
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/tensorflow/python/framework/registry.py", line 95, in lookup
    raise LookupError(
LookupError: gradient registry has no entry for: GeluGrad

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/deepmd_cli/main.py", line 595, in main
    deepmd_main(args)
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 73, in main
    train_dp(**dict_args)
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 165, in train
    _do_work(jdata, run_opt, is_compress)
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 277, in _do_work
    model.build(train_data, stop_batch, origin_type_map=origin_type_map)
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/deepmd/train/trainer.py", line 310, in build
    self._build_training()
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/deepmd/train/trainer.py", line 456, in _build_training
    apply_op = optimizer.minimize(
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/tensorflow/python/training/optimizer.py", line 477, in minimize
    grads_and_vars = self.compute_gradients(
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/tensorflow/python/training/optimizer.py", line 603, in compute_gradients
    grads = gradients.gradients(
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/tensorflow/python/ops/gradients_impl.py", line 165, in gradients
    return gradients_util._GradientsHelper(
  File "/cfs/klemming/projects/snic/ltu-fy/danhed/miniconda3/envs/deepmd-2.2.3/lib/python3.10/site-packages/tensorflow/python/ops/gradients_util.py", line 639, in _GradientsHelper
    raise LookupError(
LookupError: No gradient defined for operation'gradients/filter_type_all/Gelu/Gelu_grad/GeluGrad' (op type: GeluGrad). In general every operation must have an associated `@tf.RegisterGradient` for correct autodiff, which this op is lacking. If you want to pretend this operation is a constant in your program, you may insert `tf.stop_gradient`. This can be useful to silence the error in cases where you know gradients are not needed, e.g. the forward pass of tf.custom_gradient. Please see more details in https://www.tensorflow.org/api_docs/python/tf/custom_gradient.

Any ideas how to solve this?

4 replies

njzjz Sep 6, 2023
Maintainer

Could you commit and push the branch you modified to a forked repository? I'd like to see how it is modified.

Dankomaister Sep 14, 2023

Hi @njzjz

I don´t have a branch with my modifications, but I only modified gelu_multi_device.cc file.
Attached is this file with my modifications.
gelu_multi_device.zip

/Daniel

njzjz Sep 14, 2023
Maintainer

Please see this SO answer https://stackoverflow.com/a/24752005/9567349. Using // in a macro will cause the whole macro to be commented as the multiple lines are merged into one line.

Dankomaister Sep 18, 2023

Great its working now thanks!

AlexisEspinosaGayosso · 2024-01-19T03:05:36Z

AlexisEspinosaGayosso
Jan 19, 2024

Dear @njzjz , I'm facing troubles following the instructions given by @sigbjobo to properly install DeepMD-Kit in our HPE-Cray EX (similar to Lumi). I have not reached the full set of steps, but I have followed these so far:

# rccl already  provided as part of rocm vendor installation

# Tensorflow with rocm:
module load PrgEnv-gnu/8.3.3
module load cray-python/3.9.13.1
module load rocm/5.2.3
module load cmake/3.24.3
module load craype-accel-amd-gfx90a
python3 -m venv deepmd-env # needs to be after loading cray-python
source deepmd-env/bin/activate
LDFLAGS='-L/opt/rocm/lib -lamdhip64' pip install tensorflow-rocm==2.8.*

So far so good. These above does not seem to give any problems.

But problems appear in the DeepMD-Kit installation:

#DeepMD-Kit
git clone --recursive https://github.com/deepmodeling/deepmd-kit.git deepmd-kit 
cd deepmd-kit
sed -i '/gelu_multi_device.cc/d' source/op/CMakeLists.txt
export DP_VARIANT=rocm
export HIP_HIPCC_FLAGS="--amdgpu-target=gfx90a" 
pip install .

A first problem with this is that the compilation of the *.cu files fails. Error messages look like this:

× Building wheel for deepmd-kit (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [92 lines of output]
      *** scikit-build-core 0.5.1 using CMake 3.24.3 (wheel)
      *** Configuring CMake...
      loading initial cache file build/py37-none-manylinux_2_31_x86_64/CMakeInit.txt
      -- Cray Programming Environment 2.7.20 C
      -- Cray Programming Environment 2.7.20 CXX
      -- Supported model version: 1.1
      -- Will not build nv GPU support
      -- The HIP compiler identification is Clang 14.0.0
      -- hip::amdhip64 is SHARED_LIBRARY
      -- hip::amdhip64 is SHARED_LIBRARY
      -- Found ROCM in /opt/rocm, build AMD GPU support
      -- Disabled cpp interface build, looking for tensorflow_framework
      CMake Warning (dev) at cmake/Findtensorflow.cmake:234 (file):
        You have used file(GET_RUNTIME_DEPENDENCIES) in project mode.  This is
        probably not what you intended to do.  Instead, please consider using it in
        an install(CODE) or install(SCRIPT) command.  For example:

          install(CODE [[
            file(GET_RUNTIME_DEPENDENCIES
              # ...
              )
            ]])
      Call Stack (most recent call first):
        CMakeLists.txt:147 (find_package)
      This warning is for project developers.  Use -Wno-dev to suppress it.

      -- Found TensorFlow: /software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-env/lib/python3.9/site-packages/tensorflow/include/;/software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-env/lib/python3.9/site-packages/tensorflow/include/, , /software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-env/lib/python3.9/site-packages/tensorflow/libtensorflow_framework.so.2  in /software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-env/lib/python3.9/site-packages/tensorflow;/software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-env/lib/python3.9/site-packages/tensorflow/../tensorflow_core (found version "2.8.3")
      -- Automatically determined OP_CXX_ABI=0
      -- Set GLIBCXX_USE_CXX_ABI=0
      -- HIP major version is 5
      -- Configuring done
      -- Generating done
      -- Build files have been written to: /software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-kit/build/py37-none-manylinux_2_31_x86_64
      *** Building project with Ninja...
      [1/14] Building HIP object lib/src/gpu/CMakeFiles/deepmd_op_rocm.dir/tabulate.cu.o
      FAILED: lib/src/gpu/CMakeFiles/deepmd_op_rocm.dir/tabulate.cu.o
      /opt/rocm/llvm/bin/clang++ -DCUB_IGNORE_DEPRECATED_CPP_DIALECT -DTENSORFLOW_USE_ROCM -D_GLIBCXX_USE_CXX11_ABI=0 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -Ddeepmd_op_rocm_EXPORTS -I/software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-kit/source/lib/src/gpu/../../include -isystem /opt/rocm-5.2.3/include -isystem /opt/rocm/llvm/lib/clang/14.0.0 -isystem /opt/rocm/include -isystem /opt/rocm/llvm/lib/clang/14.0.0/include/.. -fno-gpu-rdc;--amdgpu-target=gfx90a -O3 -DNDEBUG --offload-arch=gfx803 -fPIC -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -std=gnu++14 -MD -MT lib/src/gpu/CMakeFiles/deepmd_op_rocm.dir/tabulate.cu.o -MF lib/src/gpu/CMakeFiles/deepmd_op_rocm.dir/tabulate.cu.o.d -o lib/src/gpu/CMakeFiles/deepmd_op_rocm.dir/tabulate.cu.o -x hip -c /software/projects/manual/software/pythonEnvironments/cray-python/3.9.13.1/deepmd-kit/source/lib/src/gpu/tabulate.cu
      clang-14: error: no input files
      /bin/sh: --amdgpu-target=gfx90a: command not found

For some reason, somewhere in the process a semicolon (;) is being inserted in between the -fno-gpu-rdc and the --amdgpu-target=gfx90a options. I tried replacing the set command in the source/lib/src/gpu/CMakeLists.txt file with a list(APPEND CMAKE_HIP_FLAGS -fno-gpu-rdc), but the semicolon still appears in the compilation command. So, something is adding that nasty semicolon.

So, the first question here is:

What is adding this nasty semicolon?

Anyway, I separately and manually tried to execute the compilation command (obviously removing manually the nasty semicolon, and the compiler complained anyway about the --amdgpu-target=gfx90a option. Error message is: clang-14: error: unsupported option '--amdgpu-target=gfx90a'. So, anyway, that option can't be used with the intended compiler, so what I did was to unset the environment variable related to this. This solved for two purposes, avoiding the insertion of a nasty semicolon, but also avoiding the use of this unsupported option.

unset HIP_HIPCC_FLAGS

(Anyway, question 1 still applies, even if unsetting HIP_HIPCC_FLAGS bypasses the problem.)

But there still is a second problem in the compilation command, which is the use of --offload-arch=gfx803 while I need to use: --offload-arch=gfx90a.
I was able to change this manually by editing the build/py37-none-manylinux_2_31_x86_64/CMakeCache.txt file and manually updating to CMAKE_HIP_ARCHITECTURES:STRING=gfx90a. Then I removed the compiled object files of the *.cu files and execute pip install . again. This cumbersome manual procedure worked, but I need something more robust.

So, second question here is:
2. What setting/environment variable is needed in order to indicate the use of gfx90a architecture during compilation instead of the gfx803?

Thanks. So far these are the questions. I will try to move forward and will come back if any problems that I can solved appear.

Regards,
Alexis

5 replies

njzjz Jan 19, 2024
Maintainer

It seems CMAKE_HIP_FLAGS should be a string instead of a list, so the correct CMake command should be

set(CMAKE_HIP_FLAGS "-fno-gpu-rdc ${CMAKE_HIP_FLAGS}")

For the architecture, you are correct to set CMAKE_HIP_ARCHITECTURES. See https://scikit-build-core.readthedocs.io/en/latest/configuration.html#configuring-cmake-arguments-and-defines for how to set CMake arguments for scikit-build-core.

njzjz Jan 19, 2024
Maintainer

I've created #3155. Thanks for the report!

sigbjobo Jan 19, 2024
Author

@njzjz, I think something broke from around v2.4. @AlexisEspinosaGayosso, this is what I do nowadays:

git clone -b v2.2.4 https://github.com/deepmodeling/deepmd-kit.git
cd deepmd-kit
sed -i 's/setuptools>=61/setuptools==68.2.2/g' pyproject.toml
sed -i '/gelu_multi_device.cc/d' source/op/CMakeLists.txt
export DP_VARIANT=rocm
export HIP_HIPCC_FLAGS="--amdgpu-target= gfx803" 
pip install .
cd ..

But the root cause should be dealt with in accordance with @njzjz .

AlexisEspinosaGayosso Jan 29, 2024

Thanks a lot @sigbjobo , your latest recommemdation indeed works. Although I'm a bit concern on the chosen setting for the architecture. I think that for Lumi and Pawsey it should be gfx90a. Unfortunately I couldn't find a way to adapt your latest recommendation to the use of gfx90a. Then I came back to the use of the latest deepmd-kit.git commit together with only the following environmental variables:

git clone --recursive https://github.com/deepmodeling/deepmd-kit.git
cd deepmd-kit
sed -i 's/setuptools>=61/setuptools==68.2.2/g' pyproject.toml
sed -i '/gelu_multi_device.cc/d' source/op/CMakeLists.txt
export DP_VARIANT=rocm
unset HIP_HIPCC_FLAGS
export SKBUILD_CMAKE_DEFINE=CMAKE_HIP_ARCHITECTURES:STRING=gfx90a
pip install .

This seems to properly install deepmd-kit taking into account the desired architecture.

Unfortunately, this is the farthest point I have been able to reach. The next package (Horovod) installation is failing for me. It seems that installation process is not being able to find my system MPICH. I'll try to figure out how to fix this and, if not, then post my description of the problem here to see if you guys could help.

sigbjobo Jan 29, 2024
Author

Sorry, I misunderstood; I got confused by your previous post mentioning this architecture.

sigbjobo · 2024-09-18T04:44:55Z

sigbjobo
Sep 18, 2024
Author

Hi, after LUMI was updated, the old installation stopped working. The problem is that the installation points to /opt/rocm as sketched in this issue, which has changed from rocm5 to rocm6. @njzjz, do you know how to ensure the deepmd points to the correct version of ROCM? I am okay with a workaround as here

3 replies

sigbjobo Sep 18, 2024
Author

For future reference:

Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
Traceback (most recent call last):
  File "deepmd-env/lib/python3.9/site-packages/deepmd/env.py", line 406, in get_module
    module = tf.load_op_library(str(module_file))
  File "deepmd-env/lib/python3.9/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /opt/rocm-6.0.3/lib/libamdhip64.so.6: undefined symbol: hsa_amd_memory_async_copy_on_engine, version ROCR_1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "=deepmd-env/bin/dp", line 8, in <module>
    sys.exit(main())
  File "=deepmd-env/lib/python3.9/site-packages/deepmd_cli/main.py", line 603, in main
    from deepmd.entrypoints.main import main as deepmd_main
  File "deepmd-env/lib/python3.9/site-packages/deepmd/__init__.py", line 11, in <module>
    import deepmd.utils.network as network
  File "deepmd-env/lib/python3.9/site-packages/deepmd/utils/__init__.py", line 3, in <module>
    from .data import (
  File "deepmd-env/lib/python3.9/site-packages/deepmd/utils/data.py", line 12, in <module>
    from deepmd.env import (
  File "deepmd-env/lib/python3.9/site-packages/deepmd/env.py", line 492, in <module>
    op_module = get_module("deepmd_op")
  File "deepmd-env/lib/python3.9/site-packages/deepmd/env.py", line 463, in get_module
    raise RuntimeError(error_message) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlow Runtime, thus an error is raised when loading deepmd_op. You need to rebuild deepmd-kit against this TensorFlow runtime.
(deepmd-env) boresigb@uan03:> echo $ROCM_PATH 
/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3```

sigbjobo Sep 18, 2024
Author

I think the problem that we have ROCM_PATH and ROCM_ROOT, if the last one is not set, then it seems that /rocm/opt/ get used. If you agree, @njzjz, I can create a pull request?

njzjz Sep 18, 2024
Maintainer

I am not against having an alias for ROCM_PATH.

if the last one is not set, then it seems that /rocm/opt/ get used.

I think it is generated by hipconfig --rocmpath, which is the way that CMake finds ROCm.

sigbjobo · 2024-09-20T08:41:14Z

sigbjobo
Sep 20, 2024
Author

For anyone wondering, this is working installation for LUMI after updating to rocm6:

#DeePMD python modules

module reset
module load LUMI/24.03 partition/G
module load buildtools
module load cray-python
module unload PrgEnv-cray
module load PrgEnv-gnu
module load rocm
export ROCM_ROOT=$ROCM_PATH



#tensorflow
START_DIR=$PWD
python3 -m venv deepmd-env
source deepmd-env/bin/activate
CXX=g++-12 CC=gcc-12 LD=g++-12 LDFLAGS="-L$ROCM_PATH/lib/ -lamdhip64"  pip install tensorflow-rocm==2.14.* --verbose
pip install numpy==1.24.0
pip install setuptools==68.2.2


# Deepmd-kit fitting
git clone https://github.com/deepmodeling/deepmd-kit.git -b v2.2.9 --depth 1
cd deepmd-kit
sed -i '/gelu_multi_device.cc/d' source/op/CMakeLists.txt

export DP_VARIANT=rocm
export SKBUILD_CMAKE_DEFINE=CMAKE_HIP_ARCHITECTURES:STRING=gfx90a
CXX=g++-12 CC=gcc-12 LD=g++-12 LDFLAGS="-L$ROCM_PATH/lib/ -lamdhip64"  pip install . --verbose
cd ..

#Download LAMMPS
wget https://github.com/lammps/lammps/archive/refs/tags/stable_29Aug2024.tar.gz 
tar -xf stable_29Aug2024.tar.gz


#DeePMD :LAMMPS plugin
cd deepmd-kit/build
mkdir plugins
cd plugins

module reset
module load LUMI/24.03 partition/G
module load buildtools/24.03
module load cray-python
module unload PrgEnv-cray
module load PrgEnv-gnu
module load rocm
export ROCM_ROOT=$ROCM_PATH

source ${START_DIR}/deepmd-env/bin/activate #change with your own deepmd-env

export LAMMPS_ROOT="${START_DIR}/lammps-stable_29Aug2024" #change with the path of your LAMMPS source
export deepmd_root="${START_DIR}/deepmd-kit-install" #change with the path of  where you want deepmd plugins to be installed

cmake -DCMAKE_INSTALL_PREFIX=$deepmd_root -DUSE_ROCM_TOOLKIT=TRUE -DUSE_TF_PYTHON_LIBS=TRUE -DLAMMPS_SOURCE_ROOT=$LAMMPS_ROOT  -DCMAKE_BUILD_TYPE=Release -DHIP_HIPCC_FLAGS="--amdgpu-target=gfx90a" -DDP_VARIANT=rocm -DCMAKE_HIP_ARCHITECTURES=gfx90a ../../source

make -j 8 
make install
cd $START_DIR
PATH=$deepmd_root/bin:$PATH
PATH=$deepmd_root/lib:$PATH

#COMPILE PLUMED set correct path as per your need
wget  https://github.com/plumed/plumed2/archive/refs/tags/v2.9.2.tar.gz
tar -xf v2.9.2.tar.gz
mkdir plumed-install
PLUMED_INSTALL=$(realpath plumed-install)
cd plumed2-2.9.2
./configure --enable-modules=all CXX=CC CC=cc  CXXFLAGS=-O3 --prefix=$PLUMED_INSTALL

make -j 16 #Make gives some error about some json files but it seems it does not affect the main part of the plumed kernel
make install

cd ../plumed-install
mkdir pkgconfig
cd pkgconfig
#CHANGE the prefix= below to the path of your plume install
echo " 

prefix= $PLUMED_INSTALL
exec_prefix=\${prefix}
libdir=\${exec_prefix}/lib
includedir=\${prefix}/include

Name: plumed
Description: plumed
Version: 2.9.2
Libs: -L\${libdir}  
Cflags: -I\${includedir}
" > plumed.pc

cd $START_DIR



#Compile LAMMPS remember to change the LAMMPS install path according to your system
mkdir lammps-install
LAMMPS_INSTALL=$(realpath  lammps-install)
cd  lammps-stable_29Aug2024
mkdir build
cd build

module reset
module load LUMI/24.03 partition/G
module load buildtools/24.03
module load cray-python
module unload PrgEnv-cray
module load PrgEnv-gnu
module load rocm
module load cray-fftw/3.3.10.7
export ROCM_ROOT=$ROCM_PATH
export PKG_CONFIG_PATH="$PLUMED_INSTALL/pkgconfig:$PKG_CONFIG_PATH"

export PATH
cmake -DCMAKE_CXX_COMPILER=g++-12  -D CMAKE_C_COMPILER=gcc-12  -D CMAKE_Fortran_COMPILER=gfortran-12 -D PKG_PLUMED=yes -D PLUMED_MODE=runtime -D DOWNLOAD_PLUMED=NO -D GPU_API=HIP -D HIP_PATH=/opt/rocm-6.0.3/ -D HIP_ARCH=gfx90a  -D PKG_OPENMP=yes   -D PKG_PLUGIN=ON -D PKG_KSPACE=ON -D PKG_EXTRA-FIX=ON -D LAMMPS_INSTALL_RPATH=ON -DBUILD_SHARED_LIBS=yes -D CMAKE_INSTALL_PREFIX=$LAMMPS_INSTALL ../cmake

make -j 8
make install


echo """module reset
module load LUMI/24.03 partition/G
module load buildtools/24.03
module load cray-python
module unload PrgEnv-cray
module load PrgEnv-gnu
module load rocm
module load cray-fftw/3.3.10.7
export ROCM_ROOT=$ROCM_PATH
source $PWD/deepmd-env/bin/activate
export LAMMPS_PLUGIN_PATH=$PWD/deepmd-kit-install/lib/deepmd_lmp:\${LAMMPS_PLUGIN_PATH}
PATH=$PWD/lammps-install/bin/:$PWD/deepmd-kit-install/bin/:\$PATH
export LD_LIBRARY_PATH=$PWD/deepmd-kit-install/lib:\$(python -c 'import tensorflow; print(tensorflow.sysconfig.get_lib())'):\${LD_LIBRARY_PATH}
"""  > env.sh

0 replies

DeePMD + LAMMPS on LUMI #2523

Uh oh!

Uh oh!

sigbjobo May 12, 2023

Replies: 14 comments · 24 replies

Uh oh!

njzjz May 12, 2023 Maintainer

Uh oh!

sigbjobo May 14, 2023 Author

Uh oh!

Uh oh!

njzjz May 14, 2023 Maintainer

Uh oh!

njzjz May 14, 2023 Maintainer

Uh oh!

Uh oh!

sigbjobo May 15, 2023 Author

Uh oh!

njzjz May 15, 2023 Maintainer

Uh oh!

sigbjobo May 15, 2023 Author

Uh oh!

njzjz May 15, 2023 Maintainer

Uh oh!

Uh oh!

sigbjobo May 25, 2023 Author

Tensorflow with ROCM:

DeePMD-kit:

Horovod:

LAMMPS with DeePMD-kit:

Create environment

Tests

Model training

Running Lammps

Uh oh!

njzjz May 25, 2023 Maintainer

Uh oh!

Uh oh!

sigbjobo Jun 27, 2023 Author

Tensorflow with ROCM:

DeePMD-kit:

Horovod:

LAMMPS with DeePMD-kit:

Create environment

Uh oh!

njzjz Jun 27, 2023 Maintainer

Uh oh!

sigbjobo Jun 27, 2023 Author

Uh oh!

Dankomaister Jun 29, 2023

Uh oh!

sigbjobo Jun 29, 2023 Author

Uh oh!

njzjz Jun 29, 2023 Maintainer

Uh oh!

Dankomaister Aug 14, 2023

Uh oh!

Dankomaister Aug 22, 2023

Uh oh!

njzjz Aug 22, 2023 Maintainer

Uh oh!

Dankomaister Aug 22, 2023

Uh oh!

sigbjobo Aug 22, 2023 Author

Uh oh!

Uh oh!

njzjz Aug 22, 2023 Maintainer

Uh oh!

Dankomaister Sep 6, 2023

Uh oh!

njzjz Sep 6, 2023 Maintainer

Uh oh!

Dankomaister Sep 14, 2023

Uh oh!

sigbjobo
May 12, 2023

Replies: 14 comments 24 replies

njzjz
May 12, 2023
Maintainer

sigbjobo
May 14, 2023
Author

njzjz
May 14, 2023
Maintainer

njzjz May 14, 2023
Maintainer

sigbjobo
May 15, 2023
Author

njzjz May 15, 2023
Maintainer

sigbjobo
May 15, 2023
Author

njzjz May 15, 2023
Maintainer

sigbjobo
May 25, 2023
Author

njzjz May 25, 2023
Maintainer

sigbjobo
Jun 27, 2023
Author

njzjz
Jun 27, 2023
Maintainer

sigbjobo Jun 27, 2023
Author

Dankomaister
Jun 29, 2023

sigbjobo Jun 29, 2023
Author

njzjz Jun 29, 2023
Maintainer

Dankomaister
Aug 14, 2023

njzjz Aug 22, 2023
Maintainer

sigbjobo Aug 22, 2023
Author

njzjz Aug 22, 2023
Maintainer

Dankomaister
Sep 6, 2023

njzjz Sep 6, 2023
Maintainer