v3.0.0 #4409
Announced by njzjz in Announcements
- Really exciting, @njzjz! Very curious: are there any performance differences between the TensorFlow, PyTorch, and JAX backends? I am thinking of getting this version up and running on LUMI, an AMD-based machine.
- Is the offline package with CUDA 11 no longer provided?
DeePMD-kit v3: Multiple-backend Framework, DPA-2 Large Atomic Model, and Plugin Mechanisms
After eight months of public testing, we are excited to present the first stable release of DeePMD-kit v3, which enables deep potential models with TensorFlow, PyTorch, or JAX backends. Additionally, DeePMD-kit v3 introduces support for the DPA-2 model, a novel architecture optimized for large atomic models. This release also enhances the plugin mechanisms, making it easier to integrate and develop new models.
Highlights
Multiple-backend framework: TensorFlow, PyTorch, and JAX support
DeePMD-kit v3 adds a versatile, pluggable framework that provides a consistent training and inference experience across multiple backends. Version 3.0.0 includes the TensorFlow, PyTorch, and JAX backends.
Critical features of the multiple-backend framework include the ability to:
- convert models between backends using dp convert-backend, with backend-specific file extensions (e.g., .pb for TensorFlow and .pth for PyTorch), as sketched below;
- run inference across backends via dp test, the Python/C++/C interfaces, or third-party packages (e.g., dpdata, ASE, LAMMPS, AMBER, GROMACS, i-PI, CP2K, OpenMM, ABACUS, etc.).
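For example, a model trained with the TensorFlow backend can be converted to the PyTorch format and then evaluated there. A minimal sketch (the model filenames and data path are placeholders):

# convert a TensorFlow model file to the PyTorch format
dp convert-backend frozen_model.pb frozen_model.pth
# evaluate the converted model on a dataset with the PyTorch backend
dp --pt test -m frozen_model.pth -s path/to/data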
DPA-2 model: a large atomic model as a multi-task learner
The DPA-2 model offers a robust architecture for large atomic models (LAM), accurately representing diverse chemical systems for high-quality simulations. In this release, DPA-2 can be trained using the PyTorch backend, supporting both single-task (see examples/water/dpa2) and multi-task (see examples/water_multi_task/pytorch_example) training schemes. DPA-2 is also available for Python/C++ inference in the JAX backend.
The DPA-2 descriptor comprises two components, repinit and repformer.
The PyTorch backend supports training strategies for large atomic models, including:
- multi-task training, with an example configuration in examples/water_multi_task/pytorch_example/input_torch.json;
- fine-tuning from a pre-trained model via the --finetune argument of the dp --pt train command line (see the sketch after this list).
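A minimal sketch of both schemes (the fine-tuning input script and the pre-trained model filename are placeholders):

# multi-task training with the bundled example configuration
dp --pt train examples/water_multi_task/pytorch_example/input_torch.json
# fine-tune from a pre-trained model
dp --pt train input.json --finetune pretrained_model.pt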
Plugin mechanisms for external models
In version 3.0.0, plugin capabilities have been implemented to support the development and integration of potential energy models using the TensorFlow, PyTorch, or JAX backends, leveraging DeePMD-kit's trainer, loss functions, and interfaces. One plugin example is deepmd-gnn, which supports training the MACE and NequIP models within DeePMD-kit using the familiar commands:

dp --pt train mace.json
dp --pt freeze
dp --pt test -m frozen_model.pth -s ../data/
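A model trained through a plugin can then be evaluated like any other DeePMD-kit model, for example through the Python interface. A minimal sketch, assuming deepmd-gnn is installed, frozen_model.pth comes from the commands above, and the two-species toy system is made up for illustration:

import numpy as np
from deepmd.infer import DeepPot

dp = DeepPot("frozen_model.pth")  # backend is inferred from the file extension
coords = np.random.rand(1, 6 * 3)       # 1 frame, 6 atoms, flattened xyz
cells = 10.0 * np.eye(3).reshape(1, 9)  # 10 Å cubic box
atom_types = [0, 0, 0, 1, 1, 1]
e, f, v = dp.eval(coords, cells, atom_types)  # energy, forces, virial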
Other new features
- New training options: max_ckpt_keep (#3441), change_bias_after_training (#3993), and stat_file.
- New command-line interfaces: dp change-bias (#3993) and dp show (#3796).

Breaking changes
- TensorFlow model files use the .pb extension.
- The set_prefix key is deprecated. (#3753)
- Training and dp test now use all sets; in previous versions, only the last set was used as the test set in dp test. (#3862)
- The deepmd module was moved to deepmd.tf without other API changes, and deepmd_utils was moved to deepmd without other API changes. (#3177, #3178)
- DeepTensor (including DeepDipole and DeepPolar) now returns the atomic tensor in the dimension of natoms instead of nsel_atoms (#3390); a short Python sketch of this and the module move follows below.

For other changes, refer to the full changelog: v2.2.11...v3.0.0
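A short Python sketch of the last two items (the model filename is a placeholder, and the v2-style top-level import of DeepDipole is assumed to remain available):

import numpy as np

# TensorFlow-specific code that used to import from deepmd now imports from deepmd.tf;
# backend-independent classes such as DeepDipole live under deepmd (formerly deepmd_utils).
from deepmd.infer import DeepDipole

dd = DeepDipole("dipole_model.pb")
coords = np.random.rand(1, 6 * 3)  # 1 frame, 6 atoms
atom_types = [0, 0, 0, 1, 1, 1]
dipole = dd.eval(coords, None, atom_types)  # None: no periodic cell

# v3 returns the atomic tensor with shape (nframes, natoms, 3);
# v2 returned (nframes, nsel_atoms, 3), covering only the selected atoms.
print(dipole.shape)  # (1, 6, 3)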
Contributors
The PyTorch backend was developed in the dptech-corp/deepmd-pytorch repository and then fully merged into the deepmd-kit repository in #3180. Contributions to the deepmd-kit repository include the following merged pull requests:
preprocess_shared_params when using non-zero share level #3615 feat(pt): Add command to check the available model branches in multi-task pre-trained model (Issue #3742) #3796 docs: add document equations for se_atten_v2 #3828 Feat: add se_atten_v2 to PyTorch and DP #3840 feat pt: Support property fitting #3867 Fix(load_library): When ENABLE_CUSTOMIZED_OP = False, change op_info = None to op_info = {} #3912 fix: bugs in uts for property fit #4120 fix(pt): finetuning property/dipole/polar/dos fitting with multi-dimensional data causes error #4145 fix(dptest): Wrong dptest results except for energy head #4280 use_aparam_as_mask for pt backend #4246 feat(pt): support loss plugin for external package #4248 change_energy_bias and fix finetune #3378 pt: Add support for dipole and polar training #3380 Doc: Update PT Multi-task #3387 pt: fix params with no docstrs #3388 pt: fix multitask print_summary #3409 pt: fix multitask stuck on multiple-gpu #3411 Add max_ckpt_keep for trainer #3441 pt: cleanup tester #3442 pt: add explicit decay_rate for lr #3445 pt: add index input for use_spin #3456 pt: support multitask finetune #3480 pt: refactor loss #3569 pt: fix loss training when no data available #3571 pt: support multitask dp test #3573 pt: fix typo in multitask finetune #3607 Fix fine-tuning entries bug when doing restart. #3616 pt: use unified activation #3619 refact: the DPA1 descriptor #3696 Fix typo in smooth_type_embdding #3698 feat: Support stripped type embedding in DPA1 of PT/DP #3712 docs: add doc for multitask fine-tuning #3717 bug: fix spin nlist in spin_model #3718 bug: fix numerical diff in DPA1 dotr between DP/PT #3725 pt: remove old impl of DescrptBlockHybrid #3746 bug: fix no raise RuntimeError #3748 refact: the DPA2 descriptor #3758 breaking: remove multi-task support in tf #3763 breaking: seperate params in dpa2 #3768 feat: support seed for pt/dp models #3773 fix: lcurve header wrong when no validation data #3774 feat(pt): support disp_training and time_training in pt #3775 feat(pt/tf/dp): support econf type embedding #3781 feat(pt): support complete form energy loss #3782 chore(pt): lower the atol for dpa2 test #3785 feat(pt): consistent fine-tuning with init-model #3803 feat(dp/pt): refactor se_e3 descriptor #3813 fix: lower the atol for DPA2 corner case #3814 fix: DPA1 should use masked gate when excluded_types #3815 fix(pt): build nlist faster with torch.amax #3826 fix: bugs in uts for polar and dipole fit #3837 fix: correct exclude_types in descriptors #3841 fix: rm unused import of __version__ #3842 fix: use get_model_def_script method #3843 test(pt/dp): add universal uts for all models #3873 fix(UT): rm extra tearDown in test_training.py #3906 feat(pt): support fine-tuning from random fitting #3914 fix(pt): fix seed in dpmodel fitting #3916 feat(pt): support multitask argcheck #3925 feat(pt/tf): init-(frz)-model use pretrain script #3926 fix(pt): info typo in log #3927 feat(pt/tf): add bias changing param/interface #3933 fix(pt): fix global bias stat with different natom #3944 feat(pt): add datafile option for change-bias #3945 fix(dp): fix dp seed in dpa2 descriptor #3957 breaking(pt/tf/dp): disable bias in type embedding #3958 fix(pt): add finetune_head to argcheck #3967 feat(tf): improve the activation setting in tebd #3971 fix(pt/tf/dp): normalize the econf #3976 fix(pt): make 'find_' to be float in get data #3992 fix(pt): fix lammps nlist sort with large sel #3993 fix(pt): optimize graph memory usage #4006 fix(pt): fix get_dim for DescrptDPA1Compat #4007 fix(pt): use user seed in DpLoaderSet #4015 feat(pt/dp): support three-body type embedding #4066 breaking(pt/dp): tune new sub-structures for DPA2 #4089 docs(pt): examples for new dpa2 model #4138 fix(pt/dp): share params of repinit_three_body #4139 fix(pt): make state_dict safe for weights_only #4148 fix(pt ut): make separated uts deterministic #4162 fix(pt): make int rcut safe after jit op #4222 Chore(pt): rm old pt implementation #4223 feat(pt): support CPU parallel training with PT #4224 Chore(pt): refactor the command function interface #4225 fix(tf): fix compress suffix in DescrptDPA1Compat #4243 Chore(pt): slim uts for dpa1 #4244 feat(tf/pt): add/refact lammps & C++ support for spin model #4321 fix(dp/pt): support auto sel for dpa2 #4323 fix(pt/dp): make dpa2 convertable to .dp format #4324 fix(pt): fix precision #4344 fix(pt): fix not used sys_probs #4353 feat(pt): add universal test for loss #4354 docs: set precision explicitly in the DPA-2 example #4372 doc: update spin lmp doc and example #4375 breaking(pt/dp): change env_protection for spin #4394 Chore(doc): merge multitask training doc #4395 [BUG] seed is unsafe in TF parallel training #4440 deepmd_utils
#3173 add dpdata driver #3174 add more abstractmethods to universal DeepPot #3175 cc: reimplement read_file_to_string without calling TensorFlow #3176 breaking: move deepmd to deepmd.tf #3177 breaking: move deepmd_utils to deepmd #3178 docs: rewrite README; deprecate manually written TOC #3179 cc: fix returning type of sel_types #3181 breaking: drop Python 3.7 support #3185 Fix PT DeepPot and replace ASE calculator #3186 Merge TF and PT CLI #3187 PT: keep the same checkpoint behavior as TF #3191 docs: document PyTorch backend #3193 drop tqdm #3194 add size and replace arguments to deepmd.utils.random.choice #3195 reorganize tests directory #3196 fix: install CU11 PyTorch in the CU11 docker image #3198 allow disabling TensorFlow backend during Python installation #3200 throw errors when PyTorch CXX11 ABI is different from TensorFlow #3201 pt: add tensorboard and profiler support #3204 pt: set nthreads from env #3205 build macos-arm64 wheel on M1 runners #3206 fix GPU test OOM problem #3207 refactor DeepEval #3213 fix compile gromacs with precompiled C library #3217 pt: apply global user set precision to pt #3220 gmx: fix include directive #3221 pt: apply global logger to pt #3222 c: fix all memory leaks; add sanitizer checks #3223 pt: rename atomic_virial to atom_virial in the model output #3226 add category property to OutputVariableDef #3228 fix DP_ENABLE_TENSORFLOW support #3229 c: change the required shape of electric field to nloc * 3 #3237 pin docker actions to major versions #3238 drop deepmd.tf.cluster.slurm #3239 refactor print summary #3243 issue template: change TF version to backend version #3244 add get_type_map method to model; export model methods #3247 backend-indepedent dp test #3249 pt: infer model type from ModelOutputDef #3250 pt: support loading frozen models in DeepEval #3253 tf: support checkpoint path (instead of directory) in dp freeze #3254 ipi: remove normalize_coord #3257 pt: add exported methods to BaseAtomicModel #3258 support TF se_e2_a serialization; add a common test fixture to compare TF, PT, and DP models #3263 enable docstring code format #3267 add neighbor stat support with NumPy and PyTorch implementation #3271 tf: refactor neighbor stat #3275 pt: fix torchscript converage #3276 improve gh actions #3283 speed up cuda test #3284 pt: refactor data stat #3285 consistent energy fitting #3286 fix gh actions issues #3288 gh actions: fix branches ignore pattern & fix activity types #3290 dp&pt: let DPAtomicModel fetch attributes from Fitting #3292 pt: process frames in parallel for env mat stat #3293 pluggable backend #3294 pt: avoid set_default_dtype in tests #3303 fix neighbor stat mixed_types input #3304 consistent energy model #3306 pt: explicitly set device #3307 allow both absulute and relative tolerance when testing consistency #3308 merge compute_output_stat #3310 tf: add fparam/aparam support for finetune #3313 pt: fix se_e2_a precision cast #3315 merge common subcommands in cli #3316 pt: export model_output_type instead of model_output_def #3318 feat: convert model files between backends #3323 store type in descriptor serialization data #3325 fix AlmaLinux GPG key error #3326 pt: remove env.DEVICE in all forward functions #3330 store type in fitting serialization data #3331 feat: add NumPy DeepPot #3332 docs: install pytorch in RTD #3333 add BaseModel; store type in serialization #3335 feat(pt/dpmodel): support type_one_side in se_e2_a #3339 pt: apply argcheck to pt #3342 bump python to 3.12 in the test environment #3343 apply PluginVariant and make_plugin_registry to classes #3346 feat: update sel by statistics #3348 add @version to serialization data #3349 pt: support --init-frz-model #3350 feat(pt): support fparam/aparam in DeepEval #3356 feat(pt): support fparam/aparam in C++ DeepPot #3358 docs: dpmodel, model conversion #3360 pt: fix se_a type_one_side performance degradation #3361 docs: apply type_one_side=True to se_a and se_r #3364 Hybrid descriptor #3365 fix se_r consistency #3366 bump scikit-build-core to 0.8 #3369 feat: atom_ener in energy fitting #3370 docs: DPRc for PT, DPModel #3373 sync descriptor alias #3374 pt: supprot --output in dp train #3377 tf: remove freeze warning for optional nodes #3381 fix: prevent deepmd.tf be imported globally #3382 pt: print data summary #3383 pt: expand systems before training #3384 pt: add fparam/aparam data requirements #3386 breaking: change DeepTensor output dim from nsel_atoms to natoms #3390 throw errros if rmin is no less than rmax #3393 allow loading either nsel or natoms atomic tensor data #3394 pt: ban torch.testing.assert_allclose #3395 do not return g2, h2, sw in hybrid descriptors #3396 format training logging #3397 bump LAMMPS to stable_2Aug2023_update3 #3399 fix github actions for release #3402 fix deepmd-kit-cu11 again #3403 set NUM_WORKERS to 0 in test_cuda action #3404 pt: Fix compilation with libtorch #3405 ban print #3415 convert exclude_types to sel_type #3418 revert test Python to 3.11 #3419 pt: avoid torch.tensor(constant) during forward #3421 pt: make get_data non-blocking #3422 pt: fix print_on_training when there is no validation data #3423 pt: avoid D2H in se_e2_a #3424 pt: improve nlist performance #3425 Consistent activation functions between backends #3431 fix errors when dp is executed without any subcommands #3437 refactor: split Model and AtomicModel #3438 pt: make jit happy with torch 2.0.0 #3443 fix: do not install tf-keras for cu11 #3444 fix: remove model_def_script from AtomicModel #3449 feat(pt): consistent "frozen" model #3450 chore: remove unused init_fitting_stat #3453 ci: reduce ASLR entropy #3461 docs: add deprecation notice for the official conda channel and more conda docs #3462 fix(tf): fix DeepEval degradation for virtual types #3464 fix(pt): Fix PairTabAtomicModel OOM error #3484 Clean TODOs and convert them into issues #3519 fix(pt): fix a typo in DeepEval to check do_atomic_virial #3570 test: add LAMMPS MPI tests #3572 ci: add linter for markdown, yaml, CSS #3574 fix: move DeepPotential from deepmd.tf.infer to deepmd.infer #3580 fix(tf): fix bugs in tensor training and migrate to reformat data #3581 tf: add explict mixed_types argument to fittings #3583 chore: remove incorrect memset TODOs #3600 feat: Support bfloat16 and ensure valid precision and activation functions consistent everywhere #3601 feat: Add USE_PT_PYTHON_LIBS CMake variable #3605 pin nvidia-cudnn-cu{11,12} to <9 #3610 feat: consistent type embedding #3617 chore(build): move static part of dynamic metadata to pyproject.toml #3618 feat(pt): add op library #3620 chore: move source/op to source/op/tf #3621 fix: fix type hint of sel #3624 feat: apply descriptor exclude_types to env mat stat #3625 fix: fix DPOSPath.save_numpy, DPH5Path.is_file, DPH5Path.is_dir #3631 fix(tf): make se_atten_v2 masking smooth when davg is not zero #3632 feat(pt): allow using directories to store stat #3633 fix: set rpath for libtorch and protobuf #3636 fix(tf): apply exclude types to se_atten_v2 switch #3651 fix: fix git version detection in docker_package_c.sh #3658 fix(pt): fix model_def_script #3671 CI: Accerate GitHub Actions using uv #3676 fix(tf): fix float32 for exclude_types in se_atten_v2 #3682 docs: setup uv for readthedocs #3685 docs: fix pdf build due to svg #3686 ci: format bibtex #3687 docs: fix convert svg error on RTD #3688 docs: fix RTD timeout issue #3694 ci(build): use uv for cibuildwheel #3695 chore(dpmodel): move save_dp_model and load_dp_model to a seperated module #3701 build(tf): remove keras from dependencies #3709 test(hybrid): add ut for descriptor hybrid #3711 build(deps): bump tar from 6.1.14 to 6.2.1 in /source/nodejs #3714 tests: move init_models to setUpModule #3715 ci(test): split each pytest job into 6 separated jobs #3716 build: unpin tensorflow version on windows #3721 feat(C): add preprocessor define for C API version #3737 breaking: deprecate set_prefix #3753 test(pt): add common test case for model/atomic model #3767 ci: speed up Python test #3776 tests: skip attention-related parameterize when attn_layer is 0 #3784 test(python): enable building PT OP #3787 chore: improve type anotations in deepmd.infer #3792 style: enable W rules #3793 feat(tf): pass rcut to PairTab #3794 refactor: remove global data_requirements #3798 fix: remove --input_script from dp test #3800 lmp: improve error message when compute/fix is not found #3801 docs: update DPA-1 reference #3810 chore: cleanup out-of-date TODOs #3811 chore: rename j_must_have to j_deprecated and only warn about deprecated keys #3816 ci: fix test-python test_durations and its caches #3820 refactor: refactor update_sel and save min_nbor_dist #3829 docs: improve se_atten documentation #3832 fix: fix DeepGlobalPolar and DeepWFC initlization #3834 fix: fix ipi package #3835 fix(pt): improve out-of-memory handling #3836 fix(pt): loose tolerance for TransTest #3838 chore: remove type embedding TODO from se_r serialize #3845 ci: bump ase to 3.23.0 #3846 feat: support generating JSON schema for integration with VSCode #3849 feat: add has_message_passing API #3851 fix(tf): fix modifier_type in DeepEval #3855 fix(test): make unit tests deterministic #3856 fix(pt): improve out-of-memory capture #3857 fix(tf): throw RuntimeError for se_a + type_embedding #3861 breaking: use all sets for training and test #3862 fix(ci): pin uv to 0.2.10 #3870 docs: fix footnote #3872 Revert "fix(ci): pin uv to 0.2.10" #3874 docs: improve multi-backend documentation #3875 fix: add mendeleev to dependencies; remove dpdata; remove catching ImportError #3878 feat: add seeds to dpmodel and fix seeds in tf & pt #3880 chore: replace reduciable with reducible #3888 chore(ci): workaround to retry error decoding response body from uv #3889 fix(cmake): fix PyTorch_LIBRARY_PATH #3890 feat(pt): allow PT OP CXXABI different from TF #3891 fix(cmake): fix OP_CXX_ABI_PT on macos/windows #3893 ci(wheel): build PT OPs #3894 feat(pt): add more information to summary and error message of loading library #3895 docs: cleanup out-of-date doc_only_tf_supported in arguments #3896 feat(pt): support training/profiling argument in PT #3897 fix(cc,pt): translate PT exceptions to the DeePMD-kit exception #3918 docs: developer docs for the universal unit tests #3921 feat: support array API #3922 fix(tf): prevent fitting_attr variable scope from becoming fitting_attr_1 #3930 style: enable B904 #3956 ci(deps): bump uv to 0.2.24 #3964 feat: add plugin entry point for PT #3965 fix(cmake): fix USE_PT_PYTHON_LIBS #3972 fix(cmake): set C++ standard according to the PyTorch version #3973 fix(cmake): fix set_if_higher #3977 fix(pt): ensure suffix of --init_model and --restart is .pt #3980 docs: document PYTORCH_ROOT #3981 docs: Disallow improper capitalization #3982 fix(pt): do not overwrite disp_file when restarting training #3985 fix(cc): compile select_map<int> when TensorFlow backend is off #3987 feat: add documentation and options for multi-task arguments #3989 feat: allow model arguments to be registered outside #3995 fix(cc): add atomic argument to DeepPotBase::computew #3996 style: require explicit device and dtype #4001 feat: add get_model classmethod to BaseModel #4002 fix: fix errors for zero atom inputs #4005 ci: pin PT to 2.3.1 when using CUDA #4009 fix(c): call C++ interface without atomic properties when they are not requested #4010 fix(lmp): call model deviation interface without atomic properties when they are not requested #4012 fix(cc): fix message passing when nloc is 0 #4021 style: enable N804 and N805 #4024 feat: plain text model format #4025 fix: fix nopbc in dpdata driver #4027 fix: manage testing models in a standard way #4028 fix: fix LAMMPS MPI tests with mpi4py 4.0.0 #4032 chore(deps): bump scikit-build-core to 0.9.x #4038 fix: fix PT AutoBatchSize OOM bug and merge execute_all into base #4047 feat: make dp neighbor-stat --type-map optional #4049 ci: test Python 3.12 #4059 fix: replace datetime.datetime.utcnow which is deprecated #4067 breaking: drop C++ 11 #4068 docs: improve docs for environment variables #4070 docs: dynamically generate command outputs #4071 feat: load customized OP library in the C++ interface #4073 docs: improve error message for inconsistent type maps #4074 docs: add multiple packages to intersphinx_mapping #4075 docs: document CMake variables using Sphinx styles #4079 docs: update ipi installation command #4081 docs: fix the default value of DP_ENABLE_PYTORCH #4083 fix: bump LAMMPS to stable_29Aug2024 #4088 ci: add include-hidden-files to actions/upload-artifact #4095 fix(pt): fix ValueError when array byte order is not native #4100 fix(pt): convert torch.__version__ to str when serializing #4106 #4110 #4111 #4113 #4131 #4134 #4136 #4144 #4146 #4147 #4152 #4153 #4155 #4156 #4160 #4172 #4176 #4178 chore: bump LAMMPS to stable_29Aug2024_update1 #4179 #4180 breaking: drop Python 3.8 support #4185 #4187 #4190 #4196 #4199 #4200 #4204 #4212 #4213 #4214 #4217 #4218 #4219 #4220 #4221 #4226 #4228 #4230 #4236 #4238 #4239 #4240 #4242 #4247 #4251 #4252 #4254 #4256 #4257 #4258 #4259 #4260 #4261 #4263 #4264 #4269 #4271 #4274 #4275 #4278 #4284 #4285 #4286 #4287 #4288 #4289 #4290 #4293 #4294 #4301 #4304 #4307 #4309 #4313 #4315 #4318 #4319 #4320 #4325 #4326 #4327 #4329 #4330 #4331 #4336 #4338 #4341 #4342 #4343 #4345 #4350 #4351 #4352 #4355 #4356 #4357 #4363 #4365 #4369 #4377 #4383 #4384 docs: clean up deprecated deepmodeling conda channel docs #4385 #4386 #4387 #4388 #4390 #4391 #4392 #4402 #4403 #4404 #4405 #4406

We also thank everyone who tested and reported bugs in the past eight months.
This discussion was created from the release v3.0.0.