Scripts and data for the paper titled "Application-specific Machine-Learned Interatomic Potentials: Exploring the Trade-off Between Precision and Computational Cost".
This repository contains a set of Python scripts and Jupyter notebooks for generating, training, and analyzing machine-learned interatomic potentials (MLIPs). The workflow focuses on creating qSNAP potentials from DFT data and evaluating their performance under various conditions.
To run the scripts and notebooks in this repository, you will need the following Python libraries:
```
ase==3.24.0
fitsnap3==3.1.0
jupyter==1.0.0
lammps==2023.08.02
matplotlib==3.10.1
mpi4py
nbformat==5.9.2
numpy==1.25.0
openmpi
pandas==2.0.2
pytables==3.10.1
scipy==1.11.0
setuptools==75.8.2
```
The recommended way to set up the environment is by using the provided `environment.yml` file with Conda:

```bash
conda env create -f environment.yml
```
This ensures that all dependencies, including the parallel version of LAMMPS, are installed correctly. For advanced use cases or alternative installation methods (e.g., with ACE descriptor support), please refer to the official FitSNAP installation guide.
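To quickly confirm that the environment works, a minimal check like the following should run; it assumes only the standard LAMMPS Python interface that ships with the `lammps` package:

```python
# Sanity check: verify that the LAMMPS Python interface can start.
from lammps import lammps

lmp = lammps()                          # create a LAMMPS instance
print("LAMMPS version:", lmp.version())
lmp.close()
```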
- `Be_dataset/`: This directory contains the beryllium (Be) dataset used for training and testing.
  - `Be_structures.h5`: An HDF5 file containing 20,000 atomic configurations generated by the entropy-maximization method.
  - `Be_prec_i.h5`: HDF5 files (`i` from 1 to 6) containing the corresponding energies, forces, and stresses calculated via DFT at six different precision levels using pyiron. More details about the precision levels can be found in the paper.

This dataset was generated at Los Alamos National Laboratory (LANL) and approved for public release under number LA-UR-25-26282.
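For a first look at the dataset, a minimal sketch such as the one below lists the contents of one of the HDF5 files with PyTables, without assuming anything about its internal layout:

```python
# List every node in the structures file; the layout itself is not assumed.
import tables

with tables.open_file("Be_dataset/Be_structures.h5", mode="r") as f:
    for node in f.walk_nodes("/"):
        print(node._v_pathname)
```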
The experiments are designed to be run from specific subdirectories within the `data/` directory. Each subdirectory contains a `run.sh` script that executes the main Python scripts with the correct arguments. All results shown in the `data/` directory are for the `2Jmax=6` case; you can generate the same data for other `2Jmax` values by changing the arguments passed to the Python scripts in the `run.sh` file.
General Workflow:

- Navigate to the relevant subdirectory inside `data/` (e.g., `cd data/2D_pareto/`).
- Execute the shell script: `sh run.sh`. The script will call the main Python script from the parent directory.
- Output files (e.g., `.npy` matrices or `.csv` results) will be generated and saved within that same subdirectory.
- Jupyter notebooks can then be used to plot the figures shown in the paper from the generated results (a minimal loading sketch follows this list).
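As a sketch of the last step, the generated `results.csv` files can be loaded with pandas inside a notebook; the column names depend on the specific `get_*_data.py` script that produced the file:

```python
# Load the results of one experiment for plotting.
import pandas as pd

df = pd.read_csv("data/2D_pareto/results.csv")
print(df.head())
```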
The main scripts are located in the root directory. They are executed from the subdirectories within `data/`, which is also where their outputs are saved.
- `data/numpy_matrices_for_fitting/`
  - Corresponding Script: `get_matrices_for_fitting.py`
  - Purpose: This directory is for generating the descriptor matrices needed for model fitting. `run.sh` executes `get_matrices_for_fitting.py`.
  - Output: Saves descriptor matrices and target vectors as `.npy` files within this directory.
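For orientation, these matrices feed a linear model: each row of the descriptor matrix corresponds to an energy or force observation, and the qSNAP coefficients come from a least-squares fit. A minimal sketch with stand-in data (`np.linalg.lstsq` here is illustrative, not necessarily the solver used by the scripts):

```python
# Stand-in data only; the real matrices are the .npy files in this directory.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 31))              # stand-in descriptor matrix
b = rng.normal(size=100)                    # stand-in DFT targets
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
print(coeffs.shape)                         # (31,)
```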
- `data/leverage_scores_dataframe/`
  - Corresponding Script: `get_leverage_scores_data.py`
  - Purpose: This directory is for calculating leverage scores from the previously generated descriptor matrices. `run.sh` executes `get_leverage_scores_data.py`.
  - Input: Reads the `.npy` matrices from `data/numpy_matrices_for_fitting/`.
  - Output: Saves the leverage-score data as a `df_leevrage.csv` file within this directory.
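For reference, statistical leverage scores are the diagonal entries of the hat matrix H = A(AᵀA)⁻¹Aᵀ. A numerically stable sketch (not the repository's implementation) computes them from a thin QR factorization:

```python
# Leverage score of row i: h_i = [A (A^T A)^{-1} A^T]_ii = ||row i of Q||^2.
import numpy as np

def leverage_scores(A):
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q**2, axis=1)
```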
- `data/sampling_methods_comparison/N`, where `N` is the number of configurations to sample from the total dataset of 20,000 configurations
  - Corresponding Script: `get_sampling_performance_data.py`
  - Purpose: To generate performance data by training models on subsets of data selected via leverage, block-leverage, and random sampling. `run.sh` executes `get_sampling_performance_data.py`.
  - Input: Reads leverage scores from `data/leverage_scores_dataframe/`.
  - Output: Saves the performance results in a `results.csv` file inside this directory. This data is visualized using the `plot_sampling_performance.ipynb` notebook to create Figure 6.
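A sketch of the two simpler sampling strategies being compared; block leverage, which aggregates the scores of all rows belonging to one configuration, is omitted for brevity, and `leverage_scores` refers to the sketch above:

```python
# Draw N indices either proportionally to leverage scores or uniformly.
import numpy as np

def sample_indices(scores, N, rng, method="leverage"):
    if method == "leverage":
        p = scores / scores.sum()       # sample proportionally to leverage
        return rng.choice(len(scores), size=N, replace=False, p=p)
    return rng.choice(len(scores), size=N, replace=False)   # uniform random
```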
- `data/2D_pareto/`
  - Corresponding Script: `get_2D_pareto_data.py`
  - Purpose: To generate data for the 2D Pareto front by fitting models with different energy/force weights. `run.sh` executes `get_2D_pareto_data.py`.
  - Input: Reads descriptor matrices from `data/numpy_matrices_for_fitting/`.
  - Output: Saves the energy/force RMSE results in a `results.csv` file here. This data is used by `plot_2D_pareto.ipynb` to create Figures 4 and 5.
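Conceptually, each point on the front comes from one weighted least-squares fit; scanning the energy/force weight ratio and recording both RMSEs traces out the Pareto front. A minimal sketch (not the repository's implementation):

```python
# A_e/A_f: energy and force rows of the descriptor matrix; b_e/b_f: targets.
import numpy as np

def weighted_fit(A_e, b_e, A_f, b_f, w_e, w_f):
    A = np.vstack([w_e * A_e, w_f * A_f])
    b = np.concatenate([w_e * b_e, w_f * b_f])
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs
```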
- `data/3D_pareto/`
  - Corresponding Script: `get_3D_pareto_data.py`
  - Purpose: To perform a large-scale factorial experiment by training and testing MLIPs at varying subset sizes, energy/force weights, model complexities, and DFT precisions. `run.sh` executes `get_3D_pareto_data.py`.
  - Input: Reads all necessary matrices and leverage scores from the other `data/` subdirectories.
  - Output: Saves the comprehensive results in a `results.csv` file in this directory. The `plot_3D_pareto.ipynb` notebook uses this data for the Pareto analysis in Figures 7 and 8 of the main paper and Figure 7 of the SI.
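The sweep itself is a nested loop over the four axes; the grids below are hypothetical and only illustrate the structure (the actual values are set by the arguments in `run.sh`):

```python
# Hypothetical parameter grids illustrating the four factorial axes.
import itertools

subset_sizes = [500, 1000, 2000]        # training-set sizes (hypothetical)
force_weights = [0.1, 1.0, 10.0]        # energy/force weights (hypothetical)
two_jmax_values = [4, 6, 8]             # model complexities (2Jmax)
precision_levels = range(1, 7)          # DFT precision levels 1-6

for n, w, jmax, prec in itertools.product(
        subset_sizes, force_weights, two_jmax_values, precision_levels):
    ...  # train and evaluate one MLIP for this combination
```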
- `helper_functions.py`: Contains various utility functions (e.g., for data loading and error calculation) that are imported and used by the main scripts and notebooks.
- `get_matrices_for_fitting.py`: Generates descriptor matrices and corresponding energy/force vectors from the reference dataset. It takes `2Jmax` and the precision level as arguments.
- `perform_one_fit.py`: Handles the training of a single qSNAP model. It reads the matrices generated by `get_matrices_for_fitting.py`, sets the energy/force weights, performs linear regression, and prints training and testing RMSE values for energies and forces (a minimal RMSE sketch follows this list).
- `get_leverage_scores_data.py`: Calculates leverage scores based on a descriptor matrix.
- `get_sampling_performance_data.py`: Performs subsampling and fitting to compare random, leverage, and block-leverage sampling methods.
- `get_2D_pareto_data.py`: Runs a series of fits with different energy/force weights to generate data for 2D Pareto plots.
- `get_3D_pareto_data.py`: Trains MLIPs in a full factorial design to generate data for the multi-objective Pareto analysis.
- Jupyter notebooks (`plot_*.ipynb`): These are used for visualizing the data generated by the scripts and for creating the figures presented in the paper.
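As referenced above, the reported RMSE is the standard root-mean-square error; a minimal sketch (the repository's own error-calculation utilities live in `helper_functions.py`):

```python
# Root-mean-square error between predicted and reference values.
import numpy as np

def rmse(predicted, reference):
    diff = np.asarray(predicted) - np.asarray(reference)
    return np.sqrt(np.mean(diff**2))
```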