Scripts and data for the paper titled "Application-specific Machine-Learned Interatomic Potentials: Exploring the Trade-off Between Precision and Computational Cost".
This repository contains a set of Python scripts and Jupyter notebooks for generating, training, and analyzing machine-learned interatomic potentials (MLIPs). The workflow focuses on creating qSNAP potentials from DFT data and evaluating their performance under various conditions.
To run the scripts and notebooks in this repository, you will need the following Python libraries:
```
ase==3.24.0
fitsnap3==3.1.0
jupyter==1.0.0
lammps==2023.08.02
matplotlib==3.10.1
mpi4py
nbformat==5.9.2
numpy==1.25.0
openmpi
pandas==2.0.2
pytables==3.10.1
scipy==1.11.0
setuptools==75.8.2
```
The recommended way to set up the environment is by using the provided `environment.yml` file with Conda:

```bash
conda env create -f environment.yml
```
This ensures that all dependencies, including the parallel version of LAMMPS, are installed correctly. For advanced use cases or alternative installation methods (e.g., with ACE descriptor support), please refer to the official FitSNAP installation guide.
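To quickly confirm that the environment works, a minimal check like the following should run; it assumes only the standard LAMMPS Python interface that ships with the `lammps` package:

```python
# Sanity check: verify that the LAMMPS Python interface can start.
from lammps import lammps

lmp = lammps()                          # create a LAMMPS instance
print("LAMMPS version:", lmp.version())
lmp.close()
```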
- `Be_dataset/`: This directory contains the beryllium (Be) dataset used for training and testing.
  - `Be_structures.h5`: An HDF5 file containing 20,000 atomic configurations generated by the entropy-maximization method.
  - `Be_prec_i.h5`: HDF5 files (`i` from 1 to 6) containing the corresponding energies, forces, and stresses calculated via DFT at six different precision levels using pyiron. More details about the precision levels can be found in the paper.

This dataset was generated at Los Alamos National Laboratory (LANL) and approved for public release under number LA-UR-25-26282.
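For a first look at the dataset, a minimal sketch such as the one below lists the contents of one of the HDF5 files with PyTables, without assuming anything about its internal layout:

```python
# List every node in the structures file; the layout itself is not assumed.
import tables

with tables.open_file("Be_dataset/Be_structures.h5", mode="r") as f:
    for node in f.walk_nodes("/"):
        print(node._v_pathname)
```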
The experiments are designed to be run from specific subdirectories within the `data/` directory. Each subdirectory contains a `run.sh` script that executes the main Python scripts with the correct arguments. All results shown in the `data/` directory are for the `2Jmax=6` case; you can generate the same data for other `2Jmax` values by changing the arguments passed to the Python scripts in the `run.sh` file.
General Workflow:

- Navigate to the relevant subdirectory inside `data/` (e.g., `cd data/2D_pareto/`).
- Execute the shell script: `sh run.sh`. The script will call the main Python script from the parent directory.
- Output files (e.g., `.npy` matrices or `.csv` results) will be generated and saved within that same subdirectory.
- Jupyter notebooks can then be used to plot the figures shown in the paper from the generated results (a minimal loading sketch follows this list).
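As a sketch of the last step, the generated `results.csv` files can be loaded with pandas inside a notebook; the column names depend on the specific `get_*_data.py` script that produced the file:

```python
# Load the results of one experiment for plotting.
import pandas as pd

df = pd.read_csv("data/2D_pareto/results.csv")
print(df.head())
```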
The main scripts are located in the root directory. They are executed from the subdirectories within `data/`, which is also where their outputs are saved.
- `data/numpy_matrices_for_fitting/`
  - Corresponding Script: `get_matrices_for_fitting.py`
  - Purpose: This directory is for generating the descriptor matrices needed for model fitting. `run.sh` executes `get_matrices_for_fitting.py`.
  - Output: Saves descriptor matrices and target vectors as `.npy` files within this directory.
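For orientation, these matrices feed a linear model: each row of the descriptor matrix corresponds to an energy or force observation, and the qSNAP coefficients come from a least-squares fit. A minimal sketch with stand-in data (`np.linalg.lstsq` here is illustrative, not necessarily the solver used by the scripts):

```python
# Stand-in data only; the real matrices are the .npy files in this directory.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 31))              # stand-in descriptor matrix
b = rng.normal(size=100)                    # stand-in DFT targets
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
print(coeffs.shape)                         # (31,)
```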
- `data/leverage_scores_dataframe/`
  - Corresponding Script: `get_leverage_scores_data.py`
  - Purpose: This directory is for calculating leverage scores from the previously generated descriptor matrices. `run.sh` executes `get_leverage_scores_data.py`.
  - Input: Reads the `.npy` matrices from `data/numpy_matrices_for_fitting/`.
  - Output: Saves the leverage-score data as a `df_leevrage.csv` file within this directory.
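For reference, statistical leverage scores are the diagonal entries of the hat matrix H = A(AᵀA)⁻¹Aᵀ. A numerically stable sketch (not the repository's implementation) computes them from a thin QR factorization:

```python
# Leverage score of row i: h_i = [A (A^T A)^{-1} A^T]_ii = ||row i of Q||^2.
import numpy as np

def leverage_scores(A):
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q**2, axis=1)
```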
- `data/sampling_methods_comparison/N`, where `N` is the number of configurations to sample from the total dataset of 20,000 configurations
  - Corresponding Script: `get_sampling_performance_data.py`
  - Purpose: To generate performance data by training models on subsets of data selected via leverage, block-leverage, and random sampling. `run.sh` executes `get_sampling_performance_data.py`.
  - Input: Reads leverage scores from `data/leverage_scores_dataframe/`.
  - Output: Saves the performance results in a `results.csv` file inside this directory. This data is visualized using the `plot_sampling_performance.ipynb` notebook to create Figure 6.
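A sketch of the two simpler sampling strategies being compared; block leverage, which aggregates the scores of all rows belonging to one configuration, is omitted for brevity, and `leverage_scores` refers to the sketch above:

```python
# Draw N indices either proportionally to leverage scores or uniformly.
import numpy as np

def sample_indices(scores, N, rng, method="leverage"):
    if method == "leverage":
        p = scores / scores.sum()       # sample proportionally to leverage
        return rng.choice(len(scores), size=N, replace=False, p=p)
    return rng.choice(len(scores), size=N, replace=False)   # uniform random
```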
- `data/2D_pareto/`
  - Corresponding Script: `get_2D_pareto_data.py`
  - Purpose: To generate data for the 2D Pareto front by fitting models with different energy/force weights. `run.sh` executes `get_2D_pareto_data.py`.
  - Input: Reads descriptor matrices from `data/numpy_matrices_for_fitting/`.
  - Output: Saves the energy/force RMSE results in a `results.csv` file here. This data is used by `plot_2D_pareto.ipynb` to create Figures 4 and 5.
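Conceptually, each point on the front comes from one weighted least-squares fit; scanning the energy/force weight ratio and recording both RMSEs traces out the Pareto front. A minimal sketch (not the repository's implementation):

```python
# A_e/A_f: energy and force rows of the descriptor matrix; b_e/b_f: targets.
import numpy as np

def weighted_fit(A_e, b_e, A_f, b_f, w_e, w_f):
    A = np.vstack([w_e * A_e, w_f * A_f])
    b = np.concatenate([w_e * b_e, w_f * b_f])
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs
```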
- `data/3D_pareto/`
  - Corresponding Script: `get_3D_pareto_data.py`
  - Purpose: To perform a large-scale factorial experiment by training and testing MLIPs at varying subset sizes, energy/force weights, model complexities, and DFT precisions. `run.sh` executes `get_3D_pareto_data.py`.
  - Input: Reads all necessary matrices and leverage scores from the other `data/` subdirectories.
  - Output: Saves the comprehensive results in a `results.csv` file in this directory. The `plot_3D_pareto.ipynb` notebook uses this data for the Pareto analysis in Figures 7 and 8 of the main paper and Figure 7 of the SI.
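The sweep itself is a nested loop over the four axes; the grids below are hypothetical and only illustrate the structure (the actual values are set by the arguments in `run.sh`):

```python
# Hypothetical parameter grids illustrating the four factorial axes.
import itertools

subset_sizes = [500, 1000, 2000]        # training-set sizes (hypothetical)
force_weights = [0.1, 1.0, 10.0]        # energy/force weights (hypothetical)
two_jmax_values = [4, 6, 8]             # model complexities (2Jmax)
precision_levels = range(1, 7)          # DFT precision levels 1-6

for n, w, jmax, prec in itertools.product(
        subset_sizes, force_weights, two_jmax_values, precision_levels):
    ...  # train and evaluate one MLIP for this combination
```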
- `helper_functions.py`: Contains various utility functions (e.g., for data loading and error calculation) that are imported and used by the main scripts and notebooks.
- `get_matrices_for_fitting.py`: Generates descriptor matrices and corresponding energy/force vectors from the reference dataset. It takes `2Jmax` and the precision level as arguments.
- `perform_one_fit.py`: Handles the training of a single qSNAP model. It reads the matrices generated by `get_matrices_for_fitting.py`, sets the energy/force weights, performs linear regression, and prints training and testing RMSE values for energies and forces (a minimal RMSE sketch follows this list).
- `get_leverage_scores_data.py`: Calculates leverage scores based on a descriptor matrix.
- `get_sampling_performance_data.py`: Performs subsampling and fitting to compare random, leverage, and block-leverage sampling methods.
- `get_2D_pareto_data.py`: Runs a series of fits with different energy/force weights to generate data for 2D Pareto plots.
- `get_3D_pareto_data.py`: Trains MLIPs in a full factorial design to generate data for the multi-objective Pareto analysis.
- Jupyter notebooks (`plot_*.ipynb`): These are used for visualizing the data generated by the scripts and for creating the figures presented in the paper.
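As referenced above, the reported RMSE is the standard root-mean-square error; a minimal sketch (the repository's own error-calculation utilities live in `helper_functions.py`):

```python
# Root-mean-square error between predicted and reference values.
import numpy as np

def rmse(predicted, reference):
    diff = np.asarray(predicted) - np.asarray(reference)
    return np.sqrt(np.mean(diff**2))
```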