
❗ This is a modified local version of Auto-QChem: an automated workflow for the generation and storage of DFT calculations for organic molecules.


dkesada/auto-qchem-local

 
 


Introduction

❗ This is a modified local version of Auto-QChem in which I have removed all components related to the Slurm scheduler, MongoDB and cloud management, so that it can be run locally just by installing the Python library from GitHub. I've simplified the package as much as possible so that the relevant functions that create .gjf files, extract properties from Gaussian .log files, calculate Morfeus descriptors and generate .csv datasets can be accessed from a simple API defined as an object. Once the user instantiates this object, all functionality can be accessed easily.

Keep in mind that this also means that all computations that used to run concurrently are now performed sequentially on the local machine unless you parallelize them yourself. With that said, the only time-consuming functions are the Morfeus calculations. Depending on the molecule and the number of conformers, these can take quite a while. For example, calculating the descriptors of a 90-atom molecule with a metal component and 5 conformers can take around 15-20 minutes on an average machine, while the same molecule with 1 or 2 conformers can take up to 3-4 minutes. Simpler molecules without metal atoms will take less time and will have fewer failures in the Morfeus conformer optimization.

The idea of this fork is to let any lab perform its own calculations, even if it does not have a cluster infrastructure.

Installation

You will need to install this Auto-QChem version from GitHub with the following command:

pip install git+https://github.com/dkesada/auto-qchem_exp.git

Also, beware that the Morfeus calculations only work on Linux machines (and maybe macOS, but I haven't tried it), because they need the xtb handler for Python, which is only available on Linux. This will not work natively on Windows, but it can be used through the Windows Subsystem for Linux (WSL), so the package can also be installed on Windows machines with this intermediate layer. Alternatively, Docker is always an option for this kind of situation. To install the xtb Python handler on Linux, run the following command (I use conda, but pip could also be used):

conda install xtb-python --channel conda-forge

After this, the package should be ready to generate datasets. More information on the installation of xtb can be found here and here if needed.

OpenBabel issues

For quite a while now, OpenBabel has not installed cleanly with pip on Linux due to a versioning bug. As such, I cannot add it to install_requires, because pip would then fail when installing the package from GitHub. To work around this, OpenBabel must already be installed in your environment for the package to work. To do this, first install the binaries (I'll assume we are on Linux; if not, instructions can be found here):

sudo apt-get install openbabel

And then we can use conda to install it (there's no other option for now, it's either conda or suffering):

conda install openbabel

I'll update the package requirements if I'm ever able to install OpenBabel in Linux using pip. But for now, it is what it is.
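Since xtb and OpenBabel are installed separately from the package itself, it can help to confirm that both are visible to the Python interpreter before starting a long run. The following is a minimal sketch, assuming the importable top-level module names are `xtb` (from xtb-python) and `openbabel`; the helper `missing_modules` is a hypothetical name and is not part of this package.

```python
import importlib.util

# Assumed top-level module names: "xtb" (from xtb-python) and "openbabel".
REQUIRED_MODULES = ("xtb", "openbabel")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("xtb and openbabel are importable.")
```

If either name is reported as missing, the conda commands above are the first thing to re-check in the active environment.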

Code structure

The main functionality of the package is separated into different objects inside the api module: the GjfGenerator class that controls the generation of Gaussian input files, the MorfeusGenerator class that controls the morfeus calculations, the LogExtractor class that controls the extraction of information out of Gaussian .log files and the AutoChem controller class that serves as the main entry point to the package.

Usage examples

In the markdowns folder there are some examples of how to use the package as a script tool with argparse. The main_api.py file shows how to use the AutoChem class as the main entry point to the API of the package. Each of these components can be used independently, but using only the AutoChem class is probably the easiest way to use the package.

As for a full example, we have prepared some files and folders to showcase how to use this package. Inside the markdowns folder, there is the example directory with some example files and folders that we will use in the following sections.

Input .gjf files generation

Let's start with the generation of .gjf files. For this, the only thing we require is a single .smi file with a SMILES code per line for each of the compounds we want to analyze with Gaussian. An example .smi file is stored here. In this case, we would have a folder like this:

[Image gjf_1: folder containing the example .smi file]
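If you do not have a .smi file yet, it is just a plain text file with one SMILES code per line. As a small sketch, the hypothetical helper below (not part of the package) writes such a file from a list of strings; the example molecules and the file name are placeholders.

```python
from pathlib import Path


def write_smi_file(path, smiles_list):
    """Write one SMILES string per line, skipping blank entries.

    Returns the number of SMILES written.
    """
    cleaned = [s.strip() for s in smiles_list if s.strip()]
    Path(path).write_text("".join(f"{s}\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)


# Example: ethanol, benzene and acetic acid
write_smi_file("molecules.smi", ["CCO", "c1ccccc1", "CC(=O)O"])
```

Note that this helper does not validate the SMILES strings; it only takes care of the file layout.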

Then, we only need to use the AutoChem class to generate the .gjf files from this .smi file:

from autoqchem_local.api.api import AutoChem

# Instantiate the AutoChem object
controller = AutoChem(log_to_file=True)

controller.generate_gjf_files(path_to_smi_file)

This will generate a separate .gjf and .smi file for each SMILES in the original .smi file inside a new ./output_gjf/ directory:

[Image gjf_1: generated .gjf and .smi files inside ./output_gjf/]

The generated .gjf files are named after the InChI code of each molecule, because SMILES codes contain characters that cannot be used in file names. These files can now be fed into Gaussian for calculation. The .smi files with the same names as the .gjf files contain the SMILES code of each molecule and will be needed for the Morfeus calculation part. Additionally, if the log_to_file parameter is set to True, a log file will be generated with all relevant execution information of the controller object.

Full dataset generation

The dataset generation can be done all in one single function call or each part can be done individually. To begin this process, we need all .smi files and .log files if available in the same folder and with the same names. Additionally, a .xyz file with the coordinates of a conformer can also be present for each molecule. All files for the same molecule need to have the same name so that they can be joined in the same row in the final dataset.
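Because rows are joined by file name, a quick check of which extensions exist for each name can catch mismatches before a long run. The sketch below uses only the standard library; `group_files_by_name` and `names_without_smi` are hypothetical helpers, and searching subdirectories recursively is an assumption made here for convenience.

```python
from collections import defaultdict
from pathlib import Path

# Extensions mentioned in this README for a single molecule.
EXTENSIONS = {".smi", ".log", ".xyz"}


def group_files_by_name(data_dir):
    """Map each base file name to the set of relevant extensions found for it."""
    groups = defaultdict(set)
    for path in Path(data_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            groups[path.stem].add(path.suffix.lower())
    return dict(groups)


def names_without_smi(data_dir):
    """Return the base names that have a .log or .xyz file but no .smi file."""
    groups = group_files_by_name(data_dir)
    return sorted(name for name, exts in groups.items() if ".smi" not in exts)
```

For example, `names_without_smi(path_to_folder)` lists the molecules whose .log or .xyz file has no matching .smi file.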

To run the full pipeline in one call, we use the following code:

controller.generate_dataset(data_dir=path_to_folder, gaussian=True)

This generates intermediate files and eventually produces the full_dataset.csv file, which contains the information from the .log files (if available) together with the Morfeus properties of the molecules.

[Image full: the resulting full_dataset.csv]

All intermediate steps can be performed independently if so desired with the other functions of the AutoChem class.
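Once full_dataset.csv exists, a quick look at its size and at the columns with missing values can tell you whether some molecules lacked a .log file or failed in the Morfeus step. This is a minimal standard-library sketch; `summarize_csv` is a hypothetical helper, and it assumes that missing values are written as empty cells.

```python
import csv


def summarize_csv(csv_path):
    """Return (row count, column names, columns with at least one empty cell)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        n_rows = 0
        incomplete = set()
        for row in reader:
            n_rows += 1
            # A cell is treated as missing when it is absent or an empty string.
            incomplete.update(col for col in columns if row.get(col) in (None, ""))
    return n_rows, columns, sorted(incomplete)
```

Calling `summarize_csv("full_dataset.csv")` in the output folder returns the three values as a tuple; the actual column names depend on what the package extracted.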

Morfeus calculation

To calculate the morfeus properties of some molecules, we need the individual .smi files for each of the molecules inside a directory (they can be stored in further subdirectories inside, the controller will look for .smi files recursively through the dir tree) and optionally the .log and .xyz files with the same names as the .smi files.

[Image morf_1: directory with the .smi, .log and .xyz files of each molecule]

With this, we can use the AutoChem object to first calculate the Morfeus properties of each molecule and then join all intermediate .csv files into a single one. Please bear in mind that this is the most computationally expensive process (other than running Gaussian itself, which is outside the scope of this package), so it can take quite a while:

# Calculate morfeus properties
controller.process_morfeus(data_dir=path_to_folder)

This method creates a .csv file for each processed molecule with its morfeus properties. Afterwards, we can join all of them together into a single table. This table can be your last step if you do not want to process .log files:

[Image morf_2: one Morfeus .csv file per processed molecule]

# Join all morfeus .csv separate files into a single one
controller.join_morfeus_csv_files(data_dir=path_to_folder)

[Image morf_3: the joined Morfeus .csv file]
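Since the Morfeus step runs sequentially by default, one possible way to spread it over several cores is to split the molecules into subdirectories and process each subdirectory in its own worker process. The sketch below rests on several assumptions: that `process_morfeus` can safely run in separate processes at the same time, that `AutoChem(log_to_file=False)` is an acceptable way to build a controller per process, and that .smi files sit inside subdirectories of the data folder (files placed directly in the top folder are not picked up here). The names `find_batches`, `run_on_dirs` and `morfeus_worker` are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def find_batches(data_dir):
    """Return the immediate subdirectories of `data_dir` that contain .smi files."""
    return sorted(
        str(sub) for sub in Path(data_dir).iterdir()
        if sub.is_dir() and any(sub.rglob("*.smi"))
    )


def run_on_dirs(worker, dirs, max_workers=2, executor_cls=ProcessPoolExecutor):
    """Call `worker(directory)` for every directory using a pool of workers."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, dirs))


def morfeus_worker(directory):
    """Hypothetical worker: builds its own controller and processes one batch."""
    # Imported here so that each worker process sets up the package on its own.
    from autoqchem_local.api.api import AutoChem

    controller = AutoChem(log_to_file=False)
    controller.process_morfeus(data_dir=directory)
    return directory
```

A run would then look like `run_on_dirs(morfeus_worker, find_batches(path_to_folder))`, called from a script guarded by `if __name__ == "__main__":`, followed by the usual `controller.join_morfeus_csv_files(data_dir=path_to_folder)` to collect the results. Keep `max_workers` small at first, since each Morfeus calculation is already heavy on CPU and memory.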

.log file extraction

We can join all the extracted information from different .log files into a single .csv with a single function call:

controller.process_log_files(data_dir=path_to_folder, output_path=path_to_folder)

[Image log: the .csv file with the values extracted from the .log files]

Merging all files

In this last step, we merge both the morfeus calculations and the log extractions into a single pandas dataframe and save it to a .csv file, obtaining the same result as with the generate_dataset() function:

import os

# Join both morfeus and log files (os.path.join works with or without a trailing slash in path_to_folder)
res = controller.join_log_and_morfeus(log_dir=os.path.join(path_to_folder, 'log_values.csv'),
                                      morfeus_dir=os.path.join(path_to_folder, 'morfeus_values.csv'))

# Store the dataframe as the final .csv file
res.reset_index(drop=True, inplace=True)
res.to_csv(os.path.join(path_to_folder, 'full_dataset.csv'), index=False)

[Image full: the resulting full_dataset.csv]

Standalone script

If, rather than using the package as a Python module, you prefer to use this functionality as a standalone command-line script, the main_api.py file contains an example of how to define one. This file can be used as an entry point to the package functionality through simple command-line calls using the argparse module.
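To give an idea of what such an entry point can look like, here is a minimal argparse sketch. It is not the content of main_api.py: the subcommand names `gjf` and `dataset` and their arguments are assumptions chosen for illustration, and only the `AutoChem` calls shown earlier in this README are used.

```python
import argparse


def build_parser():
    """Build a command-line parser with two hypothetical subcommands."""
    parser = argparse.ArgumentParser(
        description="Local Auto-QChem helper (illustrative sketch)"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    gjf = sub.add_parser("gjf", help="generate .gjf files from a .smi file")
    gjf.add_argument("smi_file", help="path to a .smi file with one SMILES per line")

    dataset = sub.add_parser("dataset", help="build full_dataset.csv from a folder")
    dataset.add_argument("data_dir", help="folder with the .smi, .log and .xyz files")
    return parser


def main(argv=None):
    """Parse the arguments and dispatch to the AutoChem controller."""
    args = build_parser().parse_args(argv)

    # Imported here so that --help works even if the package is not installed.
    from autoqchem_local.api.api import AutoChem

    controller = AutoChem(log_to_file=True)
    if args.command == "gjf":
        controller.generate_gjf_files(args.smi_file)
    else:
        controller.generate_dataset(data_dir=args.data_dir, gaussian=True)
```

Saved in a file that calls `main()` under the usual `if __name__ == "__main__":` guard, this could be run as `python my_script.py gjf molecules.smi` or `python my_script.py dataset ./data/`, where `my_script.py` is a placeholder name.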
