maxjr82/PCA-for-WS22

Principal Component Analysis for the WS22 datasets

This repository contains a custom Python script that performs a dimensionality reduction analysis of the molecular geometries stored in the datasets of the WS22 database, which is hosted on Zenodo (https://doi.org/10.5281/zenodo.6985377).

The script works in three steps:

First, a built-in function converts the Cartesian coordinates of each molecular geometry into a pairwise distance descriptor of size $N_{atoms} (N_{atoms} - 1)/2$, which contains all unique atom-atom distances. Then, this descriptor is rescaled with min-max scaling. Finally, the rescaled data is passed to a PCA method, which projects the high-dimensional descriptor onto a compact 2D representation for visualization purposes.
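The three steps can be sketched as follows. This is a minimal illustration rather than the code in dimred.py: the function names `pairwise_distances` and `project_2d` are hypothetical, the random coordinates stand in for a real WS22 dataset, and the scaler and PCA are used with the default scikit-learn settings, which may differ from what the script uses.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler


def pairwise_distances(coords):
    """Map geometries of shape (n_geoms, n_atoms, 3) to a descriptor of
    shape (n_geoms, n_atoms * (n_atoms - 1) / 2) with all unique distances."""
    coords = np.asarray(coords, dtype=float)
    n_atoms = coords.shape[1]
    # Index pairs (i, j) with i < j, i.e. the strict upper triangle.
    i, j = np.triu_indices(n_atoms, k=1)
    diff = coords[:, i, :] - coords[:, j, :]
    return np.linalg.norm(diff, axis=-1)


def project_2d(coords):
    """Descriptor -> min-max scaling -> first two principal components."""
    descriptor = pairwise_distances(coords)
    scaled = MinMaxScaler().fit_transform(descriptor)
    return PCA(n_components=2).fit_transform(scaled)


# Synthetic stand-in data: 50 geometries of a 5-atom molecule (assumption).
rng = np.random.default_rng(0)
demo_coords = rng.normal(size=(50, 5, 3))
demo_desc = pairwise_distances(demo_coords)
demo_pcs = project_2d(demo_coords)
print(demo_desc.shape, demo_pcs.shape)
```

For a molecule with 5 atoms, the descriptor has 5 * 4 / 2 = 10 entries per geometry, and the projection keeps 2 values per geometry.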

Requirements

To run this script, the following packages should be installed:

  • python3 (tested with version 3.8.6)
  • glob (part of the Python standard library, no separate installation needed)
  • numpy
  • pandas
  • sklearn (installed as the scikit-learn package)

How to use

After downloading the desired NPZ datasets from the Zenodo repository to a local directory, run the script directly from a Linux terminal as follows:

python dimred.py

The output of the script is a zipped CSV file. Two of its columns store the calculated principal components for each molecular dataset, and an additional column stores the corresponding labels of the molecular conformations, taken from the original datasets.
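A file of this kind can be inspected with pandas. The sketch below is only an illustration under stated assumptions: the file name `pca_demo.zip` and the column names `PC1`, `PC2`, and `label` are hypothetical, and a small synthetic table is written first so that the example runs on its own. Check the header of the actual output file for the real names.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for the script output (column names are assumptions).
rng = np.random.default_rng(1)
frame = pd.DataFrame({
    "PC1": rng.normal(size=6),
    "PC2": rng.normal(size=6),
    "label": [0, 0, 1, 1, 2, 2],
})

# Write a zipped CSV to a temporary directory, then read it back.
out_path = os.path.join(tempfile.mkdtemp(), "pca_demo.zip")
frame.to_csv(out_path, index=False, compression="zip")
loaded = pd.read_csv(out_path, compression="zip")

# Count how many geometries carry each conformation label.
counts = loaded.groupby("label").size()
print(counts)
```

From here, the two principal component columns can be plotted against each other and colored by label to visualize the conformational space.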
