ParadoxicalNerd/merck-anomaly-detection

Anomaly Detection

This library detects anomalies in Apple Watch data. It first reduces dimensionality with Principal Component Analysis (PCA), then computes Mahalanobis distances to find outliers, and finally uses the standard deviation to set the outlier threshold. Data is stored in an SQL database.
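The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code in Healthkit.py: the function name `mahalanobis_outliers`, the default of 3 components, and the "mean plus `n_std` standard deviations" cutoff are all assumptions made for the example.

```python
import numpy as np


def mahalanobis_outliers(X, n_components=3, n_std=3.0):
    """Return (distances, threshold, outlier_mask) for the rows of X.

    Hypothetical helper: PCA -> Mahalanobis distance -> std-based threshold.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: PCA via SVD of the mean-centered data.
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T  # shape: (n_samples, n_components)

    # Step 2: Mahalanobis distance of every point from the mean of the scores.
    cov = np.atleast_2d(np.cov(scores, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = scores - scores.mean(axis=0)
    squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    dist = np.sqrt(np.maximum(squared, 0.0))

    # Step 3: threshold from the spread of the distances (assumed rule).
    threshold = dist.mean() + n_std * dist.std()
    return dist, threshold, dist > threshold
```

Calling `mahalanobis_outliers(X)` on a samples-by-features array returns the per-row distances, the cutoff, and a boolean mask marking the rows treated as anomalies.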

The library consists of 3 files:

  • SQL_Interface.py: Creates a connection to a test database and returns the connection together with a cursor for executing queries. In production, replace it with the actual database that will be used.
  • XML_to_SQL.py: Reads multiple Apple HealthKit exports (zip files), extracts the required file in memory, and adds the user data to the database.
  • Healthkit.py: The core of the project; it reads the data from the database and computes the metric used for anomaly detection.
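To make the division of work concrete, here is a rough sketch of the first two roles using only the Python standard library. It is not the code in SQL_Interface.py or XML_to_SQL.py: the function names, the `records` table schema, the in-memory SQLite database, and the archive member path `apple_health_export/export.xml` are assumptions for illustration.

```python
import io
import sqlite3
import zipfile
import xml.etree.ElementTree as ET


def connect_test_db():
    """Return (connection, cursor) for a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "user_id INTEGER, type TEXT, start_date TEXT, value REAL)"
    )
    return conn, cur


def load_export(zip_bytes, user_id, cur, member="apple_health_export/export.xml"):
    """Parse the export XML straight from the zip (no temp files) and insert it.

    Only <Record> elements with a numeric value are stored; returns the row count.
    """
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as xml_file:
            for _, elem in ET.iterparse(xml_file):
                if elem.tag == "Record":
                    try:
                        value = float(elem.get("value"))
                    except (TypeError, ValueError):
                        value = None  # missing or categorical value: skip it
                    if value is not None:
                        rows.append(
                            (user_id, elem.get("type"), elem.get("startDate"), value)
                        )
                    elem.clear()  # release the parsed element to limit memory use
    cur.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```

Returning the connection and cursor as a pair keeps the rest of the code independent of the database engine, so swapping in a production database only touches the connection function.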

This library was developed for Merck in collaboration with the Purdue Data Mine.

Installation

Clone the anomaly_detection branch of the repository:

git clone -b anomaly_detection https://github.com/ParadoxicalNerd/datamine-merck-biometrics-ds.git

Then install the requirements:

pip install -r requirements.txt

Usage

Change the dataset path to point to a folder structured like this (the number in parentheses becomes the assigned user id):

dataset
├── export (0).zip
├── export (1).zip
├── export (2).zip
└── export (3).zip
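As a sketch of how the user id could be read from those file names, the snippet below scans a folder and maps each id to its archive. The helper name `find_exports` and the exact file-name pattern are assumptions for illustration; the library's own lookup may differ.

```python
import re
from pathlib import Path

# Matches names such as "export (3).zip" and captures the number.
EXPORT_NAME = re.compile(r"^export \((\d+)\)\.zip$")


def find_exports(dataset_dir):
    """Return {user_id: path} for every 'export (N).zip' in dataset_dir."""
    exports = {}
    for path in Path(dataset_dir).iterdir():
        match = EXPORT_NAME.match(path.name)
        if match:
            exports[int(match.group(1))] = path
    # Sort by user id so iteration order is predictable.
    return dict(sorted(exports.items()))
```

Files that do not follow the naming pattern are ignored rather than assigned an id.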

An example implementation of the library can be seen in example.py.

Note: You need a web browser to view the 3D plot that example.py generates for your data. A sample of the generated plot is available in scatter_plot.html.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update the README as appropriate.

Resources

Anomaly detection overview: Outlines the procedure needed to conduct anomaly detection.

Anomaly detection code: Walks through the code needed to conduct anomaly detection.

PCA: See the Unsupervised Learning chapter in “Introduction to Machine Learning with Python: A Guide for Data Scientists”, published by O'Reilly.

Mahalanobis Distance Math: Explains the math behind Mahalanobis distance and why we use it.

Mahalanobis Distance SciPy: Covers the SciPy Mahalanobis distance function.

Contact

Pankaj Meghani — meghanipankaj5@gmail.com
