This repo contains the code implementing anomaly detection techniques on a given dataset, developed as part of my Bachelor's thesis in 2024. A relevant paper discussing the topic: Deep Learning for Time Series Anomaly Detection: A Survey.
The thesis studies the general problem of anomaly detection, implements anomaly detection algorithms, and applies them to real-world data provided by a private company. The results of the various algorithms are then compared and discussed.
Given an industrial machine, it is possible to detect deviations in the collected data which could indicate machine degradation or imminent malfunction. The goal of anomaly detection algorithms is precisely to learn when such deviations occur and to detect them in real time.
These techniques can be a useful tool for optimizing the resilience of industrial processes, for example, by enabling the prediction of necessary maintenance interventions for industrial systems. These techniques not only allow for the precise identification of such deviations, but also enable dynamic responses to the emergence of previously unseen anomalies.
The field's application prospects have expanded particularly in the time series domain, where data can follow especially complex patterns.
The time_series.ipynb file provides the code used for the implementation in the first section and the results on different types of imbalanced datasets in the following sections.
- Programming language: Python 3.13+
- Libraries used for implementing the algorithms: Scikit-learn, TensorFlow, DeepOD, kneed, lof_autotuner.
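As a taste of the kind of detector these libraries provide, here is a minimal, self-contained sketch using scikit-learn's IsolationForest on synthetic data (illustrative only; it is not the thesis pipeline, and the data is randomly generated):

```python
# Sketch: unsupervised anomaly detection with scikit-learn's IsolationForest
# on synthetic data. Not the thesis pipeline; purely illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))   # inliers
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 3))   # obvious anomalies
X = np.vstack([normal, outliers])

# contamination is the expected fraction of anomalies in the data
clf = IsolationForest(contamination=0.05, random_state=42).fit(X)
pred = clf.predict(X)  # +1 = normal, -1 = anomaly
print("flagged as anomalous:", (pred == -1).sum())
```

With the contamination parameter set, roughly that fraction of points is flagged; the five injected outliers are far enough from the inlier cloud to be isolated quickly.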
The code provided in the time_series.ipynb file can be run on Colab, Anaconda, or locally as a .py file:

```
python time_series.py <FILEPATH> <FILENAME> <FILENAME_TRAIN> <NUM_COMPONENTS>
```
Where:

Parameter | Description
---|---
`<FILEPATH>` | Path to the directory containing the datasets
`<FILENAME>` | Name of the test file (retrieved as `<FILEPATH>/Noise/<FILENAME>.csv`)
`<FILENAME_TRAIN>` | Name of the training file (retrieved as `<FILEPATH>/Training e Test/<FILENAME_TRAIN>.csv`)
`<NUM_COMPONENTS>` | Number of features in the dataset
Example:

```
python time_series.py ./data machine_test machine_train 10
```
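For reference, one possible way the script could map these arguments onto the file paths described above (a hypothetical sketch; `parse_args` is illustrative and the actual time_series.py may handle its arguments differently):

```python
# Hypothetical sketch of how time_series.py might turn its four positional
# arguments into concrete paths; the real script may differ.
def parse_args(argv):
    if len(argv) != 5:
        raise SystemExit(
            "usage: python time_series.py <FILEPATH> <FILENAME> "
            "<FILENAME_TRAIN> <NUM_COMPONENTS>"
        )
    filepath, filename, filename_train, num_components = argv[1:]
    return {
        "test_csv": f"{filepath}/Noise/{filename}.csv",
        "train_csv": f"{filepath}/Training e Test/{filename_train}.csv",
        "num_components": int(num_components),
    }

args = parse_args(["time_series.py", "./data", "machine_test", "machine_train", "10"])
print(args["test_csv"])   # ./data/Noise/machine_test.csv
print(args["train_csv"])  # ./data/Training e Test/machine_train.csv
```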
The datasets should be structured as follows:

Required format:
- First `NUM_COMPONENTS - 1` columns: feature data
- Last column: ground truth labels (0 = normal, 1 = anomaly)
Directory structure:

```
<FILEPATH>/
├── Noise/
│   └── <FILENAME>.csv          # Test dataset
└── Training e Test/
    └── <FILENAME_TRAIN>.csv    # Training dataset
```
Example CSV structure:

```
feature_1,feature_2,feature_3,...,feature_n,label
1.2,0.5,2.1,...,0.8,0
0.9,1.3,1.7,...,1.2,1
...
```
The ground truth labels in both datasets are used to compute the performance metrics of the anomaly detection algorithms.
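As an illustration of that evaluation step, here is a sketch using scikit-learn metrics on a tiny in-memory CSV in the format above (the predictions are hard-coded placeholders, not the output of a real detector):

```python
# Sketch: evaluating predictions against the ground-truth label column.
# The CSV below is an in-memory stand-in for <FILEPATH>/Noise/<FILENAME>.csv.
import io

import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

csv = io.StringIO(
    "feature_1,feature_2,label\n"
    "1.2,0.5,0\n"
    "0.9,1.3,0\n"
    "7.5,8.1,1\n"
    "1.1,0.7,0\n"
)
df = pd.read_csv(csv)
X = df.iloc[:, :-1].to_numpy()       # feature columns (would feed a detector)
y_true = df.iloc[:, -1].to_numpy()   # last column: 0 = normal, 1 = anomaly

# Placeholder predictions, purely to show the metric computation.
y_pred = [0, 0, 1, 1]

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```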
Some comments are in Italian since the thesis was written for an Italian university.
The content of the thesis will be re-adapted into a dedicated blog post as an educational contribution.
Here is a list of papers and other resources which were studied before implementing the code. They span different topics such as the curse of dimensionality, dimensionality reduction techniques (PCA, t-SNE, UMAP), and ML and DL algorithms for anomaly detection.
- L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik. Dimensionality Reduction: A Comparative Review. Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
- Jonathon Shlens. A Tutorial on Principal Component Analysis. Computing Research Repository (CoRR), abs/1404.1100, 2014.
- Quan Wang. Kernel Principal Component Analysis and its Applications in Face Recognition and Active Shape Models, 2014.
- Barnabás Póczos. Manifold Learning. CMU lecture notes.
- Geoffrey E. Hinton and Sam Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2002.
- Martin Wattenberg, Fernanda Viégas, and Ian Johnson. How to Use t-SNE Effectively. Distill, 2016.
- Laurens van der Maaten and Geoffrey Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 2008.
- Leland McInnes. How UMAP Works, 2018.
- Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. The Journal of Open Source Software (JOSS), 2020.
- Andy Coenen and Adam Pearce. Understanding UMAP.
- Dmitry Kobak and George Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 2021.
- Mark Schwabacher, Nikunj Oza, and Bryan Matthews. Unsupervised Anomaly Detection for Liquid-Fueled Rocket Propulsion Health Monitoring. Technical Report NASA/TP-2009-214228, NASA Ames Research Center, 2009.
- Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2012.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996.
- Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000.
- Stephen Howard. The Elliptical Envelope. 2007.
- Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation Forest. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), 2008.
- David M.J. Tax and Robert P.W. Duin. Support Vector Data Description. Machine Learning, 2004.
- Zahra Zamanzadeh Darban et al. Deep Learning for Time Series Anomaly Detection: A Survey. Association for Computing Machinery, 2022.
- Hongzuo Xu, Guansong Pang, Yijie Wang, and Yongjun Wang. Deep isolation forest for anomaly detection. IEEE Transactions on Knowledge and Data Engineering, 2023.
- Lukas Ruff et al. Deep One-Class Classification. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research. PMLR, 2018.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, MA, 2012.
- Giuliano Mazzanti and Valter Roselli. Appunti di Algebra lineare, Geometria analitica, Tensori. Pitagora, Bologna, 2013. (in Italian)
- Yanzhao Jhu. Deep Learning and Information Theory, 2017.
- Satya Kumar Vadlamani. Automatic-Local-Outlier-Factor-Tuning.
- Zekun Xu, Deovrat Kakde, and Arin Chaudhuri. Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection. IEEE, 2019.