This repo contains the code implementing anomaly detection techniques on a given dataset, developed as part of my Bachelor's thesis in 2024. A relevant paper discussing the topic: Deep Learning for Time Series Anomaly Detection: A Survey.
The thesis studies the general problem of anomaly detection, implements anomaly detection algorithms, and applies them to real-world data provided by a private company. The results of the various algorithms are then compared and discussed.
Given an industrial machine, it is possible to detect deviations in the collected data which could indicate machine degradation or imminent malfunction. The goal of anomaly detection algorithms is precisely to learn when such deviations occur and to detect them in real time.
These techniques can be a useful tool for optimizing the resilience of industrial processes, for example, by enabling the prediction of necessary maintenance interventions for industrial systems. These techniques not only allow for the precise identification of such deviations, but also enable dynamic responses to the emergence of previously unseen anomalies.
The field's application prospects have expanded particularly in the time series domain, where data can follow especially complex patterns.
The time_series.ipynb file provides the code used for the implementation in the first section and the results on different types of imbalanced datasets in the following sections.
- Programming language: Python 3.13+
- Libraries used for implementing the algorithms: Scikit-learn, TensorFlow, DeepOD, kneed, lof_autotuner.
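As a taste of the kind of detector these libraries provide, here is a minimal, self-contained sketch using scikit-learn's IsolationForest on synthetic data (illustrative only; it is not the thesis pipeline, and the data is randomly generated):

```python
# Sketch: unsupervised anomaly detection with scikit-learn's IsolationForest
# on synthetic data. Not the thesis pipeline; purely illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))   # inliers
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 3))   # obvious anomalies
X = np.vstack([normal, outliers])

# contamination is the expected fraction of anomalies in the data
clf = IsolationForest(contamination=0.05, random_state=42).fit(X)
pred = clf.predict(X)  # +1 = normal, -1 = anomaly
print("flagged as anomalous:", (pred == -1).sum())
```

With the contamination parameter set, roughly that fraction of points is flagged; the five injected outliers are far enough from the inlier cloud to be isolated quickly.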
The code provided in the time_series.ipynb file can be run on Colab, Anaconda, or locally as a .py file:

```
python time_series.py <FILEPATH> <FILENAME> <FILENAME_TRAIN> <NUM_COMPONENTS>
```
Where:

Parameter | Description
---|---
`<FILEPATH>` | Path to the directory containing the datasets
`<FILENAME>` | Name of the test file (retrieved as `<FILEPATH>/Noise/<FILENAME>.csv`)
`<FILENAME_TRAIN>` | Name of the training file (retrieved as `<FILEPATH>/Training e Test/<FILENAME_TRAIN>.csv`)
`<NUM_COMPONENTS>` | Number of features in the dataset
Example:

```
python time_series.py ./data machine_test machine_train 10
```
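For reference, one possible way the script could map these arguments onto the file paths described above (a hypothetical sketch; `parse_args` is illustrative and the actual time_series.py may handle its arguments differently):

```python
# Hypothetical sketch of how time_series.py might turn its four positional
# arguments into concrete paths; the real script may differ.
def parse_args(argv):
    if len(argv) != 5:
        raise SystemExit(
            "usage: python time_series.py <FILEPATH> <FILENAME> "
            "<FILENAME_TRAIN> <NUM_COMPONENTS>"
        )
    filepath, filename, filename_train, num_components = argv[1:]
    return {
        "test_csv": f"{filepath}/Noise/{filename}.csv",
        "train_csv": f"{filepath}/Training e Test/{filename_train}.csv",
        "num_components": int(num_components),
    }

args = parse_args(["time_series.py", "./data", "machine_test", "machine_train", "10"])
print(args["test_csv"])   # ./data/Noise/machine_test.csv
print(args["train_csv"])  # ./data/Training e Test/machine_train.csv
```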
The datasets should be structured as follows:

Required format:
- First `NUM_COMPONENTS - 1` columns: feature data
- Last column: ground truth labels (0 = normal, 1 = anomaly)
Directory structure:

```
<FILEPATH>/
├── Noise/
│   └── <FILENAME>.csv          # Test dataset
└── Training e Test/
    └── <FILENAME_TRAIN>.csv    # Training dataset
```
Example CSV structure:

```
feature_1,feature_2,feature_3,...,feature_n,label
1.2,0.5,2.1,...,0.8,0
0.9,1.3,1.7,...,1.2,1
...
```
The ground truth labels in both datasets are used to compute the performance metrics of the anomaly detection algorithms.
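As an illustration of that evaluation step, here is a sketch using scikit-learn metrics on a tiny in-memory CSV in the format above (the predictions are hard-coded placeholders, not the output of a real detector):

```python
# Sketch: evaluating predictions against the ground-truth label column.
# The CSV below is an in-memory stand-in for <FILEPATH>/Noise/<FILENAME>.csv.
import io

import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

csv = io.StringIO(
    "feature_1,feature_2,label\n"
    "1.2,0.5,0\n"
    "0.9,1.3,0\n"
    "7.5,8.1,1\n"
    "1.1,0.7,0\n"
)
df = pd.read_csv(csv)
X = df.iloc[:, :-1].to_numpy()       # feature columns (would feed a detector)
y_true = df.iloc[:, -1].to_numpy()   # last column: 0 = normal, 1 = anomaly

# Placeholder predictions, purely to show the metric computation.
y_pred = [0, 0, 1, 1]

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```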
Some comments are in Italian since the thesis was written for an Italian university.
The content of the thesis will be re-adapted into a dedicated blog post as an educational contribution.
Here is a list of papers and other resources which were studied before implementing the code. They span different topics such as the curse of dimensionality, dimensionality reduction techniques (PCA, t-SNE, UMAP), and ML and DL algorithms for anomaly detection.
- L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik. Dimensionality Reduction: A Comparative Review. Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
- Jonathon Shlens. A Tutorial on Principal Component Analysis. Computing Research Repository (CoRR), abs/1404.1100, 2014.
- Quan Wang. Kernel Principal Component Analysis and its Applications in Face Recognition and Active Shape Models, 2014.
- Barnabás Póczos. Manifold Learning. CMU lecture notes.
- Geoffrey E. Hinton and Sam Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2002.
- Martin Wattenberg, Fernanda Viégas, and Ian Johnson. How to Use t-SNE Effectively. Distill, 2016.
- Laurens van der Maaten and Geoffrey Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 2008.
- Leland McInnes. How UMAP Works, 2018.
- Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. The Journal of Open Source Software (JOSS), 2020.
- Andy Coenen and Adam Pearce. Understanding UMAP.
- Dmitry Kobak and George Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 2021.
- Mark Schwabacher, Nikunj Oza, and Bryan Matthews. Unsupervised Anomaly Detection for Liquid-Fueled Rocket Propulsion Health Monitoring. Technical Report NASA/TP-2009-214228, NASA Ames Research Center, 2009.
- Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2012.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996.
- Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000.
- Stephen Howard. The Elliptical Envelope. 2007.
- Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation Forest. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), 2008.
- David M.J. Tax and Robert P.W. Duin. Support Vector Data Description. Machine Learning, 2004.
- Zahra Zamanzadeh Darban et al. Deep Learning for Time Series Anomaly Detection: A Survey. Association for Computing Machinery, 2022.
- Hongzuo Xu, Guansong Pang, Yijie Wang, and Yongjun Wang. Deep isolation forest for anomaly detection. IEEE Transactions on Knowledge and Data Engineering, 2023.
- Lukas Ruff et al. Deep One-Class Classification. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research. PMLR, 2018.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, MA, 2012.
- Giuliano Mazzanti and Valter Roselli. Appunti di Algebra lineare, Geometria analitica, Tensori. Pitagora, Bologna, 2013. (in Italian)
- Yanzhao Jhu. Deep Learning and Information Theory, 2017.
- Satya Kumar Vadlamani. Automatic-Local-Outlier-Factor-Tuning.
- Zekun Xu, Deovrat Kakde, and Arin Chaudhuri. Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection. IEEE, 2019.