anomaly-detection-in-log-files

Introduction

Computer-generated records, commonly known as logs, capture timestamped data about the actions and decisions taken by applications, operating systems, and devices. Businesses rely on this data to ensure that their applications and tools are fully operational and secure. In this project, our Capstone for the Master of Applied Data Science, we explore anomaly detection in application log data. Our work follows the paper Anomaly Detection for Application Log Data [1], but our approach uses generalized feature extraction without any log-file-specific parsers.

Approach

The question we explore is how well this generalized feature extraction works. That is, how do models trained on the generalized feature extraction output compare to models trained on the output of the customized feature extraction used in other research studies?
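To make the idea of parser-free feature extraction concrete, here is a minimal sketch. The feature set and the function name `generic_features` are hypothetical illustrations and are not taken from ad_feature_extraction.py; they only show the kind of features that can be computed from any raw log line without knowing its format.

```python
import re


def generic_features(line: str) -> dict:
    """Compute format-agnostic features for one raw log line.

    Hypothetical feature set for illustration only; the project's
    actual features live in ad_feature_extraction.py.
    """
    tokens = line.split()
    return {
        # Overall size of the line
        "n_chars": len(line),
        "n_tokens": len(tokens),
        # Character-class counts that need no knowledge of the log format
        "n_digits": sum(ch.isdigit() for ch in line),
        "n_upper": sum(ch.isupper() for ch in line),
        "n_punct": sum(not ch.isalnum() and not ch.isspace() for ch in line),
        # Simple pattern flag: does the line contain a hex literal such as 0x1F?
        "has_hex": int(bool(re.search(r"0x[0-9a-fA-F]+", line))),
        "max_token_len": max((len(t) for t in tokens), default=0),
    }


# Made-up example line, not taken from BGL or Thunderbird
features = generic_features("ERROR disk 0x1F failed 3 times")
```

Because none of these features depend on field positions or message templates, the same function can be applied to any log source unchanged, which is the property the comparison above is meant to test.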

Our approach is divided into two parts:

Supervised Learning: framing the problem as a classification task, we trained a logistic regression model, a gradient boosted tree, and an XGBoost model to classify log lines as normal or anomalous.
Unsupervised Learning: to detect anomalies in the log files, we used the K-means clustering algorithm, One-Class SVM, and Isolation Forest.
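The two parts can be sketched as follows, assuming scikit-learn and NumPy are installed. The feature matrix here is synthetic random data standing in for the extracted features, and the hyperparameters (`contamination=0.05`, `max_iter=1000`) are illustrative assumptions, not the settings used in the notebooks. Only one model from each part is shown.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the extracted features: 200 "normal" rows and
# 10 "anomalous" rows drawn from a shifted distribution.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
X_anomalous = rng.normal(loc=6.0, scale=1.0, size=(10, 4))
X = np.vstack([X_normal, X_anomalous])
y = np.array([0] * 200 + [1] * 10)  # 1 = anomalous

# Supervised: a classifier learns from the labels.
clf = LogisticRegression(max_iter=1000).fit(X, y)
supervised_pred = clf.predict(X)

# Unsupervised: Isolation Forest never sees the labels.
# predict() returns -1 for outliers and +1 for inliers, so map -1 to 1.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
unsupervised_pred = (iso.predict(X) == -1).astype(int)
```

In practice the supervised model would be evaluated on held-out data rather than on its own training rows; the sketch skips the split to stay short.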

Data

BGL and Thunderbird data were used in this project. The data is provided by the Loghub collection:

Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. Loghub: A Large Collection of System Log Datasets towards Automated Log Analytics. Arxiv, 2020.

Information on BGL can be found here
Note: the data in the data folder is just a sample. The raw logs can be requested from Zenodo: https://doi.org/10.5281/zenodo.1144100

Information on Thunderbird can be found here
Note: the data in the data folder is just a sample. The raw logs can be requested from Zenodo: https://doi.org/10.5281/zenodo.1144100

[1] Grover, Aarish, "Anomaly Detection for Application Log Data" (2018). Master's Projects. 635. DOI: https://doi.org/10.31979/etd.znsb-bw4d

How to run the code

ad_feature_extraction.py preprocesses the data and generates the features used in the supervised and unsupervised machine learning sections of the project. The input to this file is the data listed above. Since generating this data takes a long time, you can download the generated files here.
The Supervised_learning folder contains the Jupyter notebooks that detail the various supervised learning models we ran during the course of the project.
The unsupervised_learning folder contains the Jupyter notebooks that detail the various unsupervised learning models we ran during the course of the project.
F1 Scores.ipynb evaluates the performance of the previously selected models under various conditions.
The images folder contains some of the visualizations and the code to generate them.
The data folder contains samples of the processed data. To get the full dataset, run ad_feature_extraction.py on the raw dataset.
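For reference, the F1 score used to compare models is the harmonic mean of precision and recall on the anomalous class. The helper below is a minimal standard-library sketch of that formula; the function name `f1_score_binary` and the toy labels are hypothetical, and the notebook itself may compute the metric differently (for example with a library function).

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return the F1 score for the `positive` class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration: 1 = anomalous, 0 = normal
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = f1_score_binary(y_true, y_pred)
```

F1 is a common choice for anomaly detection because anomalous lines are rare, so plain accuracy can look high even when a model misses most anomalies.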

