In this project, we developed a machine learning model to classify assembly code and byte files into their respective malware families. The dataset consisted of 200 GB of data: 50 GB of .bytes files and 150 GB of .asm files, with 10,868 files of each type (21,736 files in total). The dataset covered nine malware families: Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY, and Gatak.
The goal of this project was to develop a solution that could help identify and prevent the spread of malware. Malware, short for malicious software, refers to any software that is designed to harm or exploit a computer system without the user's knowledge or consent. This can include viruses, worms, Trojans, ransomware, and other types of malicious programs. By developing a machine learning model that can accurately classify these files, we aimed to help organizations like Microsoft (which runs its anti-malware utilities on over 150 million computers worldwide) identify and prevent the spread of malware.
- Ramnit: Steals sensitive information and gives a malicious hacker access to, and control of, the infected computer.
- Lollipop: Adware program that shows ads, redirects search engine results, monitors user actions, downloads applications, and sends information about the computer to a hacker.
- Kelihos_ver1: Kelihos Trojan that spreads through networks and carries out harmful activities on infected computers.
- Kelihos_ver3: Trojan family that distributes spam email messages containing hyperlinks to installers of the Kelihos malware.
- Vundo: Trojan family that can redirect web searches and display unwanted ads.
- Simda: Family of password-stealing Trojans that can give a malicious hacker backdoor access to, and control of, a system, then steal passwords and gather information about it.
- Tracur: Trojan family that can redirect web searches, display unwanted ads, and download and run other malware.
- Obfuscator.ACY: Malware family designed to evade detection by security software.
- Gatak: Malware family that steals sensitive information and gives a malicious hacker access to, and control of, the infected computer.
Given this dataset of assembly code and byte files, our task was to train a machine learning model that accurately classifies each file into its malware family. The model should classify new, unseen files with high accuracy and provide probability estimates for each class.
- The model should have a high accuracy rate in order to be effective at detecting and preventing the spread of malware.
- The model should be able to process and classify large numbers of files quickly and efficiently, while operating within the constraints of available computing resources.
The objective of this machine learning project was to develop a model that could accurately classify assembly code and byte files into their respective malware families and predict the probability of each data point belonging to each of the nine classes.
- Class probabilities were needed.
- Errors in class probabilities should be penalized.
- Primary Metric: Log loss
- Secondary Metric: Confusion Matrix
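Concretely, the standard multiclass log loss over $N$ files and the nine classes is

$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{9} y_{ij}\,\log p_{ij},$$

where $y_{ij}$ is 1 if file $i$ belongs to class $j$ (and 0 otherwise) and $p_{ij}$ is the predicted probability of that class. A confident prediction on the wrong class drives the loss toward infinity, which is why errors in the class probabilities are penalized heavily.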
- Preprocessing: The data was preprocessed by converting the .asm files into image files and extracting features using unigram and bigram approaches on hexadecimal pairs, as well as by extracting command, header, and pixel features. Feature selection based on chi-squared, ANOVA F, and mutual information scores was applied to reduce the number of features from about 140,000 to roughly 1,900.
- Modeling: Several machine learning models were trained on the preprocessed data, including LightGBM. The models were tuned with Hyperopt (Bayesian optimization) to find the best hyperparameters; see the sketch after this list.
- Evaluation: The performance of the models was evaluated using log loss as the primary metric and the confusion matrix as the secondary metric.
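The tuning loop can be sketched as follows. This is a minimal illustration with a synthetic stand-in for the feature matrix and a hypothetical search space, not the exact configuration used in the project:

```python
# Minimal sketch of Bayesian hyperparameter tuning with Hyperopt + LightGBM;
# the synthetic data and search space are illustrative assumptions.
import lightgbm as lgb
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Toy stand-in for the real ~1,900-column feature matrix.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=40,
                           n_classes=9, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2,
                                                stratify=y, random_state=0)

def objective(params):
    # Hyperopt samples floats; LightGBM expects integers for these two.
    params = {**params,
              "num_leaves": int(params["num_leaves"]),
              "n_estimators": int(params["n_estimators"])}
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_cv)        # class probabilities, as required
    return {"loss": log_loss(y_cv, probs), "status": STATUS_OK}

space = {
    "num_leaves": hp.quniform("num_leaves", 31, 255, 1),
    "n_estimators": hp.quniform("n_estimators", 100, 500, 50),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=20, trials=trials)
print("best parameters:", best)
```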
The dataset consists of 200 GB of data, with 50 GB of .bytes files and 150 GB of .asm files. We use multiprocessing to extract the features from these files.
We use a custom feature extractor, parallelized with multiprocessing, to read through the 50 GB of .bytes data. We run unigram and bigram counts on the sequences of hexadecimal pairs, such as D8 4C 5B 67 85 C0 75 3D 68 9C 00 00 00 E8 3E 1C, where each hexadecimal pair is treated as a single word. In addition to the n-gram features, we also extract a file size feature by calculating the size of each .bytes file in megabytes.
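A minimal sketch of this extraction is below. The glob pattern and the decision to drop the leading address column are assumptions about the file layout, not the project's exact code:

```python
# Parallel hex-pair unigram/bigram counting plus a file-size feature.
import glob
import os
from collections import Counter
from multiprocessing import Pool

def extract_bytes_features(path):
    """Hex-pair unigram/bigram counts and size (MB) for one .bytes file."""
    unigrams, bigrams = Counter(), Counter()
    with open(path) as f:
        for line in f:
            tokens = line.split()[1:]   # first token is the address, not data
            unigrams.update(tokens)
            bigrams.update(" ".join(p) for p in zip(tokens, tokens[1:]))
    size_mb = os.path.getsize(path) / (1024 ** 2)
    return path, unigrams, bigrams, size_mb

if __name__ == "__main__":
    paths = glob.glob("train/*.bytes")   # hypothetical data directory
    with Pool() as pool:
        results = pool.map(extract_bytes_features, paths)
```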
We divided each line of an .asm file into five parts: header, memory address, data, instructions, and comments.
.text:675A13BE 68 8C 40 5B 67 push offset off_675B408C ; CODE XREF: sub_675A12B5+1Dp
We use regular expressions to identify and separate the following elements in each line:
- Header: .text
- Memory address: 675A13BE
- Data: 68 8C 40 5B 67
- Instructions: push offset off_675B408C
- Comments: ; CODE XREF: sub_675A12B5+1Dp
For every file, we extract the count of each header, data token, and instruction.
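A minimal sketch of the line parsing, assuming the layout shown in the example above; the regular expression is illustrative, not the exact one used in the project:

```python
# Split one .asm line into header, address, data, instruction, and comment.
import re

LINE_RE = re.compile(
    r"^(?P<header>\.\w+):"              # section header, e.g. .text
    r"(?P<address>[0-9A-F]+)\s+"        # memory address
    r"(?P<data>(?:[0-9A-F?]{2}\s+)+)"   # raw data bytes (including ?? bytes)
    r"(?P<instruction>[^;]*?)"          # instruction and operands
    r"\s*(?P<comment>;.*)?$"            # optional trailing comment
)

line = (".text:675A13BE 68 8C 40 5B 67 push offset off_675B408C "
        "; CODE XREF: sub_675A12B5+1Dp")
m = LINE_RE.match(line)
if m:
    print(m.groupdict())
```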
The main goal of the EDA was to identify important features, as there were around 140,000 features in the dataset. To do this, we used techniques such as t-SNE to visualize the data and identify patterns and clusters. We also used the `SelectKBest` function from sklearn to select the top-performing features based on criteria such as `chi2`, `f_classif`, and `mutual_info_classif`.
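A minimal sketch of the selection step, shown on toy data; note that `chi2` requires non-negative features, while `f_classif` and `mutual_info_classif` do not:

```python
# Univariate feature selection with sklearn's SelectKBest.
import numpy as np
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                       mutual_info_classif)

def top_features(X, y, score_func, k):
    """Reduce X to its k best columns and report which ones were kept."""
    selector = SelectKBest(score_func=score_func, k=k)
    X_best = selector.fit_transform(X, y)
    return X_best, selector.get_support(indices=True)

rng = np.random.default_rng(0)
X = rng.random((200, 1000))              # toy non-negative feature matrix
y = rng.integers(0, 9, size=200)
X_best, kept = top_features(X, y, chi2, k=50)
print(X_best.shape, kept[:10])
```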
For each feature set, including the bytes unigram bag of words, bytes bigram bag of words, assembly pixel intensity features, assembly unigram bag of words, assembly command/instruction features, and assembly header features, we plotted t-SNE to compare the original features with the best and remaining features. By using the minimum number of features that gave well-separated t-SNE plots, we were able to identify the most important features for each set.
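The before/after comparison can be sketched as follows, reusing `X`, `X_best`, and `y` from the selection sketch above:

```python
# Side-by-side t-SNE embeddings of the original and selected features.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (feats, title) in zip(axes, [(X, "original features"),
                                     (X_best, "selected features")]):
    emb = TSNE(n_components=2, perplexity=30,
               random_state=0).fit_transform(feats)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=8, cmap="tab10")
    ax.set_title(title)
plt.show()
```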
We then combined the top features from each set to create the final feature set (a minimal stacking sketch follows the list). The number of features selected from each set was as follows:
- Size features: 5
- Bytes unigram bag of words: 257
- Bytes bigram bag of words: 137
- Assembly header features: 44
- Assembly unigram bag of words: 256
- Assembly command/instruction features: 160
- Assembly pixel intensity: 1000
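A minimal sketch of stacking the selected blocks into the final matrix; the toy arrays stand in for the per-set selections listed above:

```python
# Horizontally stack the per-set selections into one feature matrix.
import numpy as np

n_files = 10                                 # toy number of files
widths = (5, 257, 137, 44, 256, 160, 1000)   # size, bytes uni/bi, headers,
                                             # asm uni, commands, pixels
blocks = [np.random.rand(n_files, k) for k in widths]
X_final = np.hstack(blocks)
print(X_final.shape)                         # (10, 1859)
```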
Overall, the efforts put into feature engineering and feature reduction were successful, with well-separated clusters forming in the two-dimensional t-SNE projections.
In order to establish a baseline for comparison, we evaluated the performance of a random model. A model that assigns each of the nine classes the uniform probability 1/9 has a log loss of -log(1/9) = ln 9 ≈ 2.2, so our machine learning models should score below 2.2 to be considered an improvement.
We evaluated the performance of several machine learning models for this classification task, including K Nearest Neighbors, Logistic Regression, Random Forest, XGBoost, and LightGBM.
Of all the models tested, LightGBM performed the best, with a log loss of 0.015, an accuracy of 99.88%, and micro and macro F1 scores close to 1.
The precision and recall scores for all classes were between 0.98 and 1.
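A minimal sketch of this final evaluation, assuming a fitted LightGBM classifier `model` and a held-out split (`X_test`, `y_test`) analogous to the ones in the tuning sketch earlier:

```python
# Score the final model on both the primary and secondary metrics.
from sklearn.metrics import classification_report, confusion_matrix, log_loss

probs = model.predict_proba(X_test)           # per-class probability estimates
preds = probs.argmax(axis=1)

print("log loss:", log_loss(y_test, probs))   # primary metric
print(confusion_matrix(y_test, preds))        # secondary metric
print(classification_report(y_test, preds, digits=3))  # per-class P/R/F1
```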