In this project, we developed a machine learning model to classify assembly code and byte files into their respective malware families. The dataset consisted of 200 GB of data: 50 GB of .bytes files and 150 GB of .asm files, with 10,868 files of each type (21,736 files in total). The dataset covered nine malware families: Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY, and Gatak.
The goal of this project was to develop a solution that could help identify and prevent the spread of malware. Malware, short for malicious software, refers to any software that is designed to harm or exploit a computer system without the user's knowledge or consent. This can include viruses, worms, Trojans, ransomware, and other types of malicious programs. By developing a machine learning model that can accurately classify these files, we aimed to help organizations like Microsoft (which runs its anti-malware utilities on over 150 million computers worldwide) identify and prevent the spread of malware.
- Ramnit: Steals sensitive information and gives a malicious hacker access to, and control of, the infected computer.
- Lollipop: Adware program that shows ads, redirects search engine results, monitors user actions, downloads applications, and sends information about the computer to a hacker.
- Kelihos_ver1: Kelihos Trojan that spreads through networks and carries out harmful activities on infected computers.
- Kelihos_ver3: Trojan family that distributes spam email messages containing hyperlinks to installers of the Kelihos malware.
- Vundo: Trojan family that can redirect web searches and display unwanted ads.
- Simda: Family of password-stealing Trojans that can give a malicious hacker backdoor access to, and control of, a system, then steal passwords and gather information about it.
- Tracur: Trojan family that can redirect web searches, display unwanted ads, and download and run other malware.
- Obfuscator.ACY: Malware family designed to evade detection by security software.
- Gatak: Malware family that steals sensitive information and gives a malicious hacker access to, and control of, the infected computer.
Given this dataset of assembly code and byte files, our task was to train a machine learning model that accurately classifies each file into its malware family. The model should classify new, unseen files with high accuracy and provide probability estimates for each class.
- The model should have a high accuracy rate in order to be effective at detecting and preventing the spread of malware.
- The model should be able to process and classify large numbers of files quickly and efficiently, while operating within the constraints of available computing resources.
The objective of this machine learning project was to develop a model that could accurately classify assembly code and byte files into their respective malware families and predict the probability of each data point belonging to each of the nine classes.
- Class probabilities were needed.
- Errors in class probabilities should be penalized.
- Primary Metric: Log loss
- Secondary Metric: Confusion Matrix
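Concretely, the standard multiclass log loss over $N$ files and the nine classes is

$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{9} y_{ij}\,\log p_{ij},$$

where $y_{ij}$ is 1 if file $i$ belongs to class $j$ (and 0 otherwise) and $p_{ij}$ is the predicted probability of that class. A confident prediction on the wrong class drives the loss toward infinity, which is why errors in the class probabilities are penalized heavily.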
- Preprocessing: The data was preprocessed by converting the .asm files into image files and extracting features using unigram and bigram approaches on hexadecimal pairs, as well as by extracting command, header, and pixel features. Feature selection based on chi-squared, ANOVA F, and mutual information scores was applied to reduce the number of features from about 140,000 to roughly 1,900.
- Modeling: Several machine learning models were trained on the preprocessed data, including LightGBM. The models were tuned with Hyperopt (Bayesian optimization) to find the best hyperparameters; see the sketch after this list.
- Evaluation: The performance of the models was evaluated using log loss as the primary metric and the confusion matrix as the secondary metric.
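The tuning loop can be sketched as follows. This is a minimal illustration with a synthetic stand-in for the feature matrix and a hypothetical search space, not the exact configuration used in the project:

```python
# Minimal sketch of Bayesian hyperparameter tuning with Hyperopt + LightGBM;
# the synthetic data and search space are illustrative assumptions.
import lightgbm as lgb
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Toy stand-in for the real ~1,900-column feature matrix.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=40,
                           n_classes=9, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2,
                                                stratify=y, random_state=0)

def objective(params):
    # Hyperopt samples floats; LightGBM expects integers for these two.
    params = {**params,
              "num_leaves": int(params["num_leaves"]),
              "n_estimators": int(params["n_estimators"])}
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_cv)        # class probabilities, as required
    return {"loss": log_loss(y_cv, probs), "status": STATUS_OK}

space = {
    "num_leaves": hp.quniform("num_leaves", 31, 255, 1),
    "n_estimators": hp.quniform("n_estimators", 100, 500, 50),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=20, trials=trials)
print("best parameters:", best)
```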
The dataset consists of 200 GB of data, with 50 GB of .bytes files and 150 GB of .asm files. We use multiprocessing to extract the features from these files.
We use a custom feature extractor, parallelized with multiprocessing, to read through the 50 GB of .bytes data. We run unigram and bigram counts on the sequences of hexadecimal pairs, such as D8 4C 5B 67 85 C0 75 3D 68 9C 00 00 00 E8 3E 1C, where each hexadecimal pair is treated as a single word. In addition to the n-gram features, we also extract a file size feature by calculating the size of each .bytes file in megabytes.
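A minimal sketch of this extraction is below. The glob pattern and the decision to drop the leading address column are assumptions about the file layout, not the project's exact code:

```python
# Parallel hex-pair unigram/bigram counting plus a file-size feature.
import glob
import os
from collections import Counter
from multiprocessing import Pool

def extract_bytes_features(path):
    """Hex-pair unigram/bigram counts and size (MB) for one .bytes file."""
    unigrams, bigrams = Counter(), Counter()
    with open(path) as f:
        for line in f:
            tokens = line.split()[1:]   # first token is the address, not data
            unigrams.update(tokens)
            bigrams.update(" ".join(p) for p in zip(tokens, tokens[1:]))
    size_mb = os.path.getsize(path) / (1024 ** 2)
    return path, unigrams, bigrams, size_mb

if __name__ == "__main__":
    paths = glob.glob("train/*.bytes")   # hypothetical data directory
    with Pool() as pool:
        results = pool.map(extract_bytes_features, paths)
```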
We divided each line of an .asm file into five parts: header, memory address, data, instructions, and comments.
.text:675A13BE 68 8C 40 5B 67 push offset off_675B408C ; CODE XREF: sub_675A12B5+1Dp
We use regular expressions to identify and separate the following elements in each line:
- Header: .text
- Memory address: 675A13BE
- Data: 68 8C 40 5B 67
- Instructions: push offset off_675B408C
- Comments: ; CODE XREF: sub_675A12B5+1Dp
For every file, we extract the count of each header, data token, and instruction.
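A minimal sketch of the line parsing, assuming the layout shown in the example above; the regular expression is illustrative, not the exact one used in the project:

```python
# Split one .asm line into header, address, data, instruction, and comment.
import re

LINE_RE = re.compile(
    r"^(?P<header>\.\w+):"              # section header, e.g. .text
    r"(?P<address>[0-9A-F]+)\s+"        # memory address
    r"(?P<data>(?:[0-9A-F?]{2}\s+)+)"   # raw data bytes (including ?? bytes)
    r"(?P<instruction>[^;]*?)"          # instruction and operands
    r"\s*(?P<comment>;.*)?$"            # optional trailing comment
)

line = (".text:675A13BE 68 8C 40 5B 67 push offset off_675B408C "
        "; CODE XREF: sub_675A12B5+1Dp")
m = LINE_RE.match(line)
if m:
    print(m.groupdict())
```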
The main goal of the EDA was to identify important features, as there were around 140,000 features in the dataset. To do this, we used techniques such as t-SNE to visualize the data and identify patterns and clusters. We also used the `SelectKBest` function from sklearn to select the top-performing features based on criteria such as `chi2`, `f_classif`, and `mutual_info_classif`.
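A minimal sketch of the selection step, shown on toy data; note that `chi2` requires non-negative features, while `f_classif` and `mutual_info_classif` do not:

```python
# Univariate feature selection with sklearn's SelectKBest.
import numpy as np
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                       mutual_info_classif)

def top_features(X, y, score_func, k):
    """Reduce X to its k best columns and report which ones were kept."""
    selector = SelectKBest(score_func=score_func, k=k)
    X_best = selector.fit_transform(X, y)
    return X_best, selector.get_support(indices=True)

rng = np.random.default_rng(0)
X = rng.random((200, 1000))              # toy non-negative feature matrix
y = rng.integers(0, 9, size=200)
X_best, kept = top_features(X, y, chi2, k=50)
print(X_best.shape, kept[:10])
```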
For each feature set, including the bytes unigram bag of words, bytes bigram bag of words, assembly pixel intensity features, assembly unigram bag of words, assembly command/instruction features, and assembly header features, we plotted t-SNE to compare the original features with the best and remaining features. By using the minimum number of features that gave well-separated t-SNE plots, we were able to identify the most important features for each set.
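The before/after comparison can be sketched as follows, reusing `X`, `X_best`, and `y` from the selection sketch above:

```python
# Side-by-side t-SNE embeddings of the original and selected features.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (feats, title) in zip(axes, [(X, "original features"),
                                     (X_best, "selected features")]):
    emb = TSNE(n_components=2, perplexity=30,
               random_state=0).fit_transform(feats)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=8, cmap="tab10")
    ax.set_title(title)
plt.show()
```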
We then combined the top features from each set to create the final feature set (a minimal stacking sketch follows the list). The number of features selected from each set was as follows:
- Size features: 5
- Bytes unigram bag of words: 257
- Bytes bigram bag of words: 137
- Assembly header features: 44
- Assembly unigram bag of words: 256
- Assembly command/instruction features: 160
- Assembly pixel intensity: 1000
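A minimal sketch of stacking the selected blocks into the final matrix; the toy arrays stand in for the per-set selections listed above:

```python
# Horizontally stack the per-set selections into one feature matrix.
import numpy as np

n_files = 10                                 # toy number of files
widths = (5, 257, 137, 44, 256, 160, 1000)   # size, bytes uni/bi, headers,
                                             # asm uni, commands, pixels
blocks = [np.random.rand(n_files, k) for k in widths]
X_final = np.hstack(blocks)
print(X_final.shape)                         # (10, 1859)
```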
Overall, the efforts put into feature engineering and feature reduction were successful, with well-separated clusters forming in the two-dimensional t-SNE projections.
In order to establish a baseline for comparison, we evaluated the performance of a random model. A model that assigns each of the nine classes the uniform probability 1/9 has a log loss of -log(1/9) = ln 9 ≈ 2.2, so our machine learning models should score below 2.2 to be considered an improvement.
We evaluated the performance of several machine learning models for this classification task, including K Nearest Neighbors, Logistic Regression, Random Forest, XGBoost, and LightGBM.
Of all the models tested, LightGBM performed the best, with a log loss of 0.015, an accuracy of 99.88%, and micro and macro F1 scores close to 1.
The precision and recall scores for all classes were between 0.98 and 1.
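A minimal sketch of this final evaluation, assuming a fitted LightGBM classifier `model` and a held-out split (`X_test`, `y_test`) analogous to the ones in the tuning sketch earlier:

```python
# Score the final model on both the primary and secondary metrics.
from sklearn.metrics import classification_report, confusion_matrix, log_loss

probs = model.predict_proba(X_test)           # per-class probability estimates
preds = probs.argmax(axis=1)

print("log loss:", log_loss(y_test, probs))   # primary metric
print(confusion_matrix(y_test, preds))        # secondary metric
print(classification_report(y_test, preds, digits=3))  # per-class P/R/F1
```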