A collection of scripts for simple Linux ELF (x86/x64 only) malware analysis, including VirusShare processing, static analysis, and automated feature extraction from VirusTotal reports for subsequent data analytics applications.
Throughout the history of desktop and server malware, Microsoft Windows has been one of the most heavily attacked operating systems (OS). Several factors contributed to this, including the unobstructed installation of third-party software. Linux and other Unix-like OSes are considerably less susceptible to malware infections, yet successful malicious software for them does exist. The challenge is that few software tools are available to analyze Linux malware and process it automatically. This set of scripts performs static feature extraction and labels Linux malware by family and type based on input from VirusTotal.
We focus specifically on Linux PC malware (Intel 80386 and x86-64) and not on ARM / MIPS platforms, while still having plenty of samples for our experimentation. We developed the following methodology, which uses static analysis based on characteristics from native Linux tools and threat intelligence from the VirusTotal platform. To reproduce it, follow these phases:
- Phase 0: Acquire samples from VirusShare, the most comprehensive and best-known publicly available collection of Linux ELF malware samples in the information security community. Put them into the ELF/ folder. Import the attached ELF_structure.sql into the MySQL server.
- Phase 1: Filter out all files that are not compiled as ELF binaries, then extract raw information for every malware binary: `md5`, JSON `peframe` report, `readelf`, `file`, `strings`, file size and entropy. All information is stored in the MySQL database for easier subsequent access.
- Phase 2: Filter the ELF Linux malware samples that were compiled for either the Intel 80386 or the x86-64 platform, based on the extracted metadata. We specifically exclude all other binaries, such as ARM / MIPS, to obtain a better "ground truth" in experiments and unbiased results. Then extract the reports using the VirusTotal Private API.
- Phase 3: Perform feature extraction on all types of raw data extracted in the previous phases. The following categories of metadata and characteristics serve as the basis: `virustotal_file_report`, `peframe`, `readelf`, `strings`, `file_size` and `file_entropy`.
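The raw attributes collected in Phase 1 and the architecture filter of Phase 2 can be approximated in pure Python. The sketch below is an illustration under assumptions, not the repository's code: the function names are hypothetical, the entropy is computed directly instead of calling the `ent` tool, and the architecture is read from the `e_machine` field of the ELF header instead of parsing `file` or `readelf` output.

```python
import hashlib
import math
import struct
from collections import Counter

# e_machine values defined by the ELF specification
EM_386 = 3       # Intel 80386
EM_X86_64 = 62   # AMD x86-64


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the content in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def elf_machine(data: bytes):
    """Return the e_machine value, or None if the ELF header is not valid."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        return None
    # EI_DATA (byte 5 of e_ident): 1 = little-endian, 2 = big-endian
    if data[5] == 1:
        endian = "<"
    elif data[5] == 2:
        endian = ">"
    else:
        return None
    # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64
    (machine,) = struct.unpack_from(endian + "H", data, 18)
    return machine


def raw_features(data: bytes) -> dict:
    """Hypothetical helper: a few Phase 1 attributes plus the Phase 2 filter flag."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "file_size": len(data),
        "file_entropy": shannon_entropy(data),
        "is_x86": elf_machine(data) in (EM_386, EM_X86_64),
    }
```

Calling `raw_features(open(path, "rb").read())` on a sample yields a dictionary that could be written to a database row; samples where `is_x86` is false would be skipped before querying VirusTotal.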
The following ELF collections from VirusShare were used as input:
- VirusShare_ELF_20140617.zip - 2,778 files
- VirusShare_ELF_20190212.zip - 10,426 files
- VirusShare_ELF_20200405.zip - 43,553 files
- VirusShare_Linux_20160715.zip - 9,469 files
The output consists of 10,574 MySQL entries, each corresponding to a malware file labelled with a family and a type. Overall, 30 features were extracted.
FEATURES:
- `vt_submission_names` - Number of submission names
- `vt_times_submitted` - Times the binary was submitted
- `vt_exif` - Number of entries in exiftool
- `vt_embedded_ips` - Number of embedded IPs in the binary
- `vt_contacted_ips` - Number of IPs the binary contacted
- `vt_exports` - Number of export functions
- `vt_imports` - Number of import functions
- `vt_shared_libraries` - Number of shared libraries included
- `vt_segments` - Number of segments
- `vt_sections` - Number of sections
- `vt_packers` - Number of packers
- `vt_tags` - Number of tags
- `vt_positives` - Number of AV engines that identified the file as malicious
- `peframe_ip` - Number of identified IP addresses
- `peframe_url` - Number of identified URLs
- `readelf_entry_address` - Entry point address
- `readelf_start_prog_headers` - Start of program headers
- `readelf_start_sec_headers` - Start of section headers
- `readelf_number_flags` - Number of flags
- `readelf_header_size` - Size of this header
- `readelf_size_prog_headers` - Size of program headers
- `readelf_number_prog_headers` - Number of program headers
- `readelf_size_sec_headers` - Size of section headers
- `readelf_number_section_headers` - Number of section headers
- `readelf_sec_header_string_table_index` - Section header string table index
- `strings_number` - Number of distinct strings
- `strings_size` - Size of all strings extracted from the file
- `strings_avg` - Average size of each string
- `file_size` - Size of the file
- `file_entropy` - Entropy of the whole file content
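As an illustration of how the three string-based features relate to each other, the following minimal sketch recomputes them in Python. It rests on assumptions: printable strings are found with a regular expression that only approximates the `strings` tool (ASCII runs of at least 4 characters), the function names are hypothetical, and the average is taken over all extracted strings while the count covers distinct ones. The actual scripts may define these differently.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Printable ASCII runs of at least min_len bytes (approximation of `strings`)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def strings_features(data: bytes, min_len: int = 4) -> dict:
    """Hypothetical helper computing the three strings_* features."""
    found = extract_strings(data, min_len)
    total_size = sum(len(s) for s in found)
    return {
        # Assumption: duplicates are counted once
        "strings_number": len(set(found)),
        # Assumption: sizes are summed over every extracted string
        "strings_size": total_size,
        # Assumption: average over every extracted string, 0.0 if none
        "strings_avg": total_size / len(found) if found else 0.0,
    }
```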
LABELS:
- `type`
- `family`
The resulting data can be used as input to any machine learning method, such as Deep Neural Networks.
- MySQL Server 5.7
- Python 3.7 + modules: mysql, mysql-connector, os, hashlib, json, subprocess, timeit, requests, re
- Linux software: readelf, file, strings, ent
- Peframe 6.0.3
You will need VirusTotal premium access to extract data in a timely manner. Insert your private API key into the phase2.py and phase4.py files.
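Once a report has been downloaded with the private key, the count-style `vt_*` features reduce to measuring the size of fields in the report. The sketch below is only an illustration: the report is treated as an already parsed JSON dictionary, the helper names are hypothetical, and the key names (`submission_names`, `times_submitted`, `tags`, `positives`) are assumptions that should be checked against a real report before use.

```python
def count_of(value) -> int:
    """Number of items in a list or dict field; 0 when the field is missing."""
    if isinstance(value, (list, dict, tuple, set)):
        return len(value)
    return 0


def vt_count_features(report: dict) -> dict:
    """Hypothetical helper mapping a parsed report to a few vt_* features."""
    return {
        # Key names below are assumed, not taken from the repository
        "vt_submission_names": count_of(report.get("submission_names")),
        "vt_times_submitted": int(report.get("times_submitted") or 0),
        "vt_tags": count_of(report.get("tags")),
        "vt_positives": int(report.get("positives") or 0),
    }
```

Missing fields default to zero here so that every sample produces a complete feature row; whether zero is the right placeholder for an absent field is a modelling choice.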
You can find more information about the practical experiments and datasets in the following conference paper:
@INPROCEEDINGS{shalaginov2020elf,
title={A novel study on multinomial classification of x86/x64 Linux ELF malware types and families through Deep Neural Networks},
author={Shalaginov, Andrii and Øverlier, Lasse},
booktitle={Malware Analysis using Artificial Intelligence and Deep Learning},
year={2020}
}