Install Poetry
pip install poetry
Install dependencies
poetry install
Enter Poetry Shell
poetry shell
Before running scripts in the GraphIsomorphismNetwork/JAX-GIN branch, run unset LD_LIBRARY_PATH
to ensure that the JAX library can properly use CUDA devices.
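As a quick sanity check before launching a JAX script, the environment can be inspected from Python. This is a minimal sketch and not part of the repository; the helper name `ld_library_path_is_clear` is hypothetical. The variable should still be unset in the shell before Python starts, since removing it from `os.environ` inside an already-running process generally does not change how that process resolves shared libraries.

```python
import os


def ld_library_path_is_clear(env=None):
    """Return True if LD_LIBRARY_PATH is unset or empty in the given environment.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return not env.get("LD_LIBRARY_PATH")


if __name__ == "__main__":
    if ld_library_path_is_clear():
        print("LD_LIBRARY_PATH is not set; OK to start JAX scripts.")
    else:
        print("LD_LIBRARY_PATH is set; run `unset LD_LIBRARY_PATH` in your shell first.")
```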
To generate Control Flow Graphs (CFGs) from binary files, use the cfg_creator.py
script. This script analyzes binary files and creates CFGs, which can be visualized or saved in different formats.
Run the script from the root directory using the following command:
python src/cfg_constructor/cfg_creator.py --data-dir <path_to_binary_files> --vis-mode <visualization_mode> --job-id <job_id> --id-list <fname.txt>
Args:
- --data-dir: Path to the directory containing the binary files in the parent directory. (str, default='data')
- --vis-mode: Visualization mode. 0 = visualize in a window, 1 = save as HTML docs, 2 = save graphs without visualizing, as edgelists and CSV files for node values. (int, default=2)
- --job-id: Integer job ID, used for logging and for avoiding reprocessing of already processed data based on job_id, vis_mode, and data dir. (int, default=0)
- --id-list: [OPTIONAL] .txt file with a list of IDs to constrain processing to. (str, default='', meaning no file input; if a file is given, it must be a valid .txt file listing a single ID per line)
  - Note: if using the --id-list option, reference src/cfg_constructor/demo_file_id.txt for how your input file should be formatted. If not, the tool will automatically create graphs for all files in the specified data directory.
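The flags described above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the descriptions in this section, not the actual contents of cfg_creator.py: the function names `build_parser` and `read_id_list` are hypothetical, and restricting `--vis-mode` to the values 0, 1, and 2 is an assumption.

```python
import argparse


def build_parser():
    """Declare the four documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(
        description="Build CFGs from binary files (illustrative sketch)."
    )
    parser.add_argument("--data-dir", type=str, default="data",
                        help="directory containing the binary files")
    # Assumption: only the three documented modes are accepted.
    parser.add_argument("--vis-mode", type=int, default=2, choices=[0, 1, 2],
                        help="0 = window, 1 = HTML, 2 = edgelists + CSV")
    parser.add_argument("--job-id", type=int, default=0,
                        help="job ID used for logging and resuming")
    parser.add_argument("--id-list", type=str, default="",
                        help="optional .txt file with one ID per line")
    return parser


def read_id_list(path):
    """Return the IDs in `path` (one per line), or None if no file was given."""
    if not path:
        return None  # no filter: process every file in the data directory
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args()
    ids = read_id_list(args.id_list)
    print(args.data_dir, args.vis_mode, args.job_id, ids)
```

Returning `None` rather than an empty list when no file is given keeps "no filter" distinct from "a filter that matches nothing".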
Where output is stored depends on the vis_mode:
- vis-mode=0: no saved output; graphs are displayed in a GUI
- vis-mode=1: HTML files in src/cfg_constructor/out/out_html
- vis-mode=2: CSV files in src/cfg_constructor/out/out_adjacency_matrices
Example usage from my machine:
(malware-analysis-py3.12) me@mac Malware-Analysis %
python src/cfg_constructor/cfg_creator.py --data-dir data --vis-mode 2 --job-id 0
This runs the constructor from the root directory of the repository on a data
dir (also in the root directory), processes each file with mode 2
(saving adjacency lists to the directory specified above), and assigns job_id 0
for logging (e.g. if the program crashes, the run can easily be resumed).
If a log file for the given job already exists, the script will load it and silently skip any files that it marks as processed.
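The resume behavior described above can be sketched as follows. This is an illustration under assumptions, not the script's actual logging code: the JSON log format, the file-naming scheme in `log_path`, and the function names are all hypothetical.

```python
import json
import os


def log_path(log_dir, job_id, vis_mode, data_dir):
    """Hypothetical naming scheme: one log per (job_id, vis_mode, data_dir)."""
    safe_dir = data_dir.strip("/").replace("/", "_") or "root"
    return os.path.join(log_dir, f"job{job_id}_vis{vis_mode}_{safe_dir}.json")


def load_processed(path):
    """Return the set of file names already recorded in the log, if any."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def run(files, path, process):
    """Process each file once, skipping those the log marks as done."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    processed = load_processed(path)
    for name in files:
        if name in processed:
            continue  # silently skip files handled by an earlier run
        process(name)
        processed.add(name)
        # Persist after every file so a crash loses at most the file in progress.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sorted(processed), f)
    return processed
```

Writing the log after each file trades some I/O for the ability to resume from the exact point of failure.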
Use static analysis (deconstruction of binaries without execution) to extract Control Flow Graphs from a binary.
Leverage Graph Neural Networks trained on these CFGs to classify an arbitrary binary as malicious or benign. We aim to primarily use a dataset of 200k+ Windows PE binaries linked here
Produce a pipeline capable of performing deconstruction and inference quickly.
Feature-based models (e.g. XGBoost, a tree model; YARA rules, condition matching) can run in <1s, and NLP tools (e.g. the n-gram analysis of the Kilogram paper) can also run fairly fast.
Our hypothesis is that GNNs can capture more complex characteristics of malicious binaries via their CFGs, and that by training a large model and compressing it into a smaller downstream one, we can match the accuracy of feature-based approaches with an inference time fairly close to theirs.
TBD: more research is needed here. The candidate techniques listed so far require more insight.
- Distillation (teaching a smaller downstream model to reproduce the behavior of the large model; well known from DistilBERT)
- Quantization (possibly with quantization-aware training to facilitate this approach)
- Pruning (removing weights judged irrelevant by some criterion)
- Theoretical optimization based on the BERT-of-Theseus paper
  - The paper centers on replacing large BERT modules with small modules during training, so that the small modules learn to mimic the behavior of the large ones in the network
  - Depends on the GNN architecture, but could this be applied here? Could we have an optimization method where modules with 50% or 75% fewer parameters than the large-GNN modules are randomly inserted into a trained network and trained to mimic the role of the large modules (similar to distillation, but incorporated into the network)?
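The module-replacement idea in the last bullet can be sketched without any ML framework. This is a toy illustration under assumptions, not a GNN implementation: the "modules" are plain Python callables, and the names `theseus_forward`, `large_modules`, `small_modules`, and `replace_prob` are hypothetical. During training, each large module is swapped for its smaller counterpart with some probability on every forward pass; outside training, only the small modules are used.

```python
import random


def theseus_forward(x, large_modules, small_modules, replace_prob,
                    rng=random, training=True):
    """Run x through the network, swapping in small modules at random.

    In training, each position uses the small module with probability
    `replace_prob` and the large module otherwise. Outside training,
    only the small modules are used. Returns the output and a list of
    booleans recording which positions used the small module.
    """
    if len(large_modules) != len(small_modules):
        raise ValueError("each large module needs a small counterpart")
    used_small = []
    for large, small in zip(large_modules, small_modules):
        pick_small = (not training) or (rng.random() < replace_prob)
        x = small(x) if pick_small else large(x)
        used_small.append(pick_small)
    return x, used_small


if __name__ == "__main__":
    # Stand-in "layers": the small ones only approximate the large ones.
    large = [lambda v: v + 1.0, lambda v: v * 2.0]
    small = [lambda v: v + 0.9, lambda v: v * 2.1]
    out, mask = theseus_forward(1.0, large, small, replace_prob=0.5,
                                rng=random.Random(0))
    print(out, mask)
```

In a real setup the large modules would be frozen and only the small ones updated by the task loss, so that each small module is trained in the context of the surrounding large network; whether this transfers to a given GNN architecture is the open question raised above.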