This project detects and classifies malware using Graph Neural Networks (GNNs). The model analyzes the API call graph of a program to classify it as either malicious or benign. The approach applies to both Android packages (APKs) and Windows Portable Executable (PE) files and is designed to be resilient against common obfuscation techniques.
Traditional signature-based malware detection methods can be easily bypassed by slightly modifying the malware's code. This project overcomes that limitation by focusing on the underlying behavior of the software. By representing an application as a graph of its API calls, the GNN model learns to identify suspicious patterns and relationships that are characteristic of malicious activity.
Graph-Based Detection: Each application is converted into a function call graph, allowing the GNN to analyze its call structure and identify malicious patterns.
Obfuscation Resilience: Because the system focuses on core API call patterns rather than specific code signatures, it is designed to stay robust against evasive malware and common obfuscation techniques that alter a sample's code without changing its behavior.
Model Explainability: The project implements edge pruning on the call graph to identify and rank the most critical API calls that contribute to a malware classification. This provides valuable insights into the model's decision-making process, making it more transparent and trustworthy.
Cross-Platform: The methodology is applicable to both Android (APKs) and Windows Portable Executable (PE) files.
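The edge-pruning idea behind the explainability feature can be illustrated with a leave-one-edge-out sketch: remove each call edge in turn, re-score the graph, and rank edges by how much the malicious score drops. This is only one possible reading of edge pruning, not necessarily the project's method. The `toy_score` function and its integer weights are hypothetical stand-ins for the trained GNN, and the example edges are invented for illustration.

```python
from typing import Callable, List, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def rank_edges_by_pruning(
    edges: List[Edge], score: Callable[[List[Edge]], float]
) -> List[Tuple[Edge, float]]:
    """Rank edges by the score drop observed when each one is removed."""
    base = score(edges)
    drops = []
    for i, edge in enumerate(edges):
        pruned = edges[:i] + edges[i + 1:]  # graph without this edge
        drops.append((edge, base - score(pruned)))
    # Largest drop first: these edges matter most to the prediction.
    return sorted(drops, key=lambda item: item[1], reverse=True)


# Hypothetical stand-in for a trained model: made-up weights per called API.
SUSPICIOUS_WEIGHTS = {"WriteProcessMemory": 5, "CreateRemoteThread": 4}


def toy_score(edges: List[Edge]) -> float:
    callees = {callee for _, callee in edges}
    return sum(w for api, w in SUSPICIOUS_WEIGHTS.items() if api in callees)


edges = [
    ("main", "OpenProcess"),
    ("main", "WriteProcessMemory"),
    ("inject", "CreateRemoteThread"),
]
ranking = rank_edges_by_pruning(edges, toy_score)
for edge, drop in ranking:
    print(edge, drop)
```

With a real model, `score` would run a forward pass on the pruned graph. The leave-one-out loop costs one forward pass per edge, so larger graphs would likely need batching or a learned edge mask instead.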
Core Framework: PyTorch
Model Architecture: Graph Neural Networks (GNNs)
Embeddings: Skip-gram for learning representations of API calls.
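As a rough illustration of the skip-gram step, the sketch below generates the (center, context) training pairs that a skip-gram model learns from. It assumes API calls are available as ordered sequences (for example, from walks over the call graph), which the project description does not specify. The function name `skipgram_pairs`, the `window` parameter, and the sample call names are hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(sequence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every call within `window` positions."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a call is not its own context
                pairs.append((center, sequence[j]))
    return pairs


# Invented API call sequence, used only for illustration.
calls = ["open", "read", "send", "close"]
pairs = skipgram_pairs(calls, window=1)
print(pairs)
```

Each pair would then serve as one training example for an embedding model that predicts the context call from the center call, so that API calls used in similar contexts end up with similar vectors.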
Graph Extraction: The first step is to statically analyze the executable (APK or PE file) and extract its function call graph. Nodes in the graph represent functions, and edges represent calls between them.
Graph Representation: The extracted graph is then processed and converted into a format suitable for the GNN (for example, per-node feature vectors plus an edge list).
Model Training: The GNN model is trained on a labeled dataset of benign and malicious software samples to learn the patterns associated with malware.
Classification and Explanation: Once trained, the model can classify new, unseen software and use techniques like edge pruning to explain its predictions.
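The representation and model steps above can be sketched in plain Python. The example maps function names to integer ids, builds an edge list, runs one round of mean-aggregation message passing, and pools node vectors into a single graph vector. It is a minimal sketch under stated assumptions, not the project's implementation: the helper names (`build_graph`, `propagate`, `readout`), the two-dimensional features, and the sample calls are all hypothetical, and a real model would use learned weights and PyTorch tensors.

```python
from typing import Dict, List, Tuple


def build_graph(calls: List[Tuple[str, str]]) -> Tuple[Dict[str, int], List[Tuple[int, int]]]:
    """Assign an integer id to each function and express calls as id pairs."""
    index: Dict[str, int] = {}
    edges: List[Tuple[int, int]] = []
    for caller, callee in calls:
        for name in (caller, callee):
            if name not in index:
                index[name] = len(index)
        edges.append((index[caller], index[callee]))
    return index, edges


def propagate(features: List[List[float]], edges: List[Tuple[int, int]]) -> List[List[float]]:
    """One message-passing round: each node averages its own vector
    with the vectors of the nodes that call it."""
    dim = len(features[0])
    sums = [list(vec) for vec in features]  # start from the node's own vector
    counts = [1] * len(features)
    for src, dst in edges:
        for k in range(dim):
            sums[dst][k] += features[src][k]
        counts[dst] += 1
    return [[v / counts[i] for v in sums[i]] for i in range(len(features))]


def readout(features: List[List[float]]) -> List[float]:
    """Mean-pool node vectors into one graph-level vector."""
    n = len(features)
    return [sum(vec[k] for vec in features) / n for k in range(len(features[0]))]


# Invented call pairs and node features, used only for illustration.
calls = [("main", "helper"), ("main", "send"), ("helper", "send")]
index, edges = build_graph(calls)
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # one row per node id
hidden = propagate(features, edges)
graph_vector = readout(hidden)
print(index, edges, hidden, graph_vector)
```

In a trained model, the graph-level vector would feed a classifier that outputs the benign or malicious label. Stacking several propagation rounds lets each node see functions more than one call away.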