This project is part of Google Summer of Code 2018.
The goals of this project are to
- implement a learning model to predict the labels of each malware sample
- discover relationships between different malware samples
- visualize those relationships in the frontend
- and build an analytic pipeline that integrates the implemented services.
- Generate the gRPC client and server interfaces for feed handling and TensorFlow learning (a short Python usage sketch follows these build steps)
$ cd src
$ cd feedhandling
$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. feed_handling.proto
$ cd ..
$ cd tflearning
$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. tf_learning.proto
$ cd ..
- Generate the gRPC client interface for the frontend service
$ cd frontend
$ mkdir grpc-js
$ protoc -I=../feedhandling/ --js_out=import_style=closure,binary:./grpc-js
../feedhandling/feed_handling.proto
$ protoc -I=./grpc-web/third_party/protobuf/src/google/protobuf/ \
--js_out=import_style=closure,binary:./grpc-js \
./grpc-web/third_party/protobuf/src/google/protobuf/any.proto
$ protoc -I=./grpc-web/net/grpc/gateway/protos/ \
--js_out=import_style=closure,binary:./grpc-js \
./grpc-web/net/grpc/gateway/protos/stream_body.proto
$ protoc -I=./grpc-web/net/grpc/gateway/protos/ \
--js_out=import_style=closure,binary:./grpc-js \
./grpc-web/net/grpc/gateway/protos/pair.proto
- Generate the gRPC-Web protoc plugin and the client stub service file (feedhandling.grpc.pb.js)
$ cd grpc-web/javascript/net/grpc/web
$ make
$ cd -
$ protoc -I=. --plugin=protoc-gen-grpc-web=<path to>/protoc-gen-grpc-web \
--grpc-web_out=out=feedhandling.grpc.pb.js,mode=grpcweb:. \
../feedhandling/feed_handling.proto
- Compile all the relevant JS files into a single JS library that can be used in the browser
$ java \
-jar <path to>/closure-compiler.jar \
--js ./grpc-web/javascript \
--js ./grpc-web/net \
--js ./grpc-web/third_party/closure-library \
--js ./grpc-web/third_party/protobuf/js \
--js ./grpc-js \
--entry_point=goog:proto.feedhandling.FeedHandlingClient \
--dependency_mode=STRICT \
--js_output_file ./grpc-js/compiled.js
- Compile the required modules into Nginx so that gRPC-Web requests can be interpreted and proxied to the backend gRPC server
$ cd grpc-web
$ make package
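As a quick check that the Python interfaces generated in the first step work, they can be imported and used to create a client stub. This is a minimal sketch, assuming the service defined in feed_handling.proto is named FeedHandling (inferred from the JS client class proto.feedhandling.FeedHandlingClient above) and that the server listens on the port used in the demo section below:

import grpc
import feed_handling_pb2_grpc

# Plaintext channel to the feed handling server (assumption: localhost:9090,
# the port used by the demo invocation later in this document).
channel = grpc.insecure_channel('localhost:9090')

# grpc_tools generates a <ServiceName>Stub class; FeedHandlingStub assumes
# the service in feed_handling.proto is named FeedHandling.
stub = feed_handling_pb2_grpc.FeedHandlingStub(channel)
# RPC methods on the stub mirror the rpc definitions in the .proto file.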
The programs in src/preprocessing are responsible for data preprocessing. Please run them in the following order:
preprocess_SERVICE_NAME.scala
preprocess_SERVICE_NAME.py
After preprocessing, the data will be stored in the database for further use.
Currently there are four supported services: Cuckoo, Objdump, PEinfo, Rich header.
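A minimal driver sketch of this ordering is shown below. The lowercase script names derived from the service list (e.g. preprocess_cuckoo.scala) and the use of the plain scala runner for the Scala step are assumptions; only the Scala-before-Python order comes from the description above:

import subprocess

# Assumed lowercase spellings of the four supported services.
SERVICES = ['cuckoo', 'objdump', 'peinfo', 'rich_header']

for service in SERVICES:
    # Scala step first, then the Python step, per the required order.
    subprocess.run(['scala', f'preprocess_{service}.scala'], check=True)
    subprocess.run(['python', f'preprocess_{service}.py'], check=True)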
With the preprocessed data from the previous step, we can use src/tflearning/NN.py
to train the learning model.
from tflearning.NN import NN

nn_instance = NN(PREPROCESSED_DATA_PATH, labels_length=29)
nn_instance.build()

# Iterate over the train/test folds produced by split_train_test.
skf = nn_instance.split_train_test(3, 0)
for train_index, test_index in skf:
    nn_instance.prepare_data(train_index, test_index)
    nn_instance.train()
    nn_instance.test()

# Persist the trained model once training is complete.
nn_instance.save()
The learning model can also be retrained with the following script:
nn_instance = NN(PREPROCESSED_DATA_PATH, labels_length=29)
nn_instance.restore()

skf = nn_instance.split_train_test(3, 0)
for train_index, test_index in skf:
    nn_instance.prepare_data(train_index, test_index)
    nn_instance.retrain()
To discover the relationships between different malware samples, we build a KD-tree using src/relationship/FeatureTree.py:
$ python relationship/FeatureTree.py
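To illustrate the idea behind the feature tree, the sketch below builds a KD-tree over toy feature vectors and queries the nearest neighbours of one sample; nearby samples in feature space are candidates for related malware. It uses scipy's cKDTree as a stand-in and is not the project's actual FeatureTree implementation:

import numpy as np
from scipy.spatial import cKDTree

# Toy feature matrix: one row of numeric features per malware sample.
features = np.random.rand(100, 16)

# Build the KD-tree over all samples.
tree = cKDTree(features)

# The 5 nearest neighbours of sample 0 are its closest relatives
# in feature space.
distances, indices = tree.query(features[0], k=5)
print(indices, distances)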
The feed handling server (fh_server.py) is configured as follows:
usage: python fh_server.py [-h] [-v] [-p PORT] [--tfl-addr TFL_ADDR]
[--cluster-ip [CLUSTER_IP [CLUSTER_IP ...]]]
[--cluster-port CLUSTER_PORT]
[--auth-username AUTH_USERNAME]
[--auth-password AUTH_PASSWORD] [--offline]
Feed handling server
optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbose mode
-p PORT, --port PORT Listening port for feed handling server
--tfl-addr TFL_ADDR Address of tensorflow learning server
--cluster-ip [CLUSTER_IP [CLUSTER_IP ...]]
IPs of clusters
--cluster-port CLUSTER_PORT
Port of clusters
--auth-username AUTH_USERNAME
Username for clusters' authentication
--auth-password AUTH_PASSWORD
Password for clusters' authentication
--offline Offline mode
The TensorFlow learning server (tfl_server.py) is configured as follows:
usage: python tfl_server.py [-h] [-v] [-p PORT] [--fh-addr FH_ADDR]
[--model-path MODEL_PATH] [--offline]
Tensorflow learning server
optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbose mode
-p PORT, --port PORT Listening port for tensorflow learning server
--fh-addr FH_ADDR Address of feed handling server
--model-path MODEL_PATH
Location of the learning model
--offline Offline mode
Before running Nginx, the absolute path to src/frontend should be provided in nginx.conf:
...
server {
    listen 8888;
    server_name localhost;
    location / {
        root <path-to>/frontend;
        include /etc/nginx/mime.types;
    }
...
$ cp src/frontend/nginx.conf src/frontend/grpc-web/gConnector/conf
$ cd src/frontend/grpc-web/gConnector && ./nginx.sh &
Afterwards, view the visualization of the relationships at http://localhost:9090/index.html.
- Copy all the necessary sample data into src/relationship
$ cp tests/*.p src/relationship
- Run the feed handling and TensorFlow learning servers (a connectivity check sketch follows this list)
$ python fh_server.py -v -p 9090 --tfl-addr localhost:9091 --offline
$ python tfl_server.py -v -p 9091 --fh-addr localhost:9090 --offline
- Run the Nginx service
$ cp src/frontend/nginx.conf src/frontend/grpc-web/gConnector/conf
$ cd src/frontend/grpc-web/gConnector && ./nginx.sh &
- Explore the relationships of the sample data at http://localhost:9090/index.html.
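To verify that both demo servers are reachable before opening the frontend, a small connectivity check can be run. A minimal sketch, assuming plain insecure channels on the demo ports above:

import grpc

# Ports from the demo invocations above:
# feed handling on 9090, TensorFlow learning on 9091.
for port in (9090, 9091):
    channel = grpc.insecure_channel(f'localhost:{port}')
    try:
        # Block until the channel is ready or the timeout expires.
        grpc.channel_ready_future(channel).result(timeout=5)
        print(f'port {port}: gRPC server reachable')
    except grpc.FutureTimeoutError:
        print(f'port {port}: no gRPC server reachable')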