This project is part of Google Summer of Code 2018.
The goals of this project are to
- implement a learning model to predict the labels of each malware sample
- discover relationships between different malware samples
- visualize those relationships in the frontend
- and build an analytic pipeline that integrates the implemented services.
- Generate the gRPC client and server interfaces for feed handling and TensorFlow learning (a short Python usage sketch follows these build steps)
$ cd src
$ cd feedhandling
$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. feed_handling.proto
$ cd ..
$ cd tflearning
$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. tf_learning.proto
$ cd ..
- Generate the gRPC client interface for the frontend service
$ cd frontend
$ mkdir grpc-js
$ protoc -I=../feedhandling/ --js_out=import_style=closure,binary:./grpc-js
../feedhandling/feed_handling.proto
$ protoc -I=./grpc-web/third_party/protobuf/src/google/protobuf/ \
--js_out=import_style=closure,binary:./grpc-js \
./grpc-web/third_party/protobuf/src/google/protobuf/any.proto
$ protoc -I=./grpc-web/net/grpc/gateway/protos/ \
--js_out=import_style=closure,binary:./grpc-js \
./grpc-web/net/grpc/gateway/protos/stream_body.proto
$ protoc -I=./grpc-web/net/grpc/gateway/protos/ \
--js_out=import_style=closure,binary:./grpc-js \
./grpc-web/net/grpc/gateway/protos/pair.proto
- Generate the gRPC-Web protoc plugin and the client stub service file (feedhandling.grpc.pb.js)
$ cd grpc-web/javascript/net/grpc/web
$ make
$ cd -
$ protoc -I=. --plugin=protoc-gen-grpc-web=<path to>/protoc-gen-grpc-web \
--grpc-web_out=out=feedhandling.grpc.pb.js,mode=grpcweb:. \
../feedhandling/feed_handling.proto
- Compile all the relevant JS files into a single JS library that can be used in the browser
$ java \
-jar <path to>/closure-compiler.jar \
--js ./grpc-web/javascript \
--js ./grpc-web/net \
--js ./grpc-web/third_party/closure-library \
--js ./grpc-web/third_party/protobuf/js \
--js ./grpc-js \
--entry_point=goog:proto.feedhandling.FeedHandlingClient \
--dependency_mode=STRICT \
--js_output_file ./grpc-js/compiled.js
- Compile the required modules into Nginx so that gRPC-Web requests can be interpreted and proxied to the backend gRPC server
$ cd grpc-web
$ make package
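As a quick check that the Python interfaces generated in the first step work, they can be imported and used to create a client stub. This is a minimal sketch, assuming the service defined in feed_handling.proto is named FeedHandling (inferred from the JS client class proto.feedhandling.FeedHandlingClient above) and that the server listens on the port used in the demo section below:

import grpc
import feed_handling_pb2_grpc

# Plaintext channel to the feed handling server (assumption: localhost:9090,
# the port used by the demo invocation later in this document).
channel = grpc.insecure_channel('localhost:9090')

# grpc_tools generates a <ServiceName>Stub class; FeedHandlingStub assumes
# the service in feed_handling.proto is named FeedHandling.
stub = feed_handling_pb2_grpc.FeedHandlingStub(channel)
# RPC methods on the stub mirror the rpc definitions in the .proto file.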
The programs in src/preprocessing are responsible for data preprocessing. Please run them in the following order:
preprocess_SERVICE_NAME.scala
preprocess_SERVICE_NAME.py
After preprocessing, the data will be stored in the database for further use.
Currently there are four supported services: Cuckoo, Objdump, PEinfo, Rich header.
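A minimal driver sketch of this ordering is shown below. The lowercase script names derived from the service list (e.g. preprocess_cuckoo.scala) and the use of the plain scala runner for the Scala step are assumptions; only the Scala-before-Python order comes from the description above:

import subprocess

# Assumed lowercase spellings of the four supported services.
SERVICES = ['cuckoo', 'objdump', 'peinfo', 'rich_header']

for service in SERVICES:
    # Scala step first, then the Python step, per the required order.
    subprocess.run(['scala', f'preprocess_{service}.scala'], check=True)
    subprocess.run(['python', f'preprocess_{service}.py'], check=True)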
With the preprocessed data from the previous step, we can use src/tflearning/NN.py
to train the learning model.
from tflearning.NN import NN

nn_instance = NN(PREPROCESSED_DATA_PATH, labels_length=29)
nn_instance.build()

# Iterate over the train/test folds produced by split_train_test.
skf = nn_instance.split_train_test(3, 0)
for train_index, test_index in skf:
    nn_instance.prepare_data(train_index, test_index)
    nn_instance.train()
    nn_instance.test()

# Persist the trained model once training is complete.
nn_instance.save()
The learning model can also be retrained with the following script:
nn_instance = NN(PREPROCESSED_DATA_PATH, labels_length=29)
nn_instance.restore()

skf = nn_instance.split_train_test(3, 0)
for train_index, test_index in skf:
    nn_instance.prepare_data(train_index, test_index)
    nn_instance.retrain()
To discover the relationships between different malware samples, we build a KD-tree using src/relationship/FeatureTree.py:
$ python relationship/FeatureTree.py
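To illustrate the idea behind the feature tree, the sketch below builds a KD-tree over toy feature vectors and queries the nearest neighbours of one sample; nearby samples in feature space are candidates for related malware. It uses scipy's cKDTree as a stand-in and is not the project's actual FeatureTree implementation:

import numpy as np
from scipy.spatial import cKDTree

# Toy feature matrix: one row of numeric features per malware sample.
features = np.random.rand(100, 16)

# Build the KD-tree over all samples.
tree = cKDTree(features)

# The 5 nearest neighbours of sample 0 are its closest relatives
# in feature space.
distances, indices = tree.query(features[0], k=5)
print(indices, distances)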
The feed handling server (fh_server.py) is configured as follows:
usage: python fh_server.py [-h] [-v] [-p PORT] [--tfl-addr TFL_ADDR]
[--cluster-ip [CLUSTER_IP [CLUSTER_IP ...]]]
[--cluster-port CLUSTER_PORT]
[--auth-username AUTH_USERNAME]
[--auth-password AUTH_PASSWORD] [--offline]
Feed handling server
optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbose mode
-p PORT, --port PORT Listening port for feed handling server
--tfl-addr TFL_ADDR Address of tensorflow learning server
--cluster-ip [CLUSTER_IP [CLUSTER_IP ...]]
IPs of clusters
--cluster-port CLUSTER_PORT
Port of clusters
--auth-username AUTH_USERNAME
Username for clusters' authentication
--auth-password AUTH_PASSWORD
Password for clusters' authentication
--offline Offline mode
The TensorFlow learning server (tfl_server.py) is configured as follows:
usage: python tfl_server.py [-h] [-v] [-p PORT] [--fh-addr FH_ADDR]
[--model-path MODEL_PATH] [--offline]
Tensorflow learning server
optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbose mode
-p PORT, --port PORT Listening port for tensorflow learning server
--fh-addr FH_ADDR Address of feed handling server
--model-path MODEL_PATH
Location of the learning model
--offline Offline mode
Before running Nginx, the absolute path to src/frontend should be provided in nginx.conf:
...
server {
    listen 8888;
    server_name localhost;
    location / {
        root <path-to>/frontend;
        include /etc/nginx/mime.types;
    }
...
$ cp src/frontend/nginx.conf src/frontend/grpc-web/gConnector/conf
$ cd src/frontend/grpc-web/gConnector && ./nginx.sh &
Afterwards, view the visualization of the relationships at http://localhost:9090/index.html.
- Copy all the necessary sample data into src/relationship
$ cp tests/*.p src/relationship
- Run the feed handling and TensorFlow learning servers (a connectivity check sketch follows this list)
$ python fh_server.py -v -p 9090 --tfl-addr localhost:9091 --offline
$ python tfl_server.py -v -p 9091 --fh-addr localhost:9090 --offline
- Run the Nginx service
$ cp src/frontend/nginx.conf src/frontend/grpc-web/gConnector/conf
$ cd src/frontend/grpc-web/gConnector && ./nginx.sh &
- Explore the relationships of the sample data at http://localhost:9090/index.html.
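To verify that both demo servers are reachable before opening the frontend, a small connectivity check can be run. A minimal sketch, assuming plain insecure channels on the demo ports above:

import grpc

# Ports from the demo invocations above:
# feed handling on 9090, TensorFlow learning on 9091.
for port in (9090, 9091):
    channel = grpc.insecure_channel(f'localhost:{port}')
    try:
        # Block until the channel is ready or the timeout expires.
        grpc.channel_ready_future(channel).result(timeout=5)
        print(f'port {port}: gRPC server reachable')
    except grpc.FutureTimeoutError:
        print(f'port {port}: no gRPC server reachable')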