bilbo-bagging-hybrid

Code to go with the paper "Real-Time Detection of Dictionary DGA Network Traffic using Deep Learning" and our presentations at ShmooCon 2018, Deep Learning World 2018, and the Australian Cyber Security Centre Conference (2018).

Paper Abstract

Botnets and malware continue to avoid detection by static rules engines when using domain generation algorithms (DGAs) for callouts to unique, dynamically generated web addresses. Common DGA detection techniques fail to reliably detect DGA variants that combine random dictionary words to create domain names that closely mirror legitimate domains. To combat this, we created a novel hybrid neural network, Bilbo the "bagging" model, that analyses domains and scores the likelihood they are generated by such algorithms and therefore are potentially malicious. Bilbo is the first parallel usage of a convolutional neural network (CNN) and a long short-term memory (LSTM) network for DGA detection. Our unique architecture is found to be the most consistent in performance in terms of AUC, F1 score, and accuracy when generalising across different dictionary DGA classification tasks compared to current state-of-the-art deep learning architectures. We validate using reverse-engineered dictionary DGA domains and detail our real-time implementation strategy for scoring real-world network logs within a large financial enterprise. In four hours of actual network traffic, the model discovered at least five potential command-and-control networks that commercial vendor tools did not flag.

Data

The data used in the documented experiments come from DGArchive, the Alexa Top 1 Million list, and live enterprise logs. While we cannot publish the enterprise logs, we can publish or direct you to the other data sets.

DGArchive

DGArchive was led by Daniel Plohmann at the time we needed the data for our experiments. If you would like access to the same data, you must request it through him. See the DGArchive website for details on how to do so.

Alexa Top 1 Million

This dataset was hosted in an Amazon S3 bucket, but it appears to no longer be released; we retrieved our copy in 2017. We have included it in this repository for your convenience.
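As a rough illustration of how a domain list like this can be prepared for a character-level model, the sketch below reads rows in the `rank,domain` layout used by the Alexa Top 1 Million CSV and turns each domain into a fixed-length sequence of integer character IDs. The vocabulary, the `MAX_LEN` value, and the helper names `read_domains` and `encode` are assumptions made for this example; they are not taken from the paper or from the code in this repository, and the sample rows are inlined rather than read from the file under data/.

```python
import csv
import io
import string

# Assumed character vocabulary: lowercase letters, digits, hyphen, and dot.
# ID 0 is reserved for padding; one extra ID is reserved for unknown characters.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase + string.digits + "-.")}
UNKNOWN_ID = len(VOCAB) + 1
MAX_LEN = 63  # assumed fixed input length for this sketch


def read_domains(handle):
    """Yield the domain column from CSV rows of the form `rank,domain`."""
    for row in csv.reader(handle):
        if len(row) == 2:
            yield row[1].strip().lower()


def encode(domain, max_len=MAX_LEN):
    """Map a domain to a list of character IDs, truncated or zero-padded to max_len."""
    ids = [VOCAB.get(ch, UNKNOWN_ID) for ch in domain[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Two inlined sample rows stand in for the real file.
sample = io.StringIO("1,google.com\n2,youtube.com\n")
domains = list(read_domains(sample))
encoded = [encode(d) for d in domains]
```

To run this against the real list, replace `sample` with an open file handle for the CSV. Every row in `encoded` has the same length, which is what an embedding layer in a character-level network typically expects.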

Models

Based on other experiments done on DGA detection, we compared Bilbo to four other deep learning model architectures:

Artificial Neural Network (ANN) - Single layer
Long Short-Term Memory (LSTM) Network
Convolutional Neural Network (CNN)
MIT's CNN-LSTM Hybrid Model - adapted in Vosoughi, et al. (2016 - Tweet2vec) and Yu, et al. (2018)

More details on each model, including the Keras/TensorFlow architecture we used with Python 3.7, are available in the models/ directory.
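The abstract compares these architectures by AUC, F1 score, and accuracy. For readers unfamiliar with those metrics, here is a minimal pure-Python sketch of how each can be computed for a binary DGA classifier, where label 1 means "DGA" and label 0 means "legitimate". The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration; they are not results from the paper, and the repository's own evaluation code may differ.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the true labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def f1_score(labels, preds):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy values, used only to exercise the functions.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

acc = accuracy(labels, preds)
f1 = f1_score(labels, preds)
area = auc(labels, scores)
```

Accuracy and F1 depend on the chosen threshold, while AUC is computed from the raw scores and is threshold-independent, which is one reason the three are often reported together. The pairwise AUC above is quadratic in the number of samples, so it suits small checks rather than full evaluation runs.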

Acknowledgements

Thank you to Capital One for the incredible opportunity to deploy a machine learning model developed for research into a live environment for evaluation. To Jason Trost, your mentorship and intellectual curiosity inspire everyone around you. We appreciate your and Capital One's support to publish our work as an academic paper after our talks in industry.

To the reviewers at our last attempted venues, thank you for the incredible feedback that greatly improved our analysis.
