Choa, de Veyra, Escalona, Fortiz
This is a repository for the Thesis "Evaluation and Comparison of Boosted ML Models in Behavior-Based Malware Detection".
It contains the Jupyter notebook files and datasets used for the development of the study.
- Windows (Recommended)
- Linux (Debian-based)
Kindly install these before proceeding to the next step.
- Install Python and Anaconda/Conda accordingly.
- Once the two are installed, open `Anaconda Prompt` AS ADMINISTRATOR and navigate to the local copy of the repository on your computer.
  - Make sure to install Graphviz to allow for tree visualization in CatBoost.
  - Make sure to follow the instructions shown in `.\Graphiz\README.md` regarding the installation of Graphviz.
- Once navigated, type `install.bat` for Windows or `install.sh` for Linux. The script will begin installing the necessary dependencies/libraries for your Conda environment.
- Once completed, you can begin exploring the thesis project files.
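After the install script finishes, a quick check like the one below can confirm that the environment looks usable before opening the notebooks. This is only a minimal sketch: the package names in `ASSUMED_PACKAGES` are assumptions (the actual list is whatever the install script installs), and the helper names are hypothetical.

```python
import importlib.util
import shutil

# Assumed package names; adjust to match what install.bat / install.sh installs.
ASSUMED_PACKAGES = ["catboost", "lightgbm", "notebook"]


def missing_packages(names):
    """Return the names that cannot be found in the active Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def graphviz_on_path():
    """Return True if the Graphviz `dot` executable is discoverable on PATH."""
    return shutil.which("dot") is not None


if __name__ == "__main__":
    print("Missing packages:", missing_packages(ASSUMED_PACKAGES) or "none")
    print("Graphviz 'dot' found:", graphviz_on_path())
```

Run it from the same Anaconda Prompt session used for the installation so that it inspects the intended Conda environment.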
- Open `Anaconda Prompt`.
- Navigate to the location of the GitHub repository on your computer.
- Type `jupyter notebook`.
- To terminate `jupyter`, simply press `Ctrl+C` in the Anaconda Prompt.
- Install Anaconda as shown here.
- Once completed, run Anaconda Terminal (assuming `conda config --set auto_activate_base False`) by typing `source <PATH_TO_ANACONDA>/bin/activate`
Make sure you have installed the CUDA Toolkit on your machine so that GPU (CUDA-specific) support is available. Note that this may replace (downgrade) your GPU driver.
- Download the latest GCC
- Download the latest CMake
- Download Boost v1.56.0
- Follow the guide accordingly
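Before starting the build, it may help to confirm that the tools listed above are visible from the terminal. The sketch below assumes the usual executable names (`nvcc`, `gcc`, `cmake`); these names and the `tool_version` helper are illustrative assumptions, not part of the repository. Boost is a library rather than a command, so it is not covered here.

```python
import shutil
import subprocess

# Assumed executable names for the tools mentioned in the steps above.
TOOLS = {"CUDA Toolkit": "nvcc", "GCC": "gcc", "CMake": "cmake"}


def tool_version(executable):
    """Return the first output line of `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = (result.stdout or result.stderr).strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    for label, exe in TOOLS.items():
        print(f"{label}: {tool_version(exe) or 'not found on PATH'}")
```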
Note that the installation of LightGBM with CUDA support has a steep learning curve.
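Once a CUDA-enabled build of LightGBM is in place, training is typically pointed at the GPU through the `device_type` parameter. The sketch below only builds a parameter dictionary; the helper name, the `binary` objective, and the seed handling are assumptions for illustration and are not taken from the notebooks.

```python
def lightgbm_cuda_params(seed, use_cuda=True):
    """Build a LightGBM parameter dict that requests the CUDA backend (or CPU as a fallback)."""
    return {
        "objective": "binary",  # assumption: malicious vs. benign classification
        "device_type": "cuda" if use_cuda else "cpu",
        "seed": seed,
        "verbose": -1,
    }
```

If the installed LightGBM was not built with CUDA support, requesting `device_type="cuda"` is expected to fail at training time, in which case passing `use_cuda=False` falls back to the CPU.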
Due to the non-deterministic nature of the models and functions used in this study, your results are not guaranteed to match the results obtained by the study exactly. However, the overall trends should remain the same. In addition, the proponents of this study have done their due diligence to keep the results as consistent as possible by using the same seed value in all notebooks, so that each run of the notebooks is as predictable as possible.
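The consistent-seed approach can be sketched roughly as follows. The value of `SEED` and the `seed_everything` helper are hypothetical; the notebooks define their own seed value, which should be used instead.

```python
import os
import random

SEED = 42  # hypothetical value; use the seed defined in the notebooks


def seed_everything(seed):
    """Seed the common sources of randomness available in this process."""
    # Only affects Python processes started after this point; hash
    # randomization in the current interpreter is already fixed.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional; skipped when NumPy is not installed
    except ImportError:
        return
    np.random.seed(seed)


if __name__ == "__main__":
    seed_everything(SEED)
    print(random.random())
```

The same value would also be passed to each model's own seed parameter (for example `random_seed` in CatBoost or `seed` in LightGBM). Even then, GPU training may still produce small run-to-run differences.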