This repository hosts the MalnpmDB dataset and the MalPacDetector malicious package detector described in the paper MalPacDetector: An LLM-based Malicious npm Package Detector.
- Operating System: Ubuntu 22.04
- Python: Python 3.10.12
- node.js: node.js v18.16.0
$ python3 configure.py
Follow the prompts to configure the project. You can configure:
- datasets path: Where to find npm packages. (default: datasets/MalnpmDB)
- models path: Where to save trained models. (default: models)
- reports path: Where to save prediction result reports. (default: reports)
- features: Where to save extracted features. (default: features)
- feature-positions: Where to save code line position information of extracted features. (default: feature-positions)
Then use the following command to set up the project.
$ ./setup.sh
Once you have set up the project, you will see the following folders:
- conf: containing configuration and settings files.
- datasets: containing MalnpmDB dataset.
- feature-extract: containing feature extraction code files.
- training: containing training and prediction code files.
If you are using the default configuration, you will see the following folders as well:
- models: containing trained machine learning models.
- reports: containing npm packages prediction reports.
- features: containing npm packages' features extracted by feature extractor.
- feature-positions: containing feature position information.
First, activate the Python virtual environment:
$ source env/bin/activate
There is one main Python script:
- cli.py: for training a machine learning model and predicting npm packages. By specifying different parameters, users can train different models or predict different packages.
The parameters available for training and prediction tasks are listed below:
Options | Description |
---|---|
-h | Show all help information. |
extract | Extract features. |
-h | Show help information about extracting features. |
-d | npm dataset name. |
train | Train model. |
-h | Show help information about training models. |
-m | Malicious npm dataset name. |
-b | Benign npm dataset name. |
-o | Model used to train. ("NB", "MLP", "RF", "SVM") |
-p | Preprocess method. ("none", "standardlize", "min-max-scale") |
-a | Training or saving model. (training, save) |
-hs | Smoothing of NB to save. |
-hr | Learning rate of MLP to save. |
-hl | Number of layers of MLP to save. |
-hi | Number of iterations of MLP to save. |
-ho | Optimization algorithm of MLP to save. |
-ha | Activation function of MLP to save. |
-he | Number of decision trees of RF to save. |
-hd | Maximum depth of RF to save. |
-hg | Gamma of SVM to save. |
-hc | C of SVM to save. |
predict | Predict npm packages. |
-h | Show help information about predicting npm packages. |
-o | Model used to predict. |
-d | npm dataset that stores gzip-formatted npm packages. |
-p | npm package directory path. |
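The -p option of train selects a preprocessing method. As a rough illustration only, and assuming that "standardlize" means z-score standardization and "min-max-scale" means scaling to the range [0, 1], the two transforms can be sketched on a single feature column as follows. The function names are hypothetical and are not taken from cli.py.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a feature column: subtract the mean, divide by the std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale a feature column linearly so it spans [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid dividing by zero on a constant column.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Whether the project applies these transforms per feature in exactly this way is an assumption; check the training code before relying on it.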
For convenience, use the following commands to show help information.
# Show all help information.
$ python3 cli.py -h
# Show help information about extracting features.
$ python3 cli.py extract -h
# Show help information about training models.
$ python3 cli.py train -h
# Show help information about predicting npm packages.
$ python3 cli.py predict -h
The parameters related to feature extraction are listed under extract in the table above. The npm dataset must follow this structure:
dataset_name
|__ <package_name-package_version1>.tar.gz
|__ <package_name-package_version2>.tar.gz
|__ ...
|__ <package_name-package_versionn>.tar.gz
Each compressed package should have the following structure, which is the standard npm package layout:
package_name-package_version
|__ package
    |__ package.json
    |__ ...
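Before extracting features, it can help to check that every archive in a dataset directory matches the layout above. The following is a minimal sketch, not part of the project: the helper name find_invalid_packages is hypothetical, and it assumes that a valid archive contains a package/package.json entry, either at the top level or under a package_name-package_version directory.

```python
import tarfile
from pathlib import Path


def find_invalid_packages(dataset_dir):
    """Return names of .tar.gz archives that lack package/package.json."""
    invalid = []
    for archive in sorted(Path(dataset_dir).glob("*.tar.gz")):
        try:
            with tarfile.open(archive, "r:gz") as tar:
                names = tar.getnames()
        except (tarfile.TarError, OSError, EOFError):
            # Corrupt or unreadable archives are reported as invalid too.
            invalid.append(archive.name)
            continue
        # Accept "package/package.json" at the top level or nested
        # under a "<name>-<version>/" directory.
        has_manifest = any(
            n == "package/package.json" or n.endswith("/package/package.json")
            for n in names
        )
        if not has_manifest:
            invalid.append(archive.name)
    return invalid
```

Calling find_invalid_packages on a dataset directory returns an empty list when every archive passes the check.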
Use the following command to extract features from an npm dataset.
$ python3 cli.py extract -d <dataset_name>
The parameters related to model settings are stored in conf/settings.json and are listed under train in the table above. This lets users conveniently train different models or use different datasets.
Use the following command to train a classifier.
$ python3 cli.py train -a training -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name>
The parameters related to model settings are stored in conf/settings.json and are listed under train in the table above.
Use one of the following commands to train a classifier with the given hyperparameters and save it.
# NB
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hs <smoothing>
# MLP
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hr <learning_rate> -hl <number_of_layers> -hi <number_of_iterations> -ho <optimization_algorithm> -ha <activation_function>
# RF
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -he <number_of_decision_trees> -hd <maximum_depth>
# SVM
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hg <Gamma> -hc <C>
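Each model takes a different set of hyperparameter flags when saving, so assembling the command by hand is error-prone. The sketch below builds the argument list for the save commands above and rejects a call that omits a required flag. The helper build_save_command and the SAVE_FLAGS mapping are hypothetical conveniences, not part of cli.py; only the flag names come from the options table.

```python
# Flags each model needs with "-a save", taken from the options table.
SAVE_FLAGS = {
    "NB": ["-hs"],
    "MLP": ["-hr", "-hl", "-hi", "-ho", "-ha"],
    "RF": ["-he", "-hd"],
    "SVM": ["-hg", "-hc"],
}


def build_save_command(model, malicious, benign, preprocess, hyperparams):
    """Build the argv list for `cli.py train -a save` (hypothetical helper)."""
    if model not in SAVE_FLAGS:
        raise ValueError(f"unknown model: {model}")
    missing = [flag for flag in SAVE_FLAGS[model] if flag not in hyperparams]
    if missing:
        raise ValueError(f"{model} is missing hyperparameters: {missing}")
    cmd = ["python3", "cli.py", "train", "-a", "save",
           "-m", malicious, "-b", benign, "-p", preprocess, "-o", model]
    # Append the model-specific flags in a fixed, documented order.
    for flag in SAVE_FLAGS[model]:
        cmd += [flag, str(hyperparams[flag])]
    return cmd
```

For example, build_save_command("RF", "mal", "ben", "none", {"-he": 128, "-hd": 11}) yields a list that can be passed to subprocess.run from the project root; the values 128 and 11 here are only illustrative.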
The parameters related to prediction are listed under predict in the table above.
Use the following command to predict packages.
$ python3 cli.py predict -o <model_name> -d <dataset_name>
For convenience, you can use a single command that goes through the steps above to predict a single package.
$ python3 cli.py predict -o <model_name> -p <package_path>
Hyperparameter values of the four classifiers, where boldface marks the best hyperparameter value for each model.
Model | Hyperparameter |
---|---|
NB | Smoothing terms: (1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4) |
MLP | Learning rate: 5 values randomly selected from a uniform distribution over the interval [0.01, 0.2] (0.0505); Number of hidden units: (16, 32, 100, 150); Number of iterations: (400, 600); Optimization algorithm: (lbfgs, adam) |
RF | Number of decision trees: (16, 32, 64, 100, 128, 256, 512); Maximum depth: (3, 5, 7, 11, 15) |
SVM | Gamma: (scale, auto, 3 values randomly selected from a normal distribution with mean 0.2 and standard deviation 0.075) (scale); C: 3 values randomly selected from a uniform distribution over the interval [0.5, 2.0] (1.0704) |
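The search space in the table can be written down as plain data, which makes the size of each grid easy to see. This is a minimal sketch under stated assumptions: the dictionary keys are hypothetical names, the random values are drawn with Python's random module from the distributions named in the table, and the seed is arbitrary, so the sampled values will not reproduce those used in the paper.

```python
import random
from math import prod


def build_search_space(seed=0):
    """Return candidate hyperparameter values per model (key names assumed)."""
    rng = random.Random(seed)
    return {
        "NB": {"smoothing": [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4]},
        "MLP": {
            # 5 draws from a uniform distribution over [0.01, 0.2]
            "learning_rate": [rng.uniform(0.01, 0.2) for _ in range(5)],
            "hidden_units": [16, 32, 100, 150],
            "iterations": [400, 600],
            "optimizer": ["lbfgs", "adam"],
        },
        "RF": {
            "n_trees": [16, 32, 64, 100, 128, 256, 512],
            "max_depth": [3, 5, 7, 11, 15],
        },
        "SVM": {
            # "scale", "auto", plus 3 draws from N(mean=0.2, std=0.075)
            "gamma": ["scale", "auto"] + [rng.gauss(0.2, 0.075) for _ in range(3)],
            # 3 draws from a uniform distribution over [0.5, 2.0]
            "C": [rng.uniform(0.5, 2.0) for _ in range(3)],
        },
    }


def count_combinations(space):
    """Number of grid points per model: product of the candidate list sizes."""
    return {
        model: prod(len(values) for values in params.values())
        for model, params in space.items()
    }
```

Under these assumptions the grids contain 6 (NB), 80 (MLP), 35 (RF), and 15 (SVM) combinations. A normal draw for gamma can in principle be non-positive, so a real search would need to reject or redraw such values.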
- Dataset: The malicious dataset mal and the benign dataset ben are stored in datasets/MalnpmDB and contain 3258 and 4051 packages, respectively.
- Training and Validation Results: Model training and validation results are stored in the training/result directory, in files named ***_validation.csv, where *** is the model name.
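Because each result file is named after its model, the available results can be discovered from the file names alone. The sketch below maps model names to their validation files; the helper name find_validation_reports is hypothetical, and it makes no assumption about the columns inside the CSV files.

```python
from pathlib import Path

SUFFIX = "_validation.csv"


def find_validation_reports(result_dir):
    """Map model name -> path for each <model>_validation.csv in result_dir."""
    reports = {}
    for path in sorted(Path(result_dir).glob("*" + SUFFIX)):
        # Strip the fixed suffix to recover the model name from the file name.
        model = path.name[: -len(SUFFIX)]
        reports[model] = path
    return reports
```

Pass the result directory described above as result_dir; the returned paths can then be opened with the csv module or any other CSV reader.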
Because the paper has not yet been published, and for security reasons, we cannot host the malicious package dataset here. If you need the dataset, please send a request to hust_jianw@hust.edu.cn.
Bug reports and improvement suggestions are appreciated. Please kindly cite our paper if you use the code or data in your work.
Thanks!