MalPacDetector

This repository hosts the MalnpmDB dataset and the malicious package detector MalPacDetector described in the paper "MalPacDetector: An LLM-based Malicious npm Package Detector".

Requirements

Environment

Operating System: Ubuntu 22.04
Python: 3.10.12
Node.js: v18.16.0

Setup

$ python3 configure.py

Follow the prompts to configure the project. You can configure:

datasets path: Where to find npm packages. (default: datasets/MalnpmDB)
models path: Where to save trained models. (default: models)
reports path: Where to save prediction result reports. (default: reports)
features: Where to save extracted features. (default: features)
feature-positions: Where to save code line position information of extracted features. (default: feature-positions)

Then run the following command to set up the project.

$ ./setup.sh

Once the project is set up, you will see the following folders:

conf: containing configuration and settings files.
datasets: containing MalnpmDB dataset.
feature-extract: containing feature extraction code files.
training: containing training and prediction code files.

If you use the default configuration, you will also see the following folders:

models: containing trained machine learning models.
reports: containing npm packages prediction reports.
features: containing npm packages' features extracted by feature extractor.
feature-positions: containing the code line positions of extracted features.

Usage

First, activate the Python virtual environment:

$ source env/bin/activate

There is one main Python script:

cli.py: for extracting features, training a machine learning model, and predicting npm packages. By specifying different parameters, users can train different models or predict different packages.

The parameters available for each task are listed below:

Options	Description
-h	Show all help information.
extract	Extract features.
-h	Show help information about extracting features.
-d	npm dataset name.
train	Train model.
-h	Show help information about training models.
-m	Malicious npm dataset name.
-b	Benign npm dataset name.
-o	Model used to train. ("NB", "MLP", "RF", "SVM")
-p	Preprocess method. ("none", "standardlize", "min-max-scale")
-a	Training or saving model. (training, save)
-hs	Smoothing of NB to save.
-hr	Learning rate of MLP to save.
-hl	Number of layers of MLP to save.
-hi	Number of iterations of MLP to save.
-ho	Optimization algorithm of MLP to save.
-ha	Activation function of MLP to save.
-he	Number of decision trees of RF to save.
-hd	Maximum depth of RF to save.
-hg	Gamma of SVM to save.
-hc	C of SVM to save.
predict	Predict npm packages.
-h	Show help information about predicting npm packages.
-o	Model used to predict.
-d	npm dataset that stores gzip-compressed npm packages.
-p	npm package directory path.

For convenience, use the following commands to show help information.

# Show all help information.
$ python3 cli.py -h

# Show help information about extracting features.
$ python3 cli.py extract -h

# Show help information about training models.
$ python3 cli.py train -h

# Show help information about predicting npm packages.
$ python3 cli.py predict -h

Step 1: Extract features from npm dataset

The parameters for feature extraction are listed under extract in the table above. The npm dataset must have the following structure:

dataset_name
|__ <package_name-package_version1>.tar.gz
|__ <package_name-package_version2>.tar.gz
|__ ...
|__ <package_name-package_versionN>.tar.gz

Each compressed package must unpack to the standard npm package structure:

package_name-package_version
|__ package
   |__ package.json
   |__ ...
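Before running extraction, it can help to confirm that every archive in a dataset folder follows this layout. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not part of cli.py. It assumes the archive member is named package/package.json, as in the tree above.

```python
import tarfile
from pathlib import Path


def check_package_archive(archive_path):
    """Return True if the .tar.gz archive contains package/package.json."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
    # Some tools store members with a leading "./"; accept both spellings.
    return any(name.removeprefix("./") == "package/package.json" for name in names)


def find_invalid_archives(dataset_dir):
    """List the .tar.gz files in dataset_dir that do not follow the layout."""
    invalid = []
    for archive in sorted(Path(dataset_dir).glob("*.tar.gz")):
        try:
            ok = check_package_archive(archive)
        except (tarfile.TarError, OSError, EOFError):
            # Unreadable or truncated archives are reported as invalid.
            ok = False
        if not ok:
            invalid.append(archive.name)
    return invalid
```

Calling find_invalid_archives on a dataset folder returns an empty list when every archive matches the expected structure.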

Use the following command to extract features from an npm dataset.

$ python3 cli.py extract -d <dataset_name>

Step 2: Train a classifier

The parameters for model settings are stored in conf/settings.json and are listed under train in the table above. This lets users conveniently train different models or use different datasets.

Use the following command to train a classifier.

$ python3 cli.py train -a training -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name>
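To make the -p option more concrete, here is a small sketch of what the two non-trivial preprocessing methods usually compute. It assumes that "standardlize" means z-score standardization and that "min-max-scale" rescales each feature to [0, 1]; the actual code in the training folder may differ, and the function names below are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant feature carries no signal
    return [(x - mean) / std for x in column]


def min_max_scale(column):
    """Rescale a feature column linearly so its values lie in [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(x - low) / (high - low) for x in column]


def preprocess(rows, method):
    """Apply the chosen method column by column to a list of feature vectors."""
    if method == "none":
        return [list(row) for row in rows]
    # The keys mirror the values accepted by -p in the options table.
    scale = {"standardlize": standardize, "min-max-scale": min_max_scale}[method]
    columns = [scale(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]
```

For example, preprocess([[0, 10], [5, 20], [10, 30]], "min-max-scale") maps both feature columns onto 0.0, 0.5 and 1.0.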

Step 3: Save the classifier

The parameters for model settings are stored in conf/settings.json and are listed under train in the table above.

Use the following commands to save a classifier with the chosen hyperparameters.

# NB
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hs <smoothing>

# MLP
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hr <learning_rate> -hl <number_of_layers> -hi <number_of_iterations> -ho <optimization_algorithm> -ha <activation_function>

# RF
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -he <number_of_decision_trees> -hd <maximum_depth>

# SVM
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hg <Gamma> -hc <C>
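Because each model takes a different set of hyperparameter flags, scripted runs can benefit from a small helper that assembles the argument list. The sketch below only builds the command; the flags come from the options table, while the function name and keyword names (smoothing, trees, and so on) are hypothetical.

```python
# Hyperparameter flags per model, copied from the options table above.
HYPERPARAMETER_FLAGS = {
    "NB": {"smoothing": "-hs"},
    "MLP": {
        "learning_rate": "-hr",
        "layers": "-hl",
        "iterations": "-hi",
        "optimizer": "-ho",
        "activation": "-ha",
    },
    "RF": {"trees": "-he", "max_depth": "-hd"},
    "SVM": {"gamma": "-hg", "c": "-hc"},
}


def build_save_command(model, malicious, benign, preprocess, **hyperparameters):
    """Return the argv list for `cli.py train -a save` with the given settings."""
    flags = HYPERPARAMETER_FLAGS[model]
    unknown = set(hyperparameters) - set(flags)
    if unknown:
        raise ValueError(f"{model} does not accept: {sorted(unknown)}")
    command = [
        "python3", "cli.py", "train", "-a", "save",
        "-m", malicious, "-b", benign, "-p", preprocess, "-o", model,
    ]
    # Append flags in table order so the output is deterministic.
    for name, flag in flags.items():
        if name in hyperparameters:
            command += [flag, str(hyperparameters[name])]
    return command
```

The returned list can be passed to subprocess.run(command, check=True) from the repository root once the virtual environment is active.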

Step 4: Predict npm packages

The parameters for prediction are listed under predict in the table above.

Use the following command to predict packages.

$ python3 cli.py predict -o <model_name> -d <dataset_name>

To predict a single package directly, pass its directory path with -p instead of a dataset name.

$ python3 cli.py predict -o <model_name> -p <package_path>
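When several unpacked packages sit side by side in one folder, the single-package form can be scripted. The sketch below builds one predict command per package directory; it assumes that the path given to -p is the directory that directly contains package.json, which the source does not state explicitly, and the function name is hypothetical.

```python
from pathlib import Path


def build_predict_commands(model_name, packages_root):
    """Return one `cli.py predict -p` argv list per package directory."""
    commands = []
    # Assumption: a package directory is any direct child holding package.json.
    for package_json in sorted(Path(packages_root).glob("*/package.json")):
        commands.append(
            ["python3", "cli.py", "predict", "-o", model_name, "-p", str(package_json.parent)]
        )
    return commands
```

Each returned list can be run with subprocess.run from the repository root.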

Hyperparameters

Hyperparameter values explored for the four classifiers; boldface marks the best value for each model.

Model	Hyperparameter	Candidate values
NB	Smoothing term	1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4
MLP	Learning rate	5 values randomly selected from a uniform distribution on the interval [0.01, 0.2] (0.0505)
MLP	Number of hidden units	16, 32, 100, 150
MLP	Number of iterations	400, 600
MLP	Optimization algorithm	lbfgs, adam
RF	Number of decision trees	16, 32, 64, 100, 128, 256, 512
RF	Maximum depth	3, 5, 7, 11, 15
SVM	Gamma	scale, auto, 3 values randomly selected from a normal distribution with mean 0.2 and standard deviation 0.075 (scale)
SVM	C	3 values randomly selected from a uniform distribution on the interval [0.5, 2.0] (1.0704)
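The search space above can be written down as a grid and enumerated. The following is a sketch of one way to do that with the Python standard library; the seed, the dictionary keys, and the function names are assumptions, so the randomly drawn values will not match the ones used in the paper.

```python
import itertools
import random


def build_search_space(seed=0):
    """Return candidate values per model, following the hyperparameter table."""
    rng = random.Random(seed)  # fixed seed so the sampled values are repeatable
    return {
        "NB": {"smoothing": [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4]},
        "MLP": {
            "learning_rate": [rng.uniform(0.01, 0.2) for _ in range(5)],
            "hidden_units": [16, 32, 100, 150],
            "iterations": [400, 600],
            "optimizer": ["lbfgs", "adam"],
        },
        "RF": {
            "trees": [16, 32, 64, 100, 128, 256, 512],
            "max_depth": [3, 5, 7, 11, 15],
        },
        "SVM": {
            # Two named settings plus three draws from N(0.2, 0.075).
            "gamma": ["scale", "auto"] + [rng.gauss(0.2, 0.075) for _ in range(3)],
            "c": [rng.uniform(0.5, 2.0) for _ in range(3)],
        },
    }


def iter_combinations(grid):
    """Yield every combination of one model's candidate values as a dict."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))
```

Enumerating iter_combinations(build_search_space()["RF"]) yields 7 x 5 = 35 settings, each of which could be passed to a training run.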

Dataset and Results

Dataset: datasets/MalnpmDB contains the malicious dataset mal and the benign dataset ben, with 3258 and 4051 packages respectively.
Training and Validation Results: Model training and validation results are stored in the training/result directory in files named ***_validation.csv, where *** is the model name.
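The validation files can be collected per model with a few lines of Python. This sketch assumes each CSV file starts with a header row; it makes no assumption about which columns are present, and the function name is hypothetical.

```python
import csv
from pathlib import Path

SUFFIX = "_validation.csv"


def load_validation_results(result_dir):
    """Map each model name to the rows of its <model>_validation.csv file."""
    results = {}
    for path in sorted(Path(result_dir).glob("*" + SUFFIX)):
        model = path.name[: -len(SUFFIX)]  # strip the suffix to get the model name
        with path.open(newline="") as handle:
            # Assumption: the first row holds the column names.
            results[model] = list(csv.DictReader(handle))
    return results
```

Pass the path of the result directory; the returned dictionary has one entry per model, each a list of row dictionaries.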

Contact

Because the paper has not yet been published, and for security reasons, we cannot host the malicious package dataset here. If you need the dataset, please send a request to hust_jianw@hust.edu.cn.

Bug reports and improvement suggestions are appreciated. Please cite our paper if you use the code or data in your work.

Thanks!
