
MalPacDetector

This repository hosts the MalnpmDB dataset and the MalPacDetector malicious package detector described in the paper "MalPacDetector: An LLM-based Malicious npm Package Detector".

Requirements

Environment

  • Operating System: Ubuntu 22.04
  • Python: Python 3.10.12
  • node.js: node.js v18.16.0

Setup

$ python3 configure.py

Follow the prompts to configure the project. You can configure:

  • datasets path: Where to find npm packages. (default: datasets/MalnpmDB)
  • models path: Where to save trained models. (default: models)
  • reports path: Where to save prediction result reports. (default: reports)
  • features: Where to save extracted features. (default: features)
  • feature-positions: Where to save code line position information of extracted features. (default: feature-positions)

Then run the following command to set up the project.

$ ./setup.sh

Once the project is set up, you will see the following folders:

  • conf: containing configuration and settings files.
  • datasets: containing MalnpmDB dataset.
  • feature-extract: containing feature extraction code files.
  • training: containing training and prediction code files.

If you use the default configuration, you will see the following folders as well:

  • models: containing trained machine learning models.
  • reports: containing npm packages prediction reports.
  • features: containing npm packages' features extracted by feature extractor.
  • feature-positions: containing feature position information.
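
As a quick sanity check after setup, the folder lists above can be verified with a few lines of Python. This is only a sketch: the helper name `missing_dirs` is hypothetical, and the second list of names assumes the default configuration was kept.

```python
from pathlib import Path

# Folder names taken from the lists above.
CORE_DIRS = ["conf", "datasets", "feature-extract", "training"]
# These four exist only if the default configuration was kept (assumption).
DEFAULT_DIRS = ["models", "reports", "features", "feature-positions"]


def missing_dirs(root, names):
    """Return the names from `names` that are not directories under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".", CORE_DIRS + DEFAULT_DIRS)
    print("missing folders:", missing if missing else "none")
```

Run it from the repository root; an empty result means every expected folder is present.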

Usage

First, activate the Python virtual environment:

$ source env/bin/activate

The main Python script is:

  • cli.py: trains machine learning models and predicts npm packages. By specifying different parameters, users can train different models or predict different packages.

The parameters available for feature extraction, training, and prediction are listed below:

Options      Description
-h           Show all help information.
extract      Extract features.
  -h         Show help information about extracting features.
  -d         npm dataset name.
train        Train model.
  -h         Show help information about training models.
  -m         Malicious npm dataset name.
  -b         Benign npm dataset name.
  -o         Model used to train. ("NB", "MLP", "RF", "SVM")
  -p         Preprocess method. ("none", "standardlize", "min-max-scale")
  -a         Action: train or save a model. (training, save)
  -hs        Smoothing of NB to save.
  -hr        Learning rate of MLP to save.
  -hl        Number of layers of MLP to save.
  -hi        Number of iterations of MLP to save.
  -ho        Optimization algorithm of MLP to save.
  -ha        Activation function of MLP to save.
  -he        Number of decision trees of RF to save.
  -hd        Maximum depth of RF to save.
  -hg        Gamma of SVM to save.
  -hc        C of SVM to save.
predict      Predict npm packages.
  -h         Show help information about predicting npm packages.
  -o         Model used to predict.
  -d         npm dataset containing gzip-compressed npm packages.
  -p         npm package directory path.

For convenience, use the following commands to show help information.

# Show all help information.
$ python3 cli.py -h

# Show help information about extracting features.
$ python3 cli.py extract -h

# Show help information about training models.
$ python3 cli.py train -h

# Show help information about predicting npm dataset.
$ python3 cli.py predict -h

Step 1: Extract features from npm dataset

The parameters for feature extraction are listed under extract in the table above. The npm dataset must follow this structure:

dataset_name
|__ <package_name-package_version1>.tar.gz
|__ <package_name-package_version2>.tar.gz
|__ ...
|__ <package_name-package_versionn>.tar.gz

Each compressed package should have the following structure, which is the standard npm package layout:

package_name-package_version
|__ package
   |__ package.json
   |__ ...
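
Before running extraction, it can help to confirm that every archive in a dataset folder matches the layout above. The sketch below uses only Python's standard `tarfile` module. The function names are hypothetical and it is not part of the repository; it only checks that each `.tar.gz` file contains `package/package.json`.

```python
import tarfile
from pathlib import Path


def has_package_json(archive_path):
    """Return True if the gzip-compressed tar archive contains package/package.json."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return "package/package.json" in tar.getnames()


def invalid_archives(dataset_dir):
    """Return the names of .tar.gz files in dataset_dir that fail the layout check."""
    bad = []
    for path in sorted(Path(dataset_dir).glob("*.tar.gz")):
        try:
            ok = has_package_json(path)
        except (tarfile.TarError, OSError, EOFError):
            ok = False  # unreadable, truncated, or not a gzip-compressed tar archive
        if not ok:
            bad.append(path.name)
    return bad
```

Calling `invalid_archives("datasets/MalnpmDB/mal")`, for example, would list the archives that need attention. That path assumes the default datasets location.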

Use the following command to extract features from an npm dataset.

$ python3 cli.py extract -d <dataset_name>

Step 2: Train a classifier

The parameters related to model settings are stored in conf/settings.json and are listed under train in the table above. This lets users conveniently train different models or use different datasets.

Use the following command to train a classifier.

$ python3 cli.py train -a training -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name>
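
The -p option selects a preprocessing method. The README does not say how the methods are implemented, so the sketch below only illustrates the conventional definitions of standardization (the CLI value is spelled "standardlize") and min-max scaling, applied to one feature column. The function names are hypothetical and the repository's own code may differ.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in column]


def min_max_scale(column):
    """Rescale values linearly into the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: map everything to 0
    return [(x - lo) / (hi - lo) for x in column]


print(min_max_scale([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

Distance- and gradient-based models such as SVM and MLP are typically more sensitive to feature scale than tree-based models such as RF, which is one reason to compare the three -p settings per model.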

Step 3: Save the classifier

The parameters related to model settings are stored in conf/settings.json and are listed under train in the table above.

Use the following commands to save a classifier with the chosen hyperparameters.

# NB
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hs <smoothing>

# MLP
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hr <learning_rate> -hl <number_of_layers> -hi <number_of_iterations> -ho <optimization_algorithm> -ha <activation_function>

# RF
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -he <number_of_decision_trees> -hd <maximum_depth>

# SVM
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hg <Gamma> -hc <C>
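
Because each model takes a different set of hyperparameter flags, scripting the save step can prevent mismatches. The helper below is a hypothetical sketch, not part of the repository. It builds the argument list for the commands above, using only the flags documented in the options table, and rejects a flag set that does not match the chosen model.

```python
# Hyperparameter flags per model, copied from the options table in this README.
HYPERPARAM_FLAGS = {
    "NB": ["-hs"],
    "MLP": ["-hr", "-hl", "-hi", "-ho", "-ha"],
    "RF": ["-he", "-hd"],
    "SVM": ["-hg", "-hc"],
}


def build_save_command(model, malicious, benign, preprocess, hyperparams):
    """Build the argument list for `cli.py train -a save`.

    `hyperparams` maps each flag (for example "-hs") to its value.
    """
    if model not in HYPERPARAM_FLAGS:
        raise ValueError(f"unknown model: {model}")
    expected = HYPERPARAM_FLAGS[model]
    if set(hyperparams) != set(expected):
        raise ValueError(f"{model} expects exactly the flags {expected}")
    cmd = ["python3", "cli.py", "train", "-a", "save",
           "-m", malicious, "-b", benign, "-p", preprocess, "-o", model]
    for flag in expected:
        cmd += [flag, str(hyperparams[flag])]
    return cmd


# Example values are illustrative only, not recommended settings.
print(" ".join(build_save_command("RF", "mal", "ben", "none", {"-he": 128, "-hd": 15})))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the virtual environment active.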

Step 4: Predict npm packages

The parameters for prediction are listed under predict in the table above.

Use the following command to predict packages.

$ python3 cli.py predict -o <model_name> -d <dataset_name>

For convenience, a single command is enough to predict a single package:

$ python3 cli.py predict -o <model_name> -p <package_path>

Hyperparameters

Hyperparameter values searched for the four classifiers, where boldface marks the best value for each model.

Model  Hyperparameter
NB     Smoothing terms: (1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4)
MLP    Learning rate: 5 values randomly selected from a uniform distribution over the interval [0.01, 0.2] (0.0505)
       Number of hidden units: (16, 32, 100, 150)
       Number of iterations: (400, 600)
       Optimization algorithm: (lbfgs, adam)
RF     Number of decision trees: (16, 32, 64, 100, 128, 256, 512)
       Maximum depth: (3, 5, 7, 11, 15)
SVM    Gamma: (scale, auto, 3 values randomly selected from a normal distribution with mean 0.2 and standard deviation 0.075) (scale)
       C: 3 values randomly selected from a uniform distribution over the interval [0.5, 2.0] (1.0704)
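
To make the search space concrete, the sketch below generates candidate settings for MLP and SVM in the way the table describes: fixed grids combined with a few randomly sampled values. Two things are assumptions, not statements from the source: that the candidates are combined as a full Cartesian product, and the fixed random seed. The function names are hypothetical.

```python
import itertools
import random


def mlp_candidates(rng):
    """Cross 5 sampled learning rates with the fixed MLP grids."""
    learning_rates = [rng.uniform(0.01, 0.2) for _ in range(5)]
    hidden_units = [16, 32, 100, 150]
    iterations = [400, 600]
    optimizers = ["lbfgs", "adam"]
    return list(itertools.product(learning_rates, hidden_units, iterations, optimizers))


def svm_candidates(rng):
    """Cross the gamma options (2 named, 3 sampled) with 3 sampled C values."""
    gammas = ["scale", "auto"] + [rng.gauss(0.2, 0.075) for _ in range(3)]
    cs = [rng.uniform(0.5, 2.0) for _ in range(3)]
    return list(itertools.product(gammas, cs))


rng = random.Random(0)  # fixed seed is an assumption, for repeatability only
print(len(mlp_candidates(rng)))  # 5 * 4 * 2 * 2 = 80
print(len(svm_candidates(rng)))  # 5 * 3 = 15
```

Each candidate tuple could then be trained and scored on the validation split to pick the best setting per model.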

Dataset and Results

  • Dataset: The malicious dataset mal and the benign dataset ben are located in datasets/MalnpmDB and contain 3258 and 4051 packages, respectively.
  • Training and Validation Results: Model training and validation results are stored in the training/result directory in files named ***_validation.csv, where *** is the model name.

Contact

Because the paper has not yet been published, and for security reasons, we cannot host the malicious package dataset here. If you need the dataset, please send a request to hust_jianw@hust.edu.cn.

Bug reports and improvement suggestions are appreciated. Please cite our paper if you use the code or data in your work.

Thanks!
