alexeypustynnikov/Microsoft-Malware-Prediction

Microsoft-Malware-Prediction

This is the code of our final submission (https://github.com/d-eremeev and https://github.com/alexeypustynnikov).

General idea of pipeline

First, train a lasso regression (h2o package) on the processed and banded data to produce a score. Then pass that score to LightGBM as a new factor.
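The two-stage idea can be illustrated with a minimal sketch. It is not the code from the notebooks: the h2o lasso is swapped for a tiny L1-penalised logistic regression written in plain Python (proximal gradient descent), the toy data and all function names are made up for illustration, and the LightGBM stage is only indicated by the augmented feature matrix it would receive.

```python
import math

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_l1_logistic(X, y, lam=0.01, lr=0.5, epochs=500):
    """Stage 1 stand-in (assumption): L1-penalised logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step followed by soft-thresholding (the lasso part)
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def score(X, w, b):
    """Predicted probability per row; this is the 'scoring' factor."""
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

# Toy data: the first feature is informative, the second is not
X_train = [[0.0, 1.0], [0.2, 0.0], [0.9, 1.0], [1.0, 0.0]]
y_train = [0, 0, 1, 1]

w, b = fit_l1_logistic(X_train, y_train)
lasso_score = score(X_train, w, b)

# Stage 2 input: the original features plus the score as one extra column
X_train_aug = [xi + [s] for xi, s in zip(X_train, lasso_score)]
```

The sketch scores the same rows it was trained on for brevity. Whether the notebooks score the training rows in-sample or out-of-fold is not stated in this README.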

Description of files

  1. pyeda.py: class with auxiliary functions for data processing (see details below)
  2. preprocessing.ipynb: main file for all data processing
    • Loading and feature engineering
    • Banding
    • Imputing missing values
    • Replacing values in test set that do not exist in train set
    • Visualization
  3. h2o_lasso_scoring.ipynb: creates the lasso scoring that is passed into LightGBM
  4. lightgbm_with_score.ipynb: LightGBM model
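The step "Replacing values in test set that do not exist in train set" could look roughly like the sketch below. It uses plain Python on lists of dicts rather than the repository's own code. The function names, the toy rows, and the choice of a 'NA' placeholder as the replacement value are all assumptions made for illustration.

```python
def excess_levels(train_rows, test_rows, factors):
    """Per factor, list the levels that occur in test but never in train."""
    excess = {}
    for f in factors:
        seen = {row[f] for row in train_rows}
        extra = sorted({row[f] for row in test_rows} - seen)
        if extra:
            excess[f] = extra
    return excess

def replace_excess(test_rows, excess, placeholder="NA"):
    """Return copies of the test rows with unseen levels replaced."""
    cleaned = []
    for row in test_rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        for f, levels in excess.items():
            if new_row[f] in levels:
                new_row[f] = placeholder
        cleaned.append(new_row)
    return cleaned

# Hypothetical toy data
train = [{"os": "win10", "av": "A"}, {"os": "win8", "av": "B"}]
test = [{"os": "win7", "av": "A"}, {"os": "win10", "av": "C"}]

excess = excess_levels(train, test, ["os", "av"])
test_clean = replace_excess(test, excess)
```

Here `excess` maps each factor to the levels that only the test set contains, and `test_clean` holds the same rows with those levels replaced by the placeholder.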

About pyeda.py class

The class takes six values as initial parameters:

  • path_train:

    string (required)
    path to train file (*.csv)

  • path_test:

    string (optional)
    path to test file (*.csv)
    default value is None

  • id_column:

    string (optional)
    name of unique identifier
    default value is None

  • responce:

    string (optional)
    name of the column to be treated as the response
    default value is None

  • dtypes:

    dict (optional)
    dictionary with column names as keys and datatypes as values
    default value is None

  • band_suffix:

    string (optional)
    suffix that would be added to column name after banding
    default value is '_banded'
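As a quick overview, the six parameters and their defaults can be written down as a small dataclass. This is only a sketch of the signature described above: the name EdaParams, the file path, and the column names are hypothetical, and the real class in pyeda.py may be organised differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EdaParams:  # hypothetical name; the real class lives in pyeda.py
    path_train: str                   # required: path to train file (*.csv)
    path_test: Optional[str] = None   # optional: path to test file (*.csv)
    id_column: Optional[str] = None   # optional: name of unique identifier
    responce: Optional[str] = None    # optional: response column (spelled as in this README)
    dtypes: Optional[dict] = None     # optional: {column name: datatype}
    band_suffix: str = "_banded"      # suffix added to a column name after banding

# Hypothetical file and column names
params = EdaParams(path_train="train.csv", responce="target")
banded_name = "some_factor" + params.band_suffix
```

With the default suffix, a factor called some_factor would be stored as some_factor_banded after banding.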

Methods:

  • add_factor:

    method to add new factor
    Parameters:
    fac_name -- name of new factor
    fac_data_train -- data to be added to train data
    fac_data_test -- data to be added to test data (optional)

  • null_stats:

    method that generates missing-value statistics
    Parameters:
    type_ -- train or test data flag

  • band:

    method that bands a given factor
    for categorical factors, banding here means cutting off all levels that have a bigger number of entries than the threshold
    for numerical factors, on the other hand, a list is used to create bins
    Parameters:
    factor -- factor to be banded
    threshold -- value for cut off
    banding_list -- list for binning
    label_list -- list to generate labels
    type_ -- type of variable; can be 'category' for categorical or 'numeric' for numerical
    drop_factor -- should the program drop the old factor? True if yes.

  • band_list:

    method that applies the band method to many factors at once
    this method is decorated with tqdm
    Parameters:
    factor_list_dict -- factors to be banded and their thresholds or bin lists
    threshold -- value for cut off
    drop_factor -- should the program drop the old factor? True if yes.

  • one_way_plots:

    method that plots many factors at once (matplotlib)
    Parameters:
    factor_list -- factors to be plotted
    Save -- should the plot be saved?
    File_path -- path to file

  • one_way_plots_plotly:

    method that plots many factors at once (plotly)

  • missed_values_test:

    returns a dictionary with lists of excess levels per factor

  • replace_NA:

    method for replacing NaNs
    this method works differently for different data types
    Parameters:
    factor -- factors to be processed
    type_ -- can be 'NA' for categorical,
    'Mode' for categorical or numerical,
    'Numeric' for numerical

  • get_value_counts
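The two kinds of banding described for the band method can be sketched in plain Python. This is an illustration, not the implementation in pyeda.py. It assumes that categorical levels with more entries than the threshold are kept and all remaining levels are merged into a single 'Other' level, and that numerical bins are closed on the left. The function names, the 'Other' label, and the toy data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def band_categorical(values, threshold, other_label="Other"):
    """Keep levels with more than `threshold` entries; merge the rest."""
    counts = Counter(values)
    return [v if counts[v] > threshold else other_label for v in values]

def band_numeric(values, banding_list, label_list):
    """Map each value to the label of the bin [edge_i, edge_{i+1}) it falls in."""
    if len(label_list) != len(banding_list) - 1:
        raise ValueError("need exactly one label per bin")
    banded = []
    for v in values:
        i = bisect_right(banding_list, v) - 1
        # Values outside the outer edges get no label
        banded.append(label_list[i] if 0 <= i < len(label_list) else None)
    return banded

# Categorical: 'c' occurs only once, so it is merged into 'Other'
levels = ["a", "a", "a", "b", "b", "c"]
levels_banded = band_categorical(levels, threshold=1)

# Numerical: three bins defined by four edges
ages = [3, 17, 18, 64, 65, 120]
ages_banded = band_numeric(ages, [0, 18, 65, 200], ["young", "adult", "senior"])
```

A band_list-style wrapper would simply loop over a dictionary of factors and call one of these functions per factor.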
