This is the code of our final submission (https://github.com/d-eremeev and https://github.com/alexeypustynnikov).
First, we train a lasso regression (h2o package) on the processed and banded data to produce a score. The score is then passed to LightGBM as a new factor.
- pyeda.py: class with auxiliary functions for data processing (see details below)
- preprocessing.ipynb: main file for all data processing
  - Loading and feature engineering
  - Banding
  - Imputing missing values
  - Replacing values in the test set that do not exist in the train set
  - Visualization
- h2o_lasso_scoring.ipynb: creates the lasso score that is passed into LightGBM
- lightgbm_with_score.ipynb: LightGBM model
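The preprocessing step that replaces test-set values missing from the train set can be sketched roughly as below. This is a minimal illustration in plain Python, not the notebook's actual code: the function names `excess_levels` and `replace_excess_levels`, the column-as-list data layout, and the choice of the most frequent train level as the replacement are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List

Columns = Dict[str, List[str]]  # assumed layout: factor name -> list of values


def excess_levels(train: Columns, test: Columns) -> Dict[str, List[str]]:
    """Return, per factor, the sorted test levels that never occur in train."""
    result = {}
    for factor, test_values in test.items():
        extra = sorted(set(test_values) - set(train[factor]))
        if extra:
            result[factor] = extra
    return result


def replace_excess_levels(train: Columns, test: Columns) -> Columns:
    """Replace unseen test levels with the most frequent train level.

    The fallback policy is an assumption; the original notebook may differ.
    """
    cleaned = {}
    for factor, test_values in test.items():
        seen = set(train[factor])
        fallback = Counter(train[factor]).most_common(1)[0][0]
        cleaned[factor] = [v if v in seen else fallback for v in test_values]
    return cleaned


train = {"city": ["A", "A", "B"], "plan": ["x", "y", "x"]}
test = {"city": ["A", "C"], "plan": ["y", "z"]}

print(excess_levels(train, test))          # {'city': ['C'], 'plan': ['z']}
print(replace_excess_levels(train, test))  # {'city': ['A', 'A'], 'plan': ['y', 'x']}
```

Replacing unseen levels before modelling keeps the test set within the set of categories the models were fitted on.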
The class in pyeda.py takes six initial parameters:
- path_train:
string (required)
path to the train file (*.csv)
- path_test:
string (optional)
path to the test file (*.csv)
default value is None
- id_column:
string (optional)
name of the unique identifier
default value is None
- responce:
string (optional)
name of the column that will be treated as the response
default value is None
- dtypes:
dict (optional)
dictionary with column names as keys and datatypes as values
default value is None
- band_suffix:
string (optional)
suffix that is added to the column name after banding
default value is '_banded'
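To make the parameter list above concrete, here is a small sketch of a constructor with the same six parameters and defaults. The class name `EdaData`, the use of the standard `csv` module, and the interpretation of `dtypes` values as Python callables such as `int` are assumptions for illustration only; the real class in pyeda.py may load and store data differently.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, List, Optional


class EdaData:
    """Hypothetical stand-in for the class in pyeda.py (illustration only)."""

    def __init__(
        self,
        path_train: str,
        path_test: Optional[str] = None,
        id_column: Optional[str] = None,
        responce: Optional[str] = None,  # spelling follows the documented parameter
        dtypes: Optional[Dict[str, Callable]] = None,
        band_suffix: str = "_banded",
    ) -> None:
        self.id_column = id_column
        self.responce = responce
        self.dtypes = dtypes or {}
        self.band_suffix = band_suffix
        self.train = self._read(path_train)
        self.test = self._read(path_test) if path_test is not None else None

    def _read(self, path: str) -> List[dict]:
        """Read a CSV file into a list of row dicts and apply the dtypes mapping."""
        with open(path, newline="") as handle:
            rows = list(csv.DictReader(handle))
        for row in rows:
            for column, caster in self.dtypes.items():
                if column in row and row[column] != "":
                    row[column] = caster(row[column])
        return rows


# Usage with a throwaway train file; only path_train is required.
with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.csv")
    with open(train_path, "w", newline="") as handle:
        handle.write("id,age,target\n1,34,0\n2,51,1\n")
    data = EdaData(train_path, id_column="id", responce="target", dtypes={"age": int})

print(data.train[0])     # first train row, with age cast to int
print(data.band_suffix)  # _banded
```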
Methods:
- add_factor:
method to add a new factor
Parameters:
fac_name -- name of the new factor
fac_data_train -- data to be added to the train data
fac_data_test -- data to be added to the test data (optional)
- null_stats:
method that generates missing-value statistics
Parameters:
type_ -- train or test data flag
- band:
method that bands a given factor
for categorical factors, banding cuts off all levels that have more entries than the threshold
for numerical factors, a list is used to create the bins
Parameters:
factor -- factor to be banded
threshold -- value for the cut-off
banding_list -- list for binning
label_list -- list to generate labels
type_ -- type of variable;
can be 'category' for categorical or 'numeric' for numerical
drop_factor -- whether the old factor should be dropped (True to drop it)
- band_list:
method that applies the band method to many factors at once
this method is decorated with tqdm
Parameters:
factor_list_dict -- factors to be banded together with their thresholds or bin lists
threshold -- value for the cut-off
drop_factor -- whether the old factor should be dropped (True to drop it)
- one_way_plots:
method that plots many factors at once (matplotlib)
Parameters:
factor_list -- factors to be plotted
Save -- whether the plot should be saved
File_path -- path to the file
- one_way_plots_plotly:
method that plots many factors at once (plotly)
- missed_values_test:
returns a dictionary with lists of excess levels per factor
- replace_NA:
method for replacing NaNs
this method works differently for different data types
Parameters:
factor -- factors to be processed
type_ -- can be 'NA' for categorical,
'Mode' for categorical or numerical,
'Numeric' for numerical
- get_value_counts
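The banding described for the band method above can be illustrated with a short plain-Python sketch. It is only one plausible reading of that description: the function names, the "other" label for merged levels, the rule that levels with more entries than the threshold are kept while the rest are merged, the right-closed bins, and the direct use of `label_list` as bin labels are all assumptions, and the actual implementation in pyeda.py may behave differently.

```python
import bisect
from collections import Counter
from typing import List, Optional, Sequence


def band_categorical(values: Sequence[str], threshold: int,
                     other_label: str = "other") -> List[str]:
    """Keep levels with more than `threshold` entries; merge the rest.

    Assumed reading of the threshold rule, not confirmed by the source.
    """
    counts = Counter(values)
    return [v if counts[v] > threshold else other_label for v in values]


def band_numeric(values: Sequence[float], banding_list: Sequence[float],
                 label_list: Sequence[str]) -> List[Optional[str]]:
    """Assign each value to a right-closed bin (edge_i, edge_{i+1}].

    `banding_list` holds sorted bin edges; values outside all bins map to None.
    """
    if len(label_list) != len(banding_list) - 1:
        raise ValueError("need exactly one label per bin")
    banded = []
    for value in values:
        index = bisect.bisect_left(banding_list, value) - 1
        in_range = 0 <= index < len(label_list)
        banded.append(label_list[index] if in_range else None)
    return banded


levels = ["a", "a", "a", "b", "b", "c"]
print(band_categorical(levels, threshold=1))
# ['a', 'a', 'a', 'b', 'b', 'other']

ages = [5, 18, 40, 70, 130]
print(band_numeric(ages, [0, 18, 65, 120], ["child", "adult", "senior"]))
# ['child', 'child', 'adult', 'senior', None]
```

In the class itself the banded result would be stored under the original column name plus `band_suffix`, for example `age_banded` with the default suffix.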