Welcome to my GitHub repository dedicated to developing strong supervised machine learning models for classification and regression tasks. My primary focus is on achieving excellent scores in Kaggle.com competitions, where I strive to push the boundaries of model performance.
Join me on this exciting journey of developing high-performing classification and regression models. Let's unlock the full potential of supervised learning and make our mark in Kaggle.com competitions. Happy coding, and may our models achieve excellence!
Titanic - A classification problem. I have implemented some best practices, developing three models of increasing complexity.
Skills: Data wrangling and cleaning | Feature engineering | Model training and evaluation | Hyperparameter tuning | Ensemble learning.
Accomplishments: Developed custom functions to create new features based on domain knowledge. | Utilized IterativeImputer and KNNImputer to handle missing values. | Tuned hyperparameters using GridSearchCV. | Enhanced my functions into transformers using FunctionTransformer. | Employed SimpleImputer with different strategies for numeric and categorical features. | Improved my custom transformers, incorporating the Cabin feature.
- Basic - I wrote my own functions to create new features and encode them using domain knowledge. With all the features encoded, I used `IterativeImputer` and `KNNImputer`, trying different parameters for `imputation_order` and `n_neighbors`. I created a main `Pipeline` and a `GridSearchCV` to tune hyperparameters.
- 02 - I improved my functions and turned them into transformers using `FunctionTransformer`, which I then used in a `FeatureUnion`. This time I tried `SimpleImputer` with the `mean`, `median`, `constant`, and `most_frequent` strategies for numeric features and `most_frequent` for categorical features.
- 03 - I improved my own transformers. This time I included and encoded the `Cabin` feature, which I hadn't used in the previous two procedures. I again used `IterativeImputer` and `KNNImputer`, trying different parameters for `imputation_order` and `n_neighbors`.
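The third procedure can be sketched roughly like this. The column names, the deck mapping, and the final estimator are illustrative assumptions, not the exact code from the notebooks:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_deck(df):
    # Hypothetical Cabin encoding: first letter of the cabin -> deck number.
    out = df.copy()
    out["Deck"] = out["Cabin"].str[0].map(
        {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7})
    return out.drop(columns=["Cabin"])

pipe = Pipeline([
    ("features", FunctionTransformer(add_deck)),
    ("impute", IterativeImputer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Swap the whole imputer inside the grid so both are compared, each with
# its own parameters (imputation_order vs. n_neighbors).
param_grid = [
    {"impute": [IterativeImputer()],
     "impute__imputation_order": ["ascending", "random"]},
    {"impute": [KNNImputer()],
     "impute__n_neighbors": [3, 5, 7]},
]
search = GridSearchCV(pipe, param_grid, cv=3)
```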
House Prices - A regression problem.
General Info
- Every step in the process is a function (e.g. imputation, reducing cardinality).
- I created custom transformers so I can impute `NaN` values inside a `Pipeline`.
- I preserved column names when encoding ordinal and nominal variables by using the `category_encoders` library.
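A minimal sketch of such a custom transformer, assuming a simple per-column median fill (the actual imputation logic in the repo differs); the point is that returning a DataFrame preserves column names:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class NamedMedianImputer(BaseEstimator, TransformerMixin):
    """Impute NaNs with per-column medians, returning a DataFrame."""

    def fit(self, X, y=None):
        # Learn one median per numeric column; NaNs are skipped by default.
        self.medians_ = X.median(numeric_only=True)
        return self

    def transform(self, X):
        # fillna with a Series aligns on column names, so names survive.
        return X.fillna(self.medians_)

df = pd.DataFrame({"LotArea": [8450.0, None, 11250.0]})
imputed = NamedMedianImputer().fit_transform(df)
```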
Outliers:
- I remove them by domain knowledge,
- by applying `IsolationForest`, or
- by analyzing residuals (e.g. greater than 3 standard deviations).
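A sketch of the last two checks on synthetic data: `IsolationForest` flags anomalies, and residuals of a simple linear fit beyond 3 standard deviations mark outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=200)
y[0] += 20.0  # plant one obvious outlier

# Check 1: IsolationForest marks anomalous (feature, target) pairs with -1.
iso_mask = IsolationForest(random_state=0).fit_predict(np.c_[X, y]) == 1

# Check 2: keep points whose residual is within 3 standard deviations.
resid = y - LinearRegression().fit(X, y).predict(X)
resid_mask = np.abs(resid - resid.mean()) <= 3 * resid.std()

clean_X, clean_y = X[iso_mask & resid_mask], y[iso_mask & resid_mask]
```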
Categorical features:
- To reduce the number of dimensions that `OneHotEncoder` will produce, I either:
  - collapse the less frequent categories into a single category, 'Others', or
  - create clusters using the `KMeans` algorithm from scikit-learn.
- To encode an ordinal variable, I created a dictionary that maps each category to its corresponding order and passed it to the `OrdinalEncoder` instance.
- To encode nominal variables, I used `OneHotEncoder` from `category_encoders`.
- For ordinal cyclical variables, I calculated their sine and cosine components.
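A small sketch of the ordinal and cyclical encodings. It uses scikit-learn's `OrdinalEncoder` with an explicit category order in place of the `category_encoders` mapping dictionary, and the quality scale is an assumed example:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed worst-to-best quality scale for an Ames-style ordinal column.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
enc = OrdinalEncoder(categories=[quality_order])

df = pd.DataFrame({"KitchenQual": ["TA", "Ex", "Fa"], "MoSold": [1, 6, 12]})
df[["KitchenQual"]] = enc.fit_transform(df[["KitchenQual"]])

# Cyclical encoding: December (12) ends up adjacent to January (1).
df["MoSold_sin"] = np.sin(2 * np.pi * df["MoSold"] / 12)
df["MoSold_cos"] = np.cos(2 * np.pi * df["MoSold"] / 12)
```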
Numerical Features:
- I treated skewness (> 0.7) of numerical variables with the Yeo-Johnson technique in `PowerTransformer` to make them more Gaussian-like.
- I scaled variables using `RobustScaler` because of outliers.
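A sketch of that numeric treatment on synthetic lognormal data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, RobustScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({"LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=500)})

# Columns whose absolute skew exceeds the 0.7 threshold get Yeo-Johnson.
skew = df.skew().abs()
skewed_cols = skew[skew > 0.7].index
df[skewed_cols] = PowerTransformer(method="yeo-johnson").fit_transform(df[skewed_cols])

# RobustScaler centers on the median and scales by the IQR, so any
# remaining outliers influence the scaling less than with StandardScaler.
scaled = RobustScaler().fit_transform(df)
```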
Feature Selection:
- I scored features with `mutual_info_regression`.
- I dropped features with mutual information = 0.
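A sketch of that filter on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 2 * X["informative"] + rng.normal(scale=0.1, size=300)

# Score each feature against the target; keep only positive scores.
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
kept = X[mi[mi > 0].index]
```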
Feature Engineering:
- By domain knowledge (e.g. combining different measurements of the land).
- I created polynomial interaction features from the 20 features with the highest mutual information.
Feature Importance:
- I used the feature importances from XGBoost in `SelectFromModel` to keep only the most important features.
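A sketch of the importance filter; scikit-learn's `GradientBoostingRegressor` stands in for `XGBRegressor` here to keep the example dependency-free:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=0)

# By default SelectFromModel keeps features whose importance is at least
# the mean importance across all features.
selector = SelectFromModel(GradientBoostingRegressor(random_state=0))
X_reduced = selector.fit_transform(X, y)
```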
Modeling:
I tuned the hyperparameters manually or with `GridSearchCV` and `RandomizedSearchCV` for the following algorithms:
- `TheilSenRegressor`, `HuberRegressor`, `RANSACRegressor`, `Ridge`, `ElasticNet`, `ElasticNetCV`, `LinearRegression`, `XGBRegressor`, `catboost.CatBoostRegressor`
- I then combined them in a `StackingRegressor` with a `LinearRegression` as the meta-model.
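A sketch of the stacking step with scikit-learn base learners only (the repo also stacks `XGBRegressor` and `CatBoostRegressor`):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import (ElasticNet, HuberRegressor, LinearRegression,
                                  Ridge)

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Base learners are fit on cross-validated folds; the meta-model then
# learns how to combine their out-of-fold predictions.
stack = StackingRegressor(
    estimators=[("ridge", Ridge()),
                ("enet", ElasticNet()),
                ("huber", HuberRegressor(max_iter=500))],
    final_estimator=LinearRegression(),
)
stack.fit(X, y)
score = stack.score(X, y)
```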