antirrabia/Machine-Learning

Machine Learning Repository for High-Performing Classification and Regression Models

Welcome to my GitHub repository dedicated to developing supervised machine learning models for classification and regression tasks. My primary focus is on achieving strong scores in Kaggle.com competitions, where I strive to push the boundaries of model performance.

Join me on this exciting journey of developing high-performing classification and regression models. Let's unlock the full potential of supervised learning and make our mark in Kaggle.com competitions. Happy coding, and may our models achieve excellence!


Classification

Titanic - A classification problem. Applying several best practices, I developed three models of increasing complexity.

Skills: Data wrangling and cleaning | Feature engineering | Model training and evaluation | Hyperparameter tuning | Ensemble learning.

Accomplishments: Developed custom functions to create new features based on domain knowledge. | Utilized IterativeImputer and KNNImputer to handle missing values. | Tuned hyperparameters using GridSearchCV. | Turned my functions into transformers using FunctionTransformer. | Employed SimpleImputer with different strategies for numeric and categorical features. | Improved my custom transformers, incorporating the Cabin feature.

  • Basic - I wrote my own functions to create and encode new features using domain knowledge. With all features encoded, I applied IterativeImputer and KNNImputer, trying different values for imputation_order and n_neighbors. I built a main Pipeline and used GridSearchCV to tune hyperparameters.
  • 02 - I turned my functions into transformers using FunctionTransformer and combined them in a FeatureUnion. This time I tried SimpleImputer with the mean, median, constant, and most_frequent strategies for numeric features and most_frequent for categorical features.
  • 03 - I improved my custom transformers, this time including and encoding the Cabin feature, which the previous two approaches did not use. I again applied IterativeImputer and KNNImputer, trying different values for imputation_order and n_neighbors.
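The Basic approach can be sketched as a single pipeline plus one grid search that swaps imputers and their key parameters. This is a minimal illustration on synthetic data, not the repository's code: `add_family_size` and the column names are hypothetical stand-ins for the Titanic features.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_family_size(X):
    # hypothetical domain-knowledge feature: family size = SibSp + Parch + 1
    X = X.copy()
    X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
    return X

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": np.where(rng.random(200) < 0.2, np.nan, rng.normal(30, 10, 200)),
    "Fare": rng.exponential(30, 200),
    "SibSp": rng.integers(0, 4, 200),
    "Parch": rng.integers(0, 3, 200),
})
y = rng.integers(0, 2, 200)

pipe = Pipeline([
    ("features", FunctionTransformer(add_family_size)),
    ("impute", IterativeImputer(random_state=0)),
    ("model", LogisticRegression(max_iter=1000)),
])

# one search covers both imputers and their respective parameters
param_grid = [
    {"impute": [IterativeImputer(random_state=0)],
     "impute__imputation_order": ["ascending", "random"]},
    {"impute": [KNNImputer()],
     "impute__n_neighbors": [3, 5]},
]
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(df, y)
print(search.best_params_)
```

Listing the imputers as values of the `"impute"` step lets a single GridSearchCV compare IterativeImputer's imputation_order against KNNImputer's n_neighbors.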

Regression


House Prices - A regression problem.

Final Version

General Info
- Every step in the process is a function (e.g. imputation, reducing cardinality).
- I created custom transformers so I can impute NaN values inside a Pipeline.
- I preserved column names when encoding ordinal and nominal variables by using the category_encoders library.
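A minimal sketch of the custom-transformer idea: a class with fit/transform that learns a statistic from the training data and fills NaNs, so it can sit inside a Pipeline. The `MedianImputer` class and the `LotFrontage` column are illustrative, not the repository's code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MedianImputer(BaseEstimator, TransformerMixin):
    """Impute NaNs in one column with the training-set median."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # learn the statistic on the training data only
        self.median_ = X[self.column].median()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.column] = X[self.column].fillna(self.median_)
        return X

df = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
out = MedianImputer("LotFrontage").fit_transform(df)
print(out["LotFrontage"].tolist())
```

Because it subclasses BaseEstimator and TransformerMixin, the transformer composes with Pipeline and its parameters are visible to GridSearchCV.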

Outliers:
- I removed them using domain knowledge,
- by applying IsolationForest, and
- by analyzing residuals (e.g. dropping rows whose residual exceeds 3 standard deviations).
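The two automated filters can be sketched as follows on synthetic data; the 5% contamination rate and the planted outliers are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=300)
X[:5] += 8  # plant a few obvious outliers

# 1) IsolationForest: keep rows labelled +1 (inliers)
mask_if = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == 1

# 2) residual analysis: drop rows whose residual exceeds 3 standard deviations
resid = y - LinearRegression().fit(X, y).predict(X)
mask_resid = np.abs(resid - resid.mean()) <= 3 * resid.std()

X_clean, y_clean = X[mask_if & mask_resid], y[mask_if & mask_resid]
print(X_clean.shape)
```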

Categorical features:
- To reduce the number of dimensions that OneHotEncoder will produce, I either
  - collapse the less frequent categories into a single 'Others' category, or
  - create clusters using the KMeans algorithm from scikit-learn.
- To encode an ordinal variable, I created a dictionary that maps each category to its order and passed it to the OrdinalEncoder instance.
- To encode nominal variables, I used OneHotEncoder from category_encoders.
- Ordinal cyclical variables: I calculated their sine and cosine components.
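These treatments can be sketched on a toy frame. The column names, the rarity threshold of 5, and the quality order are illustrative assumptions, and scikit-learn's OrdinalEncoder (with an explicit category order) stands in for the dictionary-plus-encoder step described above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "Neighborhood": ["A"] * 50 + ["B"] * 40 + ["C"] * 3 + ["D"] * 2,
    "ExterQual": ["Po", "Fa", "TA", "Gd", "Ex"] * 19,
    "MoSold": list(range(1, 13)) * 7 + list(range(1, 12)),
})

# collapse categories seen fewer than 5 times into 'Others'
counts = df["Neighborhood"].value_counts()
rare = counts[counts < 5].index
df["Neighborhood"] = df["Neighborhood"].where(
    ~df["Neighborhood"].isin(rare), "Others")

# ordinal encoding with an explicit quality order (Po < Fa < TA < Gd < Ex)
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
enc = OrdinalEncoder(categories=[quality_order])
df["ExterQual"] = enc.fit_transform(df[["ExterQual"]]).ravel()

# cyclical encoding: month 12 ends up adjacent to month 1
df["MoSold_sin"] = np.sin(2 * np.pi * df["MoSold"] / 12)
df["MoSold_cos"] = np.cos(2 * np.pi * df["MoSold"] / 12)
print(sorted(df["Neighborhood"].unique()))
```

The sin/cos pair keeps December and January close together in feature space, which a plain integer encoding would not.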

Numerical Features:
- I treated skewness (> 0.7) of numerical variables with the Yeo-Johnson method of PowerTransformer to make them more Gaussian-like.
- I scaled variables with RobustScaler because of the outliers.
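A sketch of both steps, assuming two synthetic columns (one heavily skewed, one roughly Gaussian); the 0.7 threshold comes from the text.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, RobustScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "LotArea": rng.lognormal(9, 0.8, 500),   # strongly right-skewed
    "YearBuilt": rng.normal(1970, 20, 500),  # roughly symmetric
})

# apply Yeo-Johnson only to columns whose absolute skewness exceeds 0.7
skewed = [c for c in X.columns if abs(skew(X[c])) > 0.7]
X[skewed] = PowerTransformer(method="yeo-johnson").fit_transform(X[skewed])

# robust scaling (median / IQR) is less sensitive to remaining outliers
X_scaled = RobustScaler().fit_transform(X)
print(skewed)
```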

Feature Selection:
- Scored features using mutual_info_regression.
- Dropped features with a mutual information of 0.
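The selection step can be sketched as follows, on a synthetic regression problem standing in for the House Prices data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

X, y = make_regression(n_samples=400, n_features=10, n_informative=4,
                       noise=0.1, random_state=0)

mi = mutual_info_regression(X, y, random_state=0)
keep = mi > 0  # drop features carrying zero mutual information
X_sel = X[:, keep]
print(X_sel.shape)
```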

Feature Engineering:
- By domain knowledge (e.g. combining different measurements of the land).
- Created polynomial interaction features from the 20 features with the highest mutual information.
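The interaction step might look like this sketch: rank features by mutual information, take the top 20, and append their pairwise interactions. The data is synthetic; with interaction_only the first 20 output columns of PolynomialFeatures are the originals, so only the remaining columns are appended.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       random_state=1)

mi = mutual_info_regression(X, y, random_state=1)
top20 = np.argsort(mi)[::-1][:20]  # indices of the 20 highest-MI features

# degree-2, interaction-only: 20 originals + C(20, 2) = 190 pairwise products
inter = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X[:, top20])
X_aug = np.hstack([X, inter[:, 20:]])  # append only the interaction terms
print(X_aug.shape)
```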

Feature Importance:
- I used the feature importances from XGBoost in SelectFromModel to keep only the most important features.
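A sketch of importance-based filtering with SelectFromModel; scikit-learn's GradientBoostingRegressor stands in for XGBRegressor so the example carries no extra dependency, and the median threshold is an illustrative choice.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       random_state=0)

# keep features whose importance is at least the median importance
selector = SelectFromModel(GradientBoostingRegressor(random_state=0),
                           threshold="median").fit(X, y)
X_imp = selector.transform(X)
print(X_imp.shape)
```

With XGBoost installed, swapping in `XGBRegressor()` as the estimator is a drop-in change.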

Modeling:
I tuned hyperparameters manually or with GridSearchCV and RandomizedSearchCV for the following algorithms:
- TheilSenRegressor, HuberRegressor, RANSACRegressor, Ridge, ElasticNet, ElasticNetCV, LinearRegression, XGBRegressor, catboost.CatBoostRegressor
- I then combined them in a StackingRegressor with LinearRegression as the meta-model.
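The stacking step can be sketched with a subset of the listed estimators (Ridge, ElasticNet, HuberRegressor) on synthetic data; the full repository stack also includes the XGBoost and CatBoost models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import (ElasticNet, HuberRegressor,
                                  LinearRegression, Ridge)
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, n_informative=8,
                       noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("enet", ElasticNet()),
        ("huber", HuberRegressor()),
    ],
    final_estimator=LinearRegression(),  # meta-model fits on base predictions
)
score = cross_val_score(stack, X, y, cv=3).mean()
print(round(score, 3))
```

StackingRegressor trains the meta-model on out-of-fold predictions of the base estimators, which limits leakage from the base models into the meta-model.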

About

Supervised learning algorithms used for classification and regression analysis.
