
Taylor and Francis Book Publication (Routledge) : DIMENSIONALITY REDUCTION ALGORITHMS IN APPLIED MACHINE LEARNING


paltannistha/DIMENSIONALITY_REDUCTION

 
 


DSDA_2020 (Data Science and Data Analytics: Opportunities and Challenges)

High Dimensionality Dataset Reduction Methodologies in Applied Machine Learning

Farhan Hai Khan (a), Tannistha Pal (b)

a. Department of Electrical Engineering, Institute of Engineering & Management, Kolkata, India, khanfarhanpro@gmail.com
b. Department of Electronics and Communication Engineering, Institute of Engineering & Management, Kolkata, India, paltannistha@gmail.com

Abstract

A common problem faced while handling multi-featured datasets is the high dimensionality they often possess, which creates barriers to generalized, hands-on Machine Learning. High-dimensional datasets also have a drastic impact on the performance of Machine Learning algorithms: they are memory inefficient and frequently lead to model overfitting. It also becomes difficult to visualize the data or gain insightful knowledge about its features, such as the presence of outliers.

This chapter will help data analysts reduce data dimensionality using various methodologies such as:

  1. Feature Selection using Covariance Matrix
  2. Principal Component Analysis (PCA)
  3. t-distributed Stochastic Neighbour Embedding (t-SNE)

Under applications of Dimensionality Reduction Algorithms with Visualizations, we first introduce the Boston Housing Dataset, use the Correlation Matrix to apply Feature Selection on the strongly positively correlated data, and perform Simple Linear Regression over the new features. Then we use the UCI Breast Cancer Dataset to perform PCA with Support Vector Machine (SVM) classification. Lastly, we apply t-SNE to the MNIST Handwritten Digits Dataset and use k-Nearest Neighbours (kNN) for classification.

Finally, we explore the benefits of using Dimensionality Reduction Methods and provide a comprehensive overview of reduction in storage space, efficient models, feature selection guidelines, redundant data removal and outlier analysis.

Keywords: Dimensionality Reduction, Feature Selection, Covariance Matrix, PCA, t-SNE

Table of Contents

  1. Problems faced with Multi-Dimensional Datasets
    1. Data Intuition
    2. Data Visualization Constraints
    3. Outlier Detection
  2. Dimensionality Reduction Algorithms with Visualizations
    1. Feature Selection using Covariance Matrix
    2. Principal Component Analysis (PCA)
    3. t-distributed Stochastic Neighbour Embedding (t-SNE)
  3. Benefits of Dimensionality Reduction
    1. Storage Space Reduction
    2. Computation Time Optimization
    3. Redundant Feature Removal
    4. Incorrect Data Removal

Problems faced with Multi-Dimensional Datasets

TODO

Dimensionality Reduction Algorithms with Visualizations

Feature Selection using Covariance Matrix

Objectives: Introduce the Boston Housing Dataset, use the Correlation Matrix to apply Feature Selection on the strongly positively correlated data, and perform Simple Linear Regression over the new features.

https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

https://www.geeksforgeeks.org/ml-boston-housing-kaggle-challenge-with-linear-regression/
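For quick reference (a brief refresher, not part of the original notebook code), the correlation matrix used below is built from these standard quantities; pandas' df.corr() computes the Pearson coefficient by default:

% Sample covariance between two features x and y over n observations
\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

% Pearson correlation coefficient (what df.corr() reports), always in [-1, 1]
r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y}

Features whose correlation with the target MEDV is close to +1 or -1 are the candidates kept during Feature Selection.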

# We will need 3 datasets for this chapter, each of which has been documented in our GitHub repository.
# So let us create a local copy (clone) of that repo here :)
!git clone https://github.com/khanfarhan10/DIMENSIONALITY_REDUCTION.git
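As a quick sanity check (a minimal sketch; the data/ folder name is assumed from the pd.read_excel call used later in this notebook), we can list the dataset files that were cloned:

import os

# List the datasets shipped with the cloned repository
# (path assumed from the read_excel call further below).
print(os.listdir("DIMENSIONALITY_REDUCTION/data"))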

<imports.py>

# Firstly we will import all the necessary libraries that we will be requiring for Dataset Reductions.

import numpy as np              # Mathematical Functions, Linear Algebra, Matrix Operations
import pandas as pd             # Data Manipulation, Data Analysis/Storing/Preparation


import matplotlib.pyplot as plt # Simple Data Visualization, Basic Plotting Utilities
import seaborn as sns           # Advanced Data Visualization, High Level Figures Interfacing

%matplotlib notebook
# Interactive Jupyter Notebook Plotting
# %matplotlib inline            # This can be used as an alternative, but the plots obtained will be non-interactive in nature.
# Importing Data 
from sklearn.datasets import load_boston
boston_dataset = load_boston()
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
df['MEDV'] = boston_dataset.target
df.head(10)
# Optionally, save a local copy for later reuse: df.to_excel("Boston_Data.xlsx", index=False)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21 28.7
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395.60 12.43 22.9
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15 27.1
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93 16.5
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386.71 17.10 18.9

boston_dataset.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
boston_dataset.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
boston_dataset.DESCR
".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:**  \n\n    :Number of Instances: 506 \n\n    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n    :Attribute Information (in order):\n        - CRIM     per capita crime rate by town\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n        - INDUS    proportion of non-retail business acres per town\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n        - NOX      nitric oxides concentration (parts per 10 million)\n        - RM       average number of rooms per dwelling\n        - AGE      proportion of owner-occupied units built prior to 1940\n        - DIS      weighted distances to five Boston employment centres\n        - RAD      index of accessibility to radial highways\n        - TAX      full-value property-tax rate per $10,000\n        - PTRATIO  pupil-teacher ratio by town\n        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n        - LSTAT    % lower status of the population\n        - MEDV     Median value of owner-occupied homes in $1000's\n\n    :Missing Attribute Values: None\n\n    :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980.   N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems.   \n     \n.. topic:: References\n\n   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n"
boston_dataset.filename
'/usr/local/lib/python3.6/dist-packages/sklearn/datasets/data/boston_house_prices.csv'
# Alternatively, load the saved copy from the cloned repository:
df = pd.read_excel("DIMENSIONALITY_REDUCTION/data/Boston_Data.xlsx")
df.isnull().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
df.describe()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
plt.style.use("dark_background")
# COLORS = ["lime"]*13 + ["red"]   # (unused) idea for a per-column palette; try "cyan", "magenta" or "lime"
df.hist(bins=30, figsize=(20,10), grid=False, color="crimson");
plt.suptitle("Boston Dataset : Frequency Distribution of Numerical Data")  # suptitle so the title covers the whole grid of subplots
plt.savefig("Boston Dataset Frequency Distribution of Numerical Data.png", dpi=600)

Figure: Frequency distributions (histograms) of the Boston Housing features.

correlation_matrix = df.corr().round(1)
# annot=True prints the correlation value inside each square
plt.figure(figsize=(20,10))
sns.heatmap(data=correlation_matrix, cmap="inferno", annot=True)
plt.savefig("Correlation_Data.png", dpi=600)

Figure: Correlation heatmap of the Boston Housing features (values rounded to one decimal place).
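The regression below is fit on all 13 predictors; as a hedged sketch of the feature-selection step named in the objectives (the |r| >= 0.5 cutoff is an illustrative assumption, not a value prescribed by the chapter), one could keep only the features strongly correlated with the target MEDV:

# Illustrative sketch: keep only features whose correlation with MEDV is strong.
corr_with_target = df.corr()["MEDV"].drop("MEDV")   # correlation of each predictor with the target
selected_features = corr_with_target[corr_with_target.abs() >= 0.5].index.tolist()
print("Selected features:", selected_features)

X_selected = df[selected_features]   # reduced design matrix
y_target = df["MEDV"]

The reduced X_selected can then be passed through the same train_test_split / LinearRegression workflow shown below.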

df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')
X = df.drop(columns=['MEDV'])
y = df['MEDV']

print('Shape of X : {} , Shape of y : {}'.format(X.shape,y.shape))
Shape of X : (506, 13) , Shape of y : (506,)
# Normalization (standardization) of the Data
from sklearn.preprocessing import StandardScaler

scaler_X = StandardScaler()
X_norm = scaler_X.fit_transform(X)

scaler_y = StandardScaler()
y_norm = scaler_y.fit_transform(y.values.reshape(-1, 1))  # scale the target (StandardScaler expects a 2D array), not X
# Note: the regression below is fit on the unscaled X and y; the scaled copies are kept for reference.
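For reference, StandardScaler performs standard z-scoring, transforming each feature to zero mean and unit variance:

% z-score applied independently to every feature by StandardScaler
z = \frac{x - \mu}{\sigma}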


"abc {} xyz {}".format(1,2)
'abc 1 xyz 2'
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('Shape of X_train : {} , Shape of X_test : {}'.format(X_train.shape, X_test.shape))
print('Shape of y_train : {} , Shape of y_test : {}'.format(y_train.shape, y_test.shape))
Shape of X_train : (354, 13) , Shape of X_test : (152, 13)
Shape of y_train : (354,) , Shape of y_test : (152,)
from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
from sklearn.metrics import mean_squared_error, mean_absolute_error

# For a regressor, .score() returns the coefficient of determination (R^2) on the given data.
accLR = lin_model.score(X_test, y_test)
print(accLR)
0.7112260057484874
lin_model.coef_ #theta 1-13
array([-1.33470103e-01,  3.58089136e-02,  4.95226452e-02,  3.11983512e+00,
       -1.54170609e+01,  4.05719923e+00, -1.08208352e-02, -1.38599824e+00,
        2.42727340e-01, -8.70223437e-03, -9.10685208e-01,  1.17941159e-02,
       -5.47113313e-01])
lin_model.coef_.shape
(13,)
lin_model.intercept_ #theta 0
31.631084035691632
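Putting the intercept (theta 0) and the 13 coefficients (theta 1-13) printed above together, the fitted model predicts the median house value MEDV as a linear combination of the features:

% Fitted linear regression hypothesis: intercept plus one coefficient per feature
\hat{y} = \theta_0 + \sum_{j=1}^{13} \theta_j x_j = \theta_0 + \theta_1\,\mathrm{CRIM} + \theta_2\,\mathrm{ZN} + \dots + \theta_{13}\,\mathrm{LSTAT}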
df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')
# Coefficient & Intercept :
def merge_list_to_dict(test_keys, test_values):
  """Uses dictionary comprehension to create a dict from 2 lists"""
  merged_dict = {test_keys[i]: test_values[i] for i in range(len(test_keys))}
  return merged_dict

def get_params(model_intercept, model_coefficients, data_cols):
  """Returns a dataframe of organised values for model parameters, with columns of the original dataframe"""
  res_theta = [model_intercept] + list(model_coefficients)
  res_y = ['INTERCEPT'] + list(data_cols)
  dict_res = merge_list_to_dict(res_y, res_theta)
  data = [dict_res]
  results = pd.DataFrame(data)
  return results

res=get_params(lin_model.intercept_,lin_model.coef_,X.columns)
res.head()
INTERCEPT CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 31.631084 -0.13347 0.035809 0.049523 3.119835 -15.417061 4.057199 -0.010821 -1.385998 0.242727 -0.008702 -0.910685 0.011794 -0.547113
help(get_params)
Help on function get_params in module __main__:

get_params(model_intercept, model_coefficients, data_cols)
    Returns a dataframe of organised values for model parameters, with columns of the original dataframe
# Quick demonstration that help() simply displays a function's docstring
def tanni():
  """Meow meow meow"""
  return 1
help(tanni)
Help on function tanni in module __main__:

tanni()
    Meow meow meow
# Lasso/Ridge Regression (TODO)






preds=lin_model.predict(X_test)

preds.mean()
21.33015845218848
y_test.mean()
21.407894736842103
# Ad-hoc check: mean absolute error scaled by the number of test samples.
# Note that 1 - MAE/n is not a standard accuracy metric; the proper regression
# scores (RMSE, R2, MAE) are computed with sklearn below.
temp = []
for a, b in zip(preds, y_test):
  temp.append(abs(a - b))
temp = np.array(temp) / len(temp)
1 - temp.mean()
0.9791926982140957


df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')



# Quick check of Python's round() (used for rounding the scores in getevaluation below)
round(2.7346374, 2)
2.73
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def getevaluation(model, X_subset, y_subset, subset_type="Train", round_scores=None):
  """Prints RMSE, R2 and MAE of `model` on the given subset, optionally rounded to `round_scores` decimals."""
  y_subset_predict = model.predict(X_subset)
  rmse = np.sqrt(mean_squared_error(y_subset, y_subset_predict))
  r2 = r2_score(y_subset, y_subset_predict)
  mae = mean_absolute_error(y_subset, y_subset_predict)
  if round_scores is not None:
    rmse = round(rmse, round_scores)
    r2 = round(r2, round_scores)
    mae = round(mae, round_scores)

  print("Model Performance for {} subset ::\nRMSE: {} | R2 score: {} | MAE: {}".format(subset_type, rmse, r2, mae))

getevaluation(model=lin_model,X_subset=X_train,y_subset=y_train,subset_type="Train",round_scores=2)
getevaluation(model=lin_model,X_subset=X_test,y_subset=y_test,subset_type="Test",round_scores=2)
Model Performance for Train subset ::
RMSE: 4.75 | R2 score: 0.74 | MAE: 3.36
Model Performance for Test subset ::
RMSE: 4.64 | R2 score: 0.71 | MAE: 3.16
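For completeness, the three scores reported by getevaluation (and by the sklearn calls below) are defined as follows, with y_i the true values, \hat{y}_i the predictions and \bar{y} the mean of the true values:

% Root Mean Squared Error
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

% Mean Absolute Error
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

% Coefficient of determination
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}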







from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# model evaluation for training set
y_train_predict = lin_model.predict(X_train)



rmse = (np.sqrt(mean_squared_error(y_train, y_train_predict)))
r2 = r2_score(y_train, y_train_predict)

print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")

# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
r2 = r2_score(y_test, y_test_predict)

print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))

The model performance for training set
--------------------------------------
RMSE is 4.748208239685937
R2 score is 0.7434997532004697


The model performance for testing set
--------------------------------------
RMSE is 4.638689926172867
R2 score is 0.7112260057484874






sns.pairplot(df);

Figure: Pairwise scatter plots and per-feature distributions (seaborn pairplot) of all Boston Housing columns.




References

TODO
