Farhan Hai Khan (a), Tannistha Pal (b)
a. Department of Electrical Engineering, Institute of Engineering & Management, Kolkata, India, khanfarhanpro@gmail.com
b. Department of Electronics and Communication Engineering, Institute of Engineering & Management, Kolkata, India, paltannistha@gmail.com
A common problem faced while handling multi-featured datasets is their high dimensionality, which creates barriers to generalized, hands-on Machine Learning. Such datasets also drastically degrade the performance of Machine Learning algorithms, being memory inefficient and frequently leading to model overfitting. It often becomes difficult to visualize the data or gain insightful knowledge about its features, such as the presence of outliers.
This chapter will help data analysts reduce data dimensionality using various methodologies such as:
- Feature Selection using the Correlation Matrix
- Principal Component Analysis (PCA)
- t-distributed Stochastic Neighbour Embedding (t-SNE)
Under applications of Dimensionality Reduction Algorithms with Visualizations, we first introduce the Boston Housing Dataset, use its Correlation Matrix to apply Feature Selection on the strongly positively correlated features, and perform Simple Linear Regression over the new features. Then we use the UCI Breast Cancer Dataset to perform PCA with Support Vector Machine (SVM) classification. Lastly, we apply t-SNE to the MNIST Handwritten Digits Dataset and use k-Nearest Neighbours (kNN) classification.
Finally, we explore the benefits of using Dimensionality Reduction Methods and provide a comprehensive overview of the reduction in storage space, more efficient models, feature selection guidelines, redundant data removal and outlier analysis.
- Problems faced with Multi-Dimensional Datasets
  - Data Intuition
  - Data Visualization Constraints
  - Outlier Detection
- Dimensionality Reduction Algorithms with Visualizations
  - Feature Selection using the Correlation Matrix
  - Principal Component Analysis (PCA)
  - t-distributed Stochastic Neighbour Embedding (t-SNE)
- Benefits of Dimensionality Reduction
  - Storage Space Reduction
  - Computation Time Optimization
  - Redundant Feature Removal
  - Incorrect Data Removal
Objectives: Introduce the Boston Housing Dataset, use the Correlation Matrix to apply Feature Selection on the strongly positively correlated features, and perform Simple Linear Regression over the new features.
Further reading:
- https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
- https://www.geeksforgeeks.org/ml-boston-housing-kaggle-challenge-with-linear-regression/
# We will need 3 datasets for this chapter, each of which has been documented on our github repository.
# So let us create a local copy (clone) of that repo here :)
!git clone https://github.com/khanfarhan10/DIMENSIONALITY_REDUCTION.git
# imports.py
# Firstly we will import all the necessary libraries that we will be requiring for Dataset Reductions.
import numpy as np # Mathematical Functions , Linear Algebra, Matrix Operations
import pandas as pd # Data Manipulations, Data Analysis/Storing/Preparation
import matplotlib.pyplot as plt # Simple Data Visualization , Basic Plotting Utilities
import seaborn as sns # Advanced Data Visualization, High Level Figures Interfacing
%matplotlib notebook
# Interactive Jupyter Notebook Plotting
# %matplotlib inline # This can be used as an alternative, but the plots obtained will be non-interactive in nature.
# Importing Data
from sklearn.datasets import load_boston
boston_dataset = load_boston()
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
df['MEDV'] = boston_dataset.target
df.head(10)
# The repository copy of the data was created once with: df.to_excel("Boston_Data.xlsx", index=False)
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 |
| 6 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5.0 | 311.0 | 15.2 | 395.60 | 12.43 | 22.9 |
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.90 | 19.15 | 27.1 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 | 16.5 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.10 | 18.9 |
boston_dataset.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
boston_dataset.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
print(boston_dataset.DESCR)  # print renders the embedded newlines
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
boston_dataset.filename
'/usr/local/lib/python3.6/dist-packages/sklearn/datasets/data/boston_house_prices.csv'
# Alternatively, load the dataset from the cloned repository:
df = pd.read_excel("DIMENSIONALITY_REDUCTION/data/Boston_Data.xlsx")
df.isnull().sum()
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB
df.describe()
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
plt.style.use("dark_background")
# COLORS = ["lime"]*13 + ["red"]  # unused alternative per-column colour scheme
df.hist(bins=30, figsize=(20,10), grid=False, color="crimson")  # other colors to try: "cyan", "magenta", "lime"
plt.suptitle("Boston Dataset : Frequency Distribution of Numerical Data")  # suptitle titles the whole grid, not just the last subplot
plt.savefig("Boston Dataset Frequency Distribution of Numerical Data.png", dpi=600)
correlation_matrix = df.corr().round(1)
# annot=True prints the correlation values inside each square
plt.figure(figsize=(20,10))
sns.heatmap(data=correlation_matrix,cmap="inferno", annot=True)
plt.savefig("Correlation_Data.png",dpi=600)
df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'],
dtype='object')
X = df.drop(columns=['MEDV'])
y = df['MEDV']
print('Shape of X : {} , Shape of y : {}'.format(X.shape,y.shape))
Shape of X : (506, 13) , Shape of y : (506,)
# Normalization of the Data
from sklearn.preprocessing import StandardScaler
scaler_X = StandardScaler()
X_norm = scaler_X.fit_transform(X)
scaler_y = StandardScaler()
y_norm = scaler_y.fit_transform(y.values.reshape(-1, 1))  # y is 1-D, so reshape it into a single column first
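StandardScaler standardizes each column to zero mean and unit variance, i.e. z = (x − μ)/σ computed feature-wise. A quick sanity check on the array produced above (a sketch, not part of the original pipeline):
# Sanity check: each standardized column should have mean ~0 and std ~1
print(X_norm.mean(axis=0).round(6))  # all values close to 0
print(X_norm.std(axis=0).round(6))   # all values close to 1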
"abc {} xyz {}".format(1,2)
'abc 1 xyz 2'
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print('Shape of X_train : {} , Shape of X_test : {}'.format(X_train.shape,X_test.shape))
print('Shape of y_train : {} , Shape of y_test : {}'.format(y_train.shape,y_test.shape))
Shape of X_train : (354, 13) , Shape of X_test : (152, 13)
Shape of y_train : (354,) , Shape of y_test : (152,)
from sklearn.linear_model import LinearRegression
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Note: .score() on a regressor returns the R^2 coefficient of determination, not a classification accuracy
accLR = lin_model.score(X_test, y_test)
print(accLR)
0.7112260057484874
lin_model.coef_ #theta 1-13
array([-1.33470103e-01, 3.58089136e-02, 4.95226452e-02, 3.11983512e+00,
-1.54170609e+01, 4.05719923e+00, -1.08208352e-02, -1.38599824e+00,
2.42727340e-01, -8.70223437e-03, -9.10685208e-01, 1.17941159e-02,
-5.47113313e-01])
lin_model.coef_.shape
(13,)
lin_model.intercept_ #theta 0
31.631084035691632
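The fitted model is simply the affine map MEDV ≈ θ₀ + θ₁x₁ + … + θ₁₃x₁₃, with θ₀ stored in lin_model.intercept_ and the remaining coefficients in lin_model.coef_. As a sketch, predict can be reproduced by hand with a dot product:
# Manual prediction: theta_0 + x . theta should match lin_model.predict exactly
x0 = X_test.iloc[0].values                                   # first test sample as a 1-D array
manual = lin_model.intercept_ + np.dot(x0, lin_model.coef_)
print(manual, lin_model.predict(X_test.iloc[[0]])[0])        # the two values agree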
df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'],
dtype='object')
# Coefficient & Intercept :
def merge_list_to_dict(test_keys,test_values):
"""Uses dictionary comprehension to create a dict from 2 lists"""
merged_dict = {test_keys[i]: test_values[i] for i in range(len(test_keys))}
return merged_dict
def get_params(model_intercept,model_coefficients,data_cols):
"""Returns a dataframe of organised values for model parameters output with colums of the original dataframe"""
res_theta=[model_intercept ]+ list(model_coefficients)
res_y=['INTERCEPT']+list(data_cols)
dict_res=merge_list_to_dict(res_y,res_theta)
data=[dict_res]
results=pd.DataFrame(data)
return results
res=get_params(lin_model.intercept_,lin_model.coef_,X.columns)
res.head()
|   | INTERCEPT | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31.631084 | -0.13347 | 0.035809 | 0.049523 | 3.119835 | -15.417061 | 4.057199 | -0.010821 | -1.385998 | 0.242727 | -0.008702 | -0.910685 | 0.011794 | -0.547113 |
help(get_params)
Help on function get_params in module __main__:
get_params(model_intercept, model_coefficients, data_cols)
Returns a dataframe of organised values for model parameters, with columns of the original dataframe
def tanni():
"""Meow meow meow"""
return 1
help(tanni)
Help on function tanni in module __main__:
tanni()
Meow meow meow
# Model predictions on the test set (regularized Lasso/Ridge variants are sketched further below)
preds=lin_model.predict(X_test)
preds.mean()
21.33015845218848
y_test.mean()
21.407894736842103
temp = []
for a, b in zip(preds, y_test):
    temp.append(abs(a - b))
temp = np.array(temp)/len(temp)
# Note: 1 - mean(|error|)/n is an ad-hoc score, not a standard metric; prefer the RMSE/R2/MAE evaluation below
1 - temp.mean()
0.9791926982140957
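Following up on the Lasso/Ridge note above: both regularized variants expose the same fit/score API as LinearRegression, so a minimal comparison sketch looks like this (alpha=1.0 is sklearn's default penalty strength, not a tuned value):
# Regularized linear models (sketch; tune alpha via cross-validation in practice)
from sklearn.linear_model import Ridge, Lasso

for Model in (Ridge, Lasso):
    reg = Model(alpha=1.0)
    reg.fit(X_train, y_train)
    print(Model.__name__, "test R2:", round(reg.score(X_test, y_test), 4))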
round(2.7346374,2)  # quick check of Python's built-in round(), used in getevaluation below
2.73
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def getevaluation(model, X_subset, y_subset, subset_type="Train", round_scores=None):
    """Prints RMSE, R2 score and MAE for the given model over a data subset."""
    y_subset_predict = model.predict(X_subset)
    rmse = np.sqrt(mean_squared_error(y_subset, y_subset_predict))
    r2 = r2_score(y_subset, y_subset_predict)
    mae = mean_absolute_error(y_subset, y_subset_predict)
    if round_scores is not None:
        rmse = round(rmse, round_scores)
        r2 = round(r2, round_scores)
        mae = round(mae, round_scores)
    print("Model Performance for {} subset ::\nRMSE: {} | R2 score: {} | MAE: {}".format(subset_type, rmse, r2, mae))
getevaluation(model=lin_model,X_subset=X_train,y_subset=y_train,subset_type="Train",round_scores=2)
getevaluation(model=lin_model,X_subset=X_test,y_subset=y_test,subset_type="Test",round_scores=2)
Model Performance for Train subset ::
RMSE: 4.75 | R2 score: 0.74 | MAE: 3.36
Model Performance for Test subset ::
RMSE: 4.64 | R2 score: 0.71 | MAE: 3.16
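For reference, the three scores printed above follow the standard definitions:
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2},\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert,\qquad R^2=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$
where $\hat{y}_i$ are the model predictions and $\bar{y}$ is the mean of the observed targets.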
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# The same evaluation, written out longhand; model evaluation for the training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(y_train, y_train_predict)))
r2 = r2_score(y_train, y_train_predict)
print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
r2 = r2_score(y_test, y_test_predict)
print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
The model performance for training set
--------------------------------------
RMSE is 4.748208239685937
R2 score is 0.7434997532004697
The model performance for testing set
--------------------------------------
RMSE is 4.638689926172867
R2 score is 0.7112260057484874
sns.pairplot(df);