Farhan Hai Khan (a), Tannistha Pal (b)
a. Department of Electrical Engineering, Institute of Engineering & Management, Kolkata, India, khanfarhanpro@gmail.com
b. Department of Electronics and Communication Engineering, Institute of Engineering & Management, Kolkata, India, paltannistha@gmail.com
A common problem faced while handling multi-featured datasets is their high dimensionality, which creates barriers to generalized, hands-on Machine Learning. Such datasets also drastically degrade the performance of Machine Learning algorithms, being memory inefficient and frequently leading to model overfitting. It often becomes difficult to visualize the data or gain insightful knowledge about its features, such as the presence of outliers.
This chapter will help data analysts reduce data dimensionality using various methodologies such as:
- Feature Selection using the Correlation Matrix
- Principal Component Analysis (PCA)
- t-distributed Stochastic Neighbour Embedding (t-SNE)
Under applications of Dimensionality Reduction Algorithms with Visualizations, we first introduce the Boston Housing Dataset, use its Correlation Matrix to apply Feature Selection on the strongly positively correlated features, and perform Simple Linear Regression over the new features. Then we use the UCI Breast Cancer Dataset to perform PCA with Support Vector Machine (SVM) classification. Lastly, we apply t-SNE to the MNIST Handwritten Digits Dataset and use k-Nearest Neighbours (kNN) classification.
Finally, we explore the benefits of using Dimensionality Reduction Methods and provide a comprehensive overview of the reduction in storage space, more efficient models, feature selection guidelines, redundant data removal and outlier analysis.
- Problems faced with Multi-Dimensional Datasets
  - Data Intuition
  - Data Visualization Constraints
  - Outlier Detection
- Dimensionality Reduction Algorithms with Visualizations
  - Feature Selection using the Correlation Matrix
  - Principal Component Analysis (PCA)
  - t-distributed Stochastic Neighbour Embedding (t-SNE)
- Benefits of Dimensionality Reduction
  - Storage Space Reduction
  - Computation Time Optimization
  - Redundant Feature Removal
  - Incorrect Data Removal
Objectives: Introduce the Boston Housing Dataset, use the Correlation Matrix to apply Feature Selection on the strongly positively correlated features, and perform Simple Linear Regression over the new features.
Further reading:
- https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
- https://www.geeksforgeeks.org/ml-boston-housing-kaggle-challenge-with-linear-regression/
# We will need 3 datasets for this chapter, each of which has been documented on our github repository.
# So let us create a local copy (clone) of that repo here :)
!git clone https://github.com/khanfarhan10/DIMENSIONALITY_REDUCTION.git
# imports.py
# Firstly we will import all the necessary libraries that we will be requiring for Dataset Reductions.
import numpy as np # Mathematical Functions , Linear Algebra, Matrix Operations
import pandas as pd # Data Manipulations, Data Analysis/Storing/Preparation
import matplotlib.pyplot as plt # Simple Data Visualization , Basic Plotting Utilities
import seaborn as sns # Advanced Data Visualization, High Level Figures Interfacing
%matplotlib notebook
# Interactive Jupyter Notebook Plotting
# %matplotlib inline # This can be used as an alternative, but the plots obtained will be non-interactive in nature.
# Importing Data
from sklearn.datasets import load_boston
boston_dataset = load_boston()
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
df['MEDV'] = boston_dataset.target
df.head(10)
# The repository copy of the data was created once with: df.to_excel("Boston_Data.xlsx", index=False)
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 |
| 6 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5.0 | 311.0 | 15.2 | 395.60 | 12.43 | 22.9 |
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.90 | 19.15 | 27.1 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 | 16.5 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.10 | 18.9 |
boston_dataset.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
boston_dataset.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
print(boston_dataset.DESCR)  # print renders the embedded newlines
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
boston_dataset.filename
'/usr/local/lib/python3.6/dist-packages/sklearn/datasets/data/boston_house_prices.csv'
# Alternatively, load the dataset from the cloned repository:
df = pd.read_excel("DIMENSIONALITY_REDUCTION/data/Boston_Data.xlsx")
df.isnull().sum()
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB
df.describe()
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
plt.style.use("dark_background")
# COLORS = ["lime"]*13 + ["red"]  # unused alternative per-column colour scheme
df.hist(bins=30, figsize=(20,10), grid=False, color="crimson")  # other colors to try: "cyan", "magenta", "lime"
plt.suptitle("Boston Dataset : Frequency Distribution of Numerical Data")  # suptitle titles the whole grid, not just the last subplot
plt.savefig("Boston Dataset Frequency Distribution of Numerical Data.png", dpi=600)
correlation_matrix = df.corr().round(1)
# annot=True prints the correlation values inside each square
plt.figure(figsize=(20,10))
sns.heatmap(data=correlation_matrix,cmap="inferno", annot=True)
plt.savefig("Correlation_Data.png",dpi=600)
df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'],
dtype='object')
X = df.drop(columns=['MEDV'])
y = df['MEDV']
print('Shape of X : {} , Shape of y : {}'.format(X.shape,y.shape))
Shape of X : (506, 13) , Shape of y : (506,)
# Normalization of the Data
from sklearn.preprocessing import StandardScaler
scaler_X = StandardScaler()
X_norm = scaler_X.fit_transform(X)
scaler_y = StandardScaler()
y_norm = scaler_y.fit_transform(y.values.reshape(-1, 1))  # y is 1-D, so reshape it into a single column first
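StandardScaler standardizes each column to zero mean and unit variance, i.e. z = (x − μ)/σ computed feature-wise. A quick sanity check on the array produced above (a sketch, not part of the original pipeline):
# Sanity check: each standardized column should have mean ~0 and std ~1
print(X_norm.mean(axis=0).round(6))  # all values close to 0
print(X_norm.std(axis=0).round(6))   # all values close to 1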
"abc {} xyz {}".format(1,2)
'abc 1 xyz 2'
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print('Shape of X_train : {} , Shape of X_test : {}'.format(X_train.shape,X_test.shape))
print('Shape of y_train : {} , Shape of y_test : {}'.format(y_train.shape,y_test.shape))
Shape of X_train : (354, 13) , Shape of X_test : (152, 13)
Shape of y_train : (354,) , Shape of y_test : (152,)
from sklearn.linear_model import LinearRegression
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Note: .score() on a regressor returns the R^2 coefficient of determination, not a classification accuracy
accLR = lin_model.score(X_test, y_test)
print(accLR)
0.7112260057484874
lin_model.coef_ #theta 1-13
array([-1.33470103e-01, 3.58089136e-02, 4.95226452e-02, 3.11983512e+00,
-1.54170609e+01, 4.05719923e+00, -1.08208352e-02, -1.38599824e+00,
2.42727340e-01, -8.70223437e-03, -9.10685208e-01, 1.17941159e-02,
-5.47113313e-01])
lin_model.coef_.shape
(13,)
lin_model.intercept_ #theta 0
31.631084035691632
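The fitted model is simply the affine map MEDV ≈ θ₀ + θ₁x₁ + … + θ₁₃x₁₃, with θ₀ stored in lin_model.intercept_ and the remaining coefficients in lin_model.coef_. As a sketch, predict can be reproduced by hand with a dot product:
# Manual prediction: theta_0 + x . theta should match lin_model.predict exactly
x0 = X_test.iloc[0].values                                   # first test sample as a 1-D array
manual = lin_model.intercept_ + np.dot(x0, lin_model.coef_)
print(manual, lin_model.predict(X_test.iloc[[0]])[0])        # the two values agree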
df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'],
dtype='object')
# Coefficient & Intercept :
def merge_list_to_dict(test_keys,test_values):
"""Uses dictionary comprehension to create a dict from 2 lists"""
merged_dict = {test_keys[i]: test_values[i] for i in range(len(test_keys))}
return merged_dict
def get_params(model_intercept,model_coefficients,data_cols):
"""Returns a dataframe of organised values for model parameters output with colums of the original dataframe"""
res_theta=[model_intercept ]+ list(model_coefficients)
res_y=['INTERCEPT']+list(data_cols)
dict_res=merge_list_to_dict(res_y,res_theta)
data=[dict_res]
results=pd.DataFrame(data)
return results
res=get_params(lin_model.intercept_,lin_model.coef_,X.columns)
res.head()
|   | INTERCEPT | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31.631084 | -0.13347 | 0.035809 | 0.049523 | 3.119835 | -15.417061 | 4.057199 | -0.010821 | -1.385998 | 0.242727 | -0.008702 | -0.910685 | 0.011794 | -0.547113 |
help(get_params)
Help on function get_params in module __main__:
get_params(model_intercept, model_coefficients, data_cols)
Returns a dataframe of organised values for model parameters, with columns of the original dataframe
def tanni():
"""Meow meow meow"""
return 1
help(tanni)
Help on function tanni in module __main__:
tanni()
Meow meow meow
# Model predictions on the test set (regularized Lasso/Ridge variants are sketched further below)
preds=lin_model.predict(X_test)
preds.mean()
21.33015845218848
y_test.mean()
21.407894736842103
temp = []
for a, b in zip(preds, y_test):
    temp.append(abs(a - b))
temp = np.array(temp)/len(temp)
# Note: 1 - mean(|error|)/n is an ad-hoc score, not a standard metric; prefer the RMSE/R2/MAE evaluation below
1 - temp.mean()
0.9791926982140957
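Following up on the Lasso/Ridge note above: both regularized variants expose the same fit/score API as LinearRegression, so a minimal comparison sketch looks like this (alpha=1.0 is sklearn's default penalty strength, not a tuned value):
# Regularized linear models (sketch; tune alpha via cross-validation in practice)
from sklearn.linear_model import Ridge, Lasso

for Model in (Ridge, Lasso):
    reg = Model(alpha=1.0)
    reg.fit(X_train, y_train)
    print(Model.__name__, "test R2:", round(reg.score(X_test, y_test), 4))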
round(2.7346374,2)  # quick check of Python's built-in round(), used in getevaluation below
2.73
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def getevaluation(model, X_subset, y_subset, subset_type="Train", round_scores=None):
    """Prints RMSE, R2 score and MAE for the given model over a data subset."""
    y_subset_predict = model.predict(X_subset)
    rmse = np.sqrt(mean_squared_error(y_subset, y_subset_predict))
    r2 = r2_score(y_subset, y_subset_predict)
    mae = mean_absolute_error(y_subset, y_subset_predict)
    if round_scores is not None:
        rmse = round(rmse, round_scores)
        r2 = round(r2, round_scores)
        mae = round(mae, round_scores)
    print("Model Performance for {} subset ::\nRMSE: {} | R2 score: {} | MAE: {}".format(subset_type, rmse, r2, mae))
getevaluation(model=lin_model,X_subset=X_train,y_subset=y_train,subset_type="Train",round_scores=2)
getevaluation(model=lin_model,X_subset=X_test,y_subset=y_test,subset_type="Test",round_scores=2)
Model Performance for Train subset ::
RMSE: 4.75 | R2 score: 0.74 | MAE: 3.36
Model Performance for Test subset ::
RMSE: 4.64 | R2 score: 0.71 | MAE: 3.16
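For reference, the three scores printed above follow the standard definitions:
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2},\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert,\qquad R^2=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$
where $\hat{y}_i$ are the model predictions and $\bar{y}$ is the mean of the observed targets.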
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# The same evaluation, written out longhand; model evaluation for the training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(y_train, y_train_predict)))
r2 = r2_score(y_train, y_train_predict)
print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
r2 = r2_score(y_test, y_test_predict)
print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
The model performance for training set
--------------------------------------
RMSE is 4.748208239685937
R2 score is 0.7434997532004697
The model performance for testing set
--------------------------------------
RMSE is 4.638689926172867
R2 score is 0.7112260057484874
sns.pairplot(df);