This repository was archived by the owner on Jun 20, 2023. It is now read-only.

Commit 761efb5

Merge pull request #6 from Code-Plus-CUMI/CSFelix-patch-1
🎉 Added
2 parents 6140fa7 + 8d5e203 commit 761efb5

File tree

3 files changed: +192 -0 lines changed

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
"""

*******************************
** Clustering - Introduction **
*******************************

Clustering is a Machine Learning technique that groups the data
into clusters following patterns the algorithm has learned.

These patterns are not explicit, which means that once a model
has clustered the data, it is the Data Scientist's task to figure
out what the patterns are and why they were chosen.

There are many clustering algorithms, but the two main ones are
K-Means and Hierarchical:

- K-Means Clustering: partitions the data around K centroids,
  assigning each point to its nearest centroid;

- Hierarchical Clustering: builds the clusters by successively
  merging (or splitting) groups, so there is a hierarchy /
  importance relationship between the groups.
"""
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
"""

************************
** K-Means - Clusters **
************************

K-Means defines CENTROIDS, and its goal is to find the best
position for each centroid and its territory (TESSELLATION).

-*-*-*-*-

When creating this algorithm, you have to pay attention to three
parameters:

/ n_clusters: number of Clusters (K)

/ max_iter: maximum number of iterations of each run

/ n_init: number of runs with different initial centroids; the run
  whose centroids give the least total distance between each point
  and its centroid (the optimal clustering) is kept.

-*-*-*-*-

Besides, since K-Means clustering is sensitive to scale, it can
be a good idea to RESCALE or NORMALIZE data with extreme values.
Our features are already roughly on the same scale, so we'll
leave them as-is.
"""


# ---- Importing Libraries and Preparing DataSet ----
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

df = pd.read_csv('filepath')
X = df.loc[:, ['Latitude', 'Longitude', 'MedInc']]


# ---- WCSS and Elbow Method ----
#
# They help to figure out how many clusters work best
# for the process
#
wcss = []

# testing out with 1 to 10 clusters
for i in range(1, 11):

    # n_clusters >> number of clusters to be identified
    # max_iter >> maximum number of iterations for each run (n_init)
    # n_init >> number of runs (centroid initializations)
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    max_iter=300,
                    n_init=10,
                    random_state=0)

    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# plotting the results (we often choose the number
# of clusters where the WCSS starts to level off -
# the elbow method)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
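
# An alternative sketch (not in the original file) for picking K: the
# silhouette score from sklearn.metrics, where higher values mean
# better-separated clusters; it reuses the X DataFrame and the KMeans
# import defined above.
from sklearn.metrics import silhouette_score

sil_scores = {}
for k in range(2, 11):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, init='k-means++', max_iter=300,
                    n_init=10, random_state=0).fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)

print(max(sil_scores, key=sil_scores.get))  # K with the highest score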



# ---- Using K-Means ----
#
# Consider that 4 clusters was the best amount and that we will
# repeat the K-Means run (centroid movement) 10 times (n_init)
kmeans = KMeans(n_clusters=4,
                init='k-means++',
                max_iter=300,
                n_init=10,
                random_state=0)

X['Cluster'] = kmeans.fit_predict(X)
X['Cluster'] = X['Cluster'].astype('category')
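
# A quick sketch (not in the original file): inspecting the learned centroid
# coordinates can help figure out what pattern each cluster captures; this
# assumes the kmeans object fitted above on Latitude, Longitude and MedInc.
centers = pd.DataFrame(kmeans.cluster_centers_,
                       columns=['Latitude', 'Longitude', 'MedInc'])
print(centers)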



# ---- Plotting the Results ----
sns.relplot(
    x="Longitude", y="Latitude", hue="Cluster", data=X, height=6,
);



# OBS.: Comparing the Target - box-plots show the distribution of
# the target within each cluster. If the clustering is informative,
# these distributions should, for the most part, separate across
# MedHouseVal (Target), which is indeed what we see.
X["MedHouseVal"] = df["MedHouseVal"]
sns.catplot(x="MedHouseVal", y="Cluster", data=X, kind="boxen", height=6);
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
"""

*****************************
** Hierarchical Clustering **
*****************************

Hierarchical clustering groups the data, and each cluster has a
position in a hierarchy that relates it to the other clusters.

-*-*-*-*-

When creating this algorithm, you have to pay attention to three
parameters:

/ n_clusters: number of Clusters (K)

/ linkage: how the distance between clusters is computed in the
  hierarchy ('ward', 'complete', 'average', 'single')

/ affinity: distance metric used by the algorithm (Euclidean is
  the most common).

-*-*-*-*-

Besides, since Hierarchical clustering is sensitive to scale, it can
be a good idea to RESCALE or NORMALIZE data with extreme values.
Our features are already roughly on the same scale, so we'll
leave them as-is.
"""

# ---- Importing Libraries and Preparing DataSet ----
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import matplotlib.pyplot as plt

# same features used in the K-Means script (path kept as a placeholder)
df = pd.read_csv('filepath')
df = df.loc[:, ['Latitude', 'Longitude', 'MedInc']]



# ---- Dendrograms and Linkage Visualization ----
#
# With this visualization it is possible to know how many
# clusters (n_clusters parameter) will best fit the algorithm
#
linkage_data = linkage(df, method='ward', metric='euclidean')
dendrogram(linkage_data)
plt.show()
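
# A short sketch (not in the original file) of turning the dendrogram into
# flat cluster labels with SciPy's fcluster, once a number of clusters has
# been chosen from the plot; it reuses the linkage_data computed above.
from scipy.cluster.hierarchy import fcluster

# cut the tree so that at most 2 clusters remain
scipy_labels = fcluster(linkage_data, t=2, criterion='maxclust')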

# ---- Applying Hierarchical ----
hierarchical_cluster = AgglomerativeClustering(
    n_clusters=2
    , affinity='euclidean'  # called `metric` in newer scikit-learn versions
    , linkage='ward'
)

labels = hierarchical_cluster.fit_predict(df)



# ---- Plotting the Result ----
# colouring the spatial features by cluster label
plt.scatter(df['Longitude'], df['Latitude'], c=labels)
plt.show()
