[DOC] Add example notebook for using aeon distances with sklearn clusterers #2511
**Review thread (ReviewNB, resolved):**

> ```python
> X = np.vstack(
>     (
>         np.random.normal(loc=[2, 2], scale=0.5, size=(50, 2)),
>         np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
>     )
> )
> ```
>
> Use time series data like you did previously. As mentioned, there are unique challenges when it comes to visualising time series; we are not that interested in regular clustering here.

**Reply:** Thanks for the feedback. I have updated the example to use time series data instead of the randomly generated clusters. Please check.

@@ -0,0 +1,249 @@
{
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# **Using aeon Distances with scikit-learn Clusterers**\n", | ||
"\n", | ||
"This notebook demonstrates how to integrate aeon’s distance metrics with hierarchical, density-based, and spectral clustering methods from scikit-learn. While aeon primarily supports partition-based clustering algorithms, such as $k$-means and $k$-medoids, its robust distance measures can be leveraged to enable other clustering techniques using scikit-learn.\n", | ||
"\n", | ||
"To measure similarity between time series and enable clustering, we use aeon’s precomputed distance matrices. For details about distance metrics, see the [distance examples](../distances/distances.ipynb).\n", | ||
"\n", | ||
"## **Contents**\n", | ||
"1. **Example Dataset**: Using the `load_unit_test` dataset from aeon.\n", | ||
"2. **Computing Distance Matrices with aeon**: Precomputing distance matrices with aeon’s distance metrics.\n", | ||
"3. **Hierarchical Clustering**\n", | ||
"4. **Density-Based Clustering**\n", | ||
"5. **Spectral Clustering**\n", | ||
"\n", | ||
"## **Example Dataset**\n", | ||
"\n", | ||
"We'll begin by loading a sample dataset. For this demonstration, we'll use the `load_unit_test` dataset from aeon.\n" | ||
] | ||
}, | ||
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import & load data\n",
"from aeon.datasets import load_unit_test\n",
"\n",
"X, y = load_unit_test(split=\"train\")\n",
"\n",
"print(f\"Data shape: {X.shape}\")\n",
"print(f\"Labels shape: {y.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Computing Distance Matrices with aeon**\n",
"\n",
"aeon provides a variety of distance measures suitable for time series data. We'll compute the distance matrix using the Dynamic Time Warping (DTW) distance as an example.\n",
"\n",
"For a comprehensive overview of all available distance metrics in aeon, see the [aeon distances API reference](../api_reference/distances.html).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from aeon.distances import pairwise_distance\n",
"\n",
"# Compute the pairwise distance matrix using DTW\n",
"distance_matrix = pairwise_distance(X, method=\"dtw\")\n",
"\n",
"print(f\"Distance matrix shape: {distance_matrix.shape}\")"
]
},
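{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, optional sanity check: scikit-learn estimators with `metric=\"precomputed\"` expect a square, symmetric matrix with zero self-distances on the diagonal. We can verify that the DTW matrix satisfies this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# A precomputed distance matrix should be square and symmetric,\n",
"# with zeros on the diagonal (each series has distance 0 to itself).\n",
"print(f\"Square: {distance_matrix.shape[0] == distance_matrix.shape[1]}\")\n",
"print(f\"Symmetric: {np.allclose(distance_matrix, distance_matrix.T)}\")\n",
"print(f\"Zero diagonal: {np.allclose(np.diag(distance_matrix), 0)}\")"
]
},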
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Hierarchical Clustering**\n",
"\n",
"Hierarchical clustering builds a hierarchy of clusters by progressively merging or splitting them. [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) is, as the name suggests, the agglomerative variant: it merges clusters bottom-up. We use it with the precomputed DTW distance matrix.\n",
"\n",
"Not all linkage methods can be used with a precomputed distance matrix: `ward` operates on raw feature vectors and requires Euclidean distances. The following linkage methods work with aeon distances:\n",
"- `single`\n",
"- `complete`\n",
"- `average`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"\n",
"# Perform Agglomerative Clustering on the precomputed distances\n",
"agg_clustering = AgglomerativeClustering(\n",
"    n_clusters=2, metric=\"precomputed\", linkage=\"average\"\n",
")\n",
"labels = agg_clustering.fit_predict(distance_matrix)\n",
"\n",
"# Visualise each cluster by its mean series. The data is univariate,\n",
"# so we drop the channel axis with squeeze() before plotting.\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(labels):\n",
"    plt.plot(np.mean(X[labels == label], axis=0).squeeze(), label=f\"Cluster {label}\")\n",
"plt.title(\"Hierarchical Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()"
]
},
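{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flat two-cluster labelling hides the full merge hierarchy. As an optional sketch, SciPy's `dendrogram` can visualise it: `scipy.cluster.hierarchy.linkage` accepts the same precomputed distances in condensed form (via `squareform`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.cluster.hierarchy import dendrogram, linkage\n",
"from scipy.spatial.distance import squareform\n",
"\n",
"# Convert the square DTW matrix to condensed form for SciPy\n",
"condensed = squareform(distance_matrix, checks=False)\n",
"Z = linkage(condensed, method=\"average\")\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"dendrogram(Z)\n",
"plt.title(\"Dendrogram (average linkage, DTW distance)\")\n",
"plt.xlabel(\"Time series index\")\n",
"plt.ylabel(\"DTW distance\")\n",
"plt.show()"
]
},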
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Density-Based Clustering**\n",
"Density-based clustering identifies clusters based on the density of data points in the feature space. We'll demonstrate this using scikit-learn's `DBSCAN` and `OPTICS` algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **DBSCAN**\n",
"\n",
"[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) is a density-based clustering algorithm that groups data points based on their density connectivity. \n",
"We use the `DBSCAN` algorithm from scikit-learn with a precomputed distance matrix.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import DBSCAN\n",
"\n",
"# Perform DBSCAN clustering. Note that eps must be on the same scale\n",
"# as the DTW distances; see the k-distance plot below for a heuristic.\n",
"dbscan = DBSCAN(eps=0.5, min_samples=5, metric=\"precomputed\")\n",
"dbscan_labels = dbscan.fit_predict(distance_matrix)\n",
"\n",
"# Visualise the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(dbscan_labels):\n",
"    if label == -1:\n",
"        # Noise points\n",
"        plt.plot(X[dbscan_labels == label].mean(axis=0).squeeze(), label=\"Noise\", linestyle=\"--\")\n",
"    else:\n",
"        plt.plot(X[dbscan_labels == label].mean(axis=0).squeeze(), label=f\"Cluster {label}\")\n",
"plt.title(\"DBSCAN Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()"
]
},
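{
"cell_type": "markdown",
"metadata": {},
"source": [
"With raw DTW distances, `eps=0.5` may be far too small, in which case every series is labelled as noise: `eps` has to match the scale of the distances. One common heuristic, sketched below, is the $k$-distance plot with $k$ = `min_samples`: sort each series' distance to its $k$-th nearest neighbour and look for a knee in the curve."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# k-distance plot: distance of each series to its k-th nearest neighbour,\n",
"# sorted ascending. Column 0 of the sorted rows is the self-distance (0),\n",
"# so column k is the k-th nearest other series.\n",
"k = 5  # matches min_samples above\n",
"kth_distances = np.sort(np.sort(distance_matrix, axis=1)[:, k])\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"plt.plot(kth_distances)\n",
"plt.title(f\"{k}-distance plot for choosing eps\")\n",
"plt.xlabel(\"Series (sorted)\")\n",
"plt.ylabel(f\"DTW distance to {k}-th nearest neighbour\")\n",
"plt.show()"
]
},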
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **OPTICS**\n",
"[OPTICS](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) is a density-based clustering algorithm similar to DBSCAN, but it handles clusters of varying \n",
"density better. We use the `OPTICS` algorithm from scikit-learn with a precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import OPTICS\n",
"\n",
"# Perform OPTICS clustering\n",
"optics = OPTICS(min_samples=5, metric=\"precomputed\")\n",
"optics_labels = optics.fit_predict(distance_matrix)\n",
"\n",
"# Visualise the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(optics_labels):\n",
"    if label == -1:\n",
"        # Noise points\n",
"        plt.plot(X[optics_labels == label].mean(axis=0).squeeze(), label=\"Noise\", linestyle=\"--\")\n",
"    else:\n",
"        plt.plot(X[optics_labels == label].mean(axis=0).squeeze(), label=f\"Cluster {label}\")\n",
"plt.title(\"OPTICS Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()"
]
},
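{
"cell_type": "markdown",
"metadata": {},
"source": [
"OPTICS exposes the computed cluster ordering and reachability distances as fitted attributes (`ordering_` and `reachability_`), so we can also draw a reachability plot; valleys in the curve correspond to clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reachability plot: reachability distance of each series, in cluster order.\n",
"# The first value is inf by definition and is simply not drawn.\n",
"reachability = optics.reachability_[optics.ordering_]\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"plt.plot(reachability)\n",
"plt.title(\"OPTICS reachability plot (DTW distance)\")\n",
"plt.xlabel(\"Series in cluster order\")\n",
"plt.ylabel(\"Reachability distance\")\n",
"plt.show()"
]
},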
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Spectral Clustering**\n",
"[SpectralClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html) embeds the data in a low-dimensional space derived from the spectrum of an affinity matrix, then clusters in that space. It requires similarities rather than distances, so we'll convert our distance matrix accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import SpectralClustering\n",
"\n",
"# Convert distances to similarities: normalise to [0, 1] and invert, so\n",
"# identical series get similarity 1 and the most distant pair gets 0.\n",
"similarity_matrix = 1 - (distance_matrix / distance_matrix.max())\n",
"\n",
"# Perform Spectral Clustering with affinity=\"precomputed\"\n",
"spectral = SpectralClustering(n_clusters=2, affinity=\"precomputed\", random_state=42)\n",
"spectral_labels = spectral.fit_predict(similarity_matrix)\n",
"\n",
"# Visualise the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(spectral_labels):\n",
"    plt.plot(X[spectral_labels == label].mean(axis=0).squeeze(), label=f\"Cluster {label}\")\n",
"plt.title(\"Spectral Clustering with Normalized Similarity Matrix\")\n",
"plt.legend()\n",
"plt.show()"
]
},
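{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, since `load_unit_test` ships with ground-truth labels, we can compare the four clusterings against them. The Adjusted Rand Index (ARI) is one possible external measure: it is 1 for a perfect match and close to 0 for a random labelling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"# Compare each clustering against the ground-truth labels y\n",
"for name, pred in [\n",
"    (\"Agglomerative\", labels),\n",
"    (\"DBSCAN\", dbscan_labels),\n",
"    (\"OPTICS\", optics_labels),\n",
"    (\"Spectral\", spectral_labels),\n",
"]:\n",
"    print(f\"{name}: ARI = {adjusted_rand_score(y, pred):.3f}\")"
]
},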
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}