[DOC] Add example notebook for using aeon distances with sklearn clusterers #2511
@@ -0,0 +1,274 @@
{
> X = np.vstack(
>     (
>         np.random.normal(loc=[2, 2], scale=0.5, size=(50, 2)),
>         np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
>     )
> )

SebastianSchmidl (review): Use time series data like you did previously. As mentioned, there are unique challenges when it comes to visualising time series; we are not that interested in regular clustering here.

Author reply: Thanks for the feedback. I have updated the example to use time series data instead of the randomly generated clusters. Please check.
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# **Using aeon Distances with scikit-learn Clusterers**\n", | ||
"\n", | ||
"This notebook demonstrates how to integrate aeon’s distance metrics with hierarchical, density-based, and spectral clustering methods from scikit-learn. While aeon primarily supports partition-based clustering algorithms, such as $k$-means and $k$-medoids, its robust distance measures can be leveraged to enable other clustering techniques using scikit-learn.\n", | ||
"\n", | ||
"Broadly, clustering algorithms can be categorized into partition-based, hierarchical, density-based, and spectral methods. In this notebook, we focus on using aeon’s distance metrics with:\n", | ||
"1. **Hierarchical clustering**: `AgglomerativeClustering` with `metric=\"precomputed\"`.\n", | ||
"2. **Density-based clustering**: `DBSCAN` and `OPTICS` with `metric=\"precomputed\"`.\n", | ||
"3. **Spectral clustering**: `SpectralClustering` with `affinity=\"precomputed\"` and the inverse of the distance matrix.\n", | ||
"\n", | ||
"To measure similarity between time series and enable clustering, we use aeon’s precomputed distance matrices. For details about distance metrics, see the [distance examples](../distances/distances.ipynb).\n", | ||
"\n", | ||
"## **Contents**\n", | ||
"1. **Introduction**: Overview of clustering methods and motivation for this notebook.\n", | ||
"2. **Loading Data**: Using the `load_unit_test` dataset from aeon.\n", | ||
"3. **Computing Distance Matrices with aeon**: Precomputing distance matrices with aeon’s distance metrics.\n", | ||
"\n", | ||
"4. **Hierarchical Clustering**\n", | ||
" 4.1 sklearn.cluster.AgglomerativeClustering with metric=\"precomputed\"\n", | ||
"\n", | ||
"5. **Density-Based Clustering**\n", | ||
" 5.1 sklearn.cluster.DBSCAN with metric=\"precomputed\"\n", | ||
" 5.2 sklearn.cluster.OPTICS with metric=\"precomputed\"\n", | ||
"\n", | ||
"6. **Spectral Clustering**\n", | ||
" 6.1 sklearn.cluster.SpectralClustering with affinity=\"precomputed\"\n", | ||
" 6.2 Using the Inverse of the Distance Matrix\n", | ||
"\n", | ||
"## **Introduction**\n", | ||
"\n", | ||
"While aeon primarily focuses on partition-based clustering methods, it's possible to extend its capabilities by integrating its distance metrics with scikit-learn's clustering algorithms. This approach allows us to perform hierarchical, density-based, and spectral clustering on time series data using aeon's rich set of distance measures.\n", | ||
"\n", | ||
"## **Loading Data**\n", | ||
"\n", | ||
"We'll begin by loading a sample dataset. For this demonstration, we'll use the `load_unit_test` dataset from aeon." | ||
] | ||
}, | ||
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import & load data\n",
"from aeon.datasets import load_unit_test\n",
"X_train, y_train = load_unit_test(split=\"train\")\n",
"X_test, y_test = load_unit_test(split=\"test\")\n",
"\n",
"# For simplicity, we'll work with the training data\n",
"X = X_train\n",
"y = y_train\n",
"\n",
"print(f\"Data shape: {X.shape}\")\n",
"print(f\"Labels shape: {y.shape}\")"
]
},
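{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before clustering, it can help to look at the raw series. The quick sketch below is an addition, not part of the original workflow: it plots every training series coloured by its class label, so we can see the structure the clusterers should recover."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Assign one colour per class label\n",
"classes = np.unique(y)\n",
"colors = {c: f\"C{i}\" for i, c in enumerate(classes)}\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"for xi, yi in zip(X, y):\n",
"    # each case has shape (n_channels, n_timepoints); ravel to 1D for plotting\n",
"    plt.plot(xi.ravel(), color=colors[yi], alpha=0.5)\n",
"plt.title(\"unit_test training series coloured by class label\")\n",
"plt.show()"
]
},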
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Computing Distance Matrices with aeon**\n",
"aeon provides a variety of distance measures suitable for time series data. We'll compute the distance matrix using the Dynamic Time Warping (DTW) distance as an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from aeon.distances import pairwise_distance\n",
"\n",
"# Compute the pairwise distance matrix using DTW\n",
"distance_matrix = pairwise_distance(X, metric=\"dtw\")\n",
"\n",
"print(f\"Distance matrix shape: {distance_matrix.shape}\")\n"
]
},
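{
"cell_type": "markdown",
"metadata": {},
"source": [
"DTW is only one option. Any distance implemented in aeon can be swapped in through the same call; the short sketch below is an illustrative addition that computes a Move-Split-Merge (MSM) distance matrix instead, reusing the `metric` argument from above. The rest of the workflow is unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Any other aeon distance can be substituted via the metric argument,\n",
"# e.g. Move-Split-Merge (MSM)\n",
"msm_distance_matrix = pairwise_distance(X, metric=\"msm\")\n",
"\n",
"print(f\"MSM distance matrix shape: {msm_distance_matrix.shape}\")"
]
},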
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Hierarchical Clustering**\n",
"Hierarchical clustering builds a hierarchy of clusters, either by progressively merging or by splitting existing clusters. We'll use scikit-learn's `AgglomerativeClustering` with the precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import AgglomerativeClustering\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Perform agglomerative clustering; metric=\"precomputed\" tells sklearn the\n",
"# input is a distance matrix rather than raw feature vectors\n",
"agg_clustering = AgglomerativeClustering(\n",
"    n_clusters=2, metric=\"precomputed\", linkage=\"average\"\n",
")\n",
"labels = agg_clustering.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results by plotting each cluster's mean series\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(labels):\n",
"    # ravel the (n_channels, n_timepoints) mean down to 1D for plotting\n",
"    plt.plot(X[labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"Hierarchical Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
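{
"cell_type": "markdown",
"metadata": {},
"source": [
"The merge hierarchy itself can also be inspected. The sketch below is an addition (not from the original notebook): it converts the square DTW matrix to the condensed form scipy expects and draws a dendrogram using the same average linkage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.cluster.hierarchy import dendrogram, linkage\n",
"from scipy.spatial.distance import squareform\n",
"\n",
"# scipy's linkage expects a condensed distance matrix; checks=False\n",
"# tolerates small numerical asymmetries in the DTW matrix\n",
"condensed = squareform(distance_matrix, checks=False)\n",
"linkage_matrix = linkage(condensed, method=\"average\")\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"dendrogram(linkage_matrix)\n",
"plt.title(\"Dendrogram (average linkage, DTW distance)\")\n",
"plt.show()"
]
},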
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Density-Based Clustering**\n",
"Density-based clustering identifies clusters based on the density of data points in the feature space. We'll demonstrate this using scikit-learn's `DBSCAN` and `OPTICS` algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **DBSCAN**\n",
"\n",
"DBSCAN is a density-based clustering algorithm that groups data points based on their density connectivity, labelling points in low-density regions as noise (cluster `-1`).\n",
"We use the `DBSCAN` algorithm from scikit-learn with a precomputed distance matrix. Note that `eps` is measured on the scale of the DTW distances, so it usually needs tuning; a heuristic for choosing it is sketched after the code below.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import DBSCAN\n",
"\n",
"# Perform DBSCAN clustering; eps=0.5 is illustrative and should be tuned\n",
"# to the scale of the precomputed distances\n",
"dbscan = DBSCAN(eps=0.5, min_samples=5, metric=\"precomputed\")\n",
"dbscan_labels = dbscan.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(dbscan_labels):\n",
"    if label == -1:\n",
"        # Noise points\n",
"        plt.plot(X[dbscan_labels == label].mean(axis=0).ravel(), label=\"Noise\", linestyle=\"--\")\n",
"    else:\n",
"        plt.plot(X[dbscan_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"DBSCAN Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
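{
"cell_type": "markdown",
"metadata": {},
"source": [
"With an elastic distance such as DTW, `eps=0.5` may be far from the data's scale, leaving every point labelled as noise. A common heuristic, added here as a sketch, is to look at each series' distance to its $k$-th nearest neighbour and pick `eps` near the knee of the sorted curve."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# k-distance heuristic for eps: distance from each series to its k-th\n",
"# nearest neighbour (column 0 of each sorted row is the self-distance of 0)\n",
"k = 5\n",
"kth_distances = np.sort(distance_matrix, axis=1)[:, k]\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"plt.plot(np.sort(kth_distances))\n",
"plt.title(f\"Sorted {k}-NN DTW distances (look for a knee to choose eps)\")\n",
"plt.ylabel(\"Distance\")\n",
"plt.show()\n",
"\n",
"print(f\"Median {k}-NN distance: {np.median(kth_distances):.3f}\")"
]
},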
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **OPTICS**\n",
"OPTICS is a density-based clustering algorithm similar to DBSCAN, but it handles clusters of varying density better because it orders points by reachability rather than relying on a single global `eps`.\n",
"We use the `OPTICS` algorithm from scikit-learn with a precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import OPTICS\n",
"\n",
"# Perform OPTICS clustering\n",
"optics = OPTICS(min_samples=5, metric=\"precomputed\")\n",
"optics_labels = optics.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(optics_labels):\n",
"    if label == -1:\n",
"        # Noise points\n",
"        plt.plot(X[optics_labels == label].mean(axis=0).ravel(), label=\"Noise\", linestyle=\"--\")\n",
"    else:\n",
"        plt.plot(X[optics_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"OPTICS Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
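{
"cell_type": "markdown",
"metadata": {},
"source": [
"OPTICS also exposes the reachability ordering it computes, which is often more informative than the flat labels. The sketch below is an addition that uses the fitted `reachability_` and `ordering_` attributes; valleys in the plot correspond to clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reachability plot: points are shown in the OPTICS ordering, and each\n",
"# valley of low reachability distance corresponds to a cluster\n",
"reachability = optics.reachability_[optics.ordering_]\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"plt.plot(reachability)\n",
"plt.title(\"OPTICS Reachability Plot (DTW distance)\")\n",
"plt.ylabel(\"Reachability distance\")\n",
"plt.show()"
]
},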
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Spectral Clustering**\n",
"Spectral clustering embeds the data in a low-dimensional space derived from a similarity (affinity) matrix, then clusters in that space. Because it requires similarities rather than distances, we first convert our distance matrix accordingly: here we take the element-wise inverse, so small distances become large affinities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import SpectralClustering\n",
"import numpy as np\n",
"\n",
"# The distance matrix has zeros on the diagonal (and possibly elsewhere),\n",
"# so add a small constant to avoid division by zero when inverting\n",
"epsilon = 1e-10\n",
"inverse_distance_matrix = 1 / (distance_matrix + epsilon)\n",
"\n",
"# Perform spectral clustering with affinity=\"precomputed\"\n",
"spectral = SpectralClustering(\n",
"    n_clusters=2, affinity=\"precomputed\", random_state=42\n",
")\n",
"spectral_labels = spectral.fit_predict(inverse_distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(spectral_labels):\n",
"    plt.plot(X[spectral_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"Spectral Clustering with Inverse Distance Matrix\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
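{
"cell_type": "markdown",
"metadata": {},
"source": [
"The element-wise inverse is not the only way to turn distances into affinities. A Gaussian (RBF) kernel is a common alternative; the sketch below is an addition that assumes the median pairwise distance as a rough bandwidth, and keeps affinities bounded in $(0, 1]$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative affinity: Gaussian (RBF) kernel on the DTW distances.\n",
"# sigma is a bandwidth parameter; the median pairwise distance is a\n",
"# common, if rough, default\n",
"sigma = np.median(distance_matrix)\n",
"affinity_matrix = np.exp(-(distance_matrix**2) / (2 * sigma**2))\n",
"\n",
"spectral_rbf = SpectralClustering(\n",
"    n_clusters=2, affinity=\"precomputed\", random_state=42\n",
")\n",
"rbf_labels = spectral_rbf.fit_predict(affinity_matrix)\n",
"\n",
"print(f\"Cluster sizes: {np.bincount(rbf_labels)}\")"
]
},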
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **References**\n", | ||
"\n", | ||
"[1] Christopher Holder, Matthew Middlehurst, and Anthony Bagnall. A Review and Evaluation of Elastic Distance Functions for Time Series Clustering, Knowledge and Information Systems. In Press (2023).\n", | ||
"\n", | ||
"[2] Christopher Holder, David Guijo-Rubio, and Anthony Bagnall. Barycentre averaging for the move-split-merge time series distance measure. 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (2023).\n", | ||
"\n", | ||
"[3] Kaufman, Leonard & Rousseeuw, Peter. (1986). Clustering Large Data Sets. 10.1016/B978-0-444-87877-9.50039-X.\n", | ||
"\n", | ||
"[4] R. T. Ng and Jiawei Han. \"CLARANS: a method for clustering objects spatial data mining.\" IEEE Transactions on Knowledge and Data Engineering vol. 14, no. 5, pp. 1003-1016, Sept.-Oct. 2002, doi: 10.1109/TKDE.2002.1033770.\n", | ||
"\n", | ||
"[5] Paparrizos, John, and Luis Gravano. \"Fast and Accurate Time-Series Clustering.\" ACM Transactions on Database Systems 42, no. 2 (2017): 8:1-8:49.\n", | ||
"\n", | ||
"[6] F. Petitjean, A. Ketterlin and P. Gancarski. “A global averaging method for dynamic time warping, with applications to clustering,” Pattern Recognition, vol. 44, pp. 678-693, 2011.\n", | ||
"\n", | ||
"Generated using [nbsphinx](https://nbsphinx.readthedocs.io/). The Jupyter notebook can be found here[here](sklearn_clustering_with_aeon_distances.html)." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": ".venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |