
[DOC] Add example notebook for using aeon distances with sklearn clusterers #2511


Open
wants to merge 29 commits into base: main
Changes from 1 commit
29 commits
eccd899
Add example notebook for using aeon distances with sklearn clusterers
SalmanDeveloperz Jan 22, 2025
714be28
Resolved conflicts and updated notebook:
SalmanDeveloperz Jan 26, 2025
fd26c31
Simplified dataset loading using `load_unit_test(split="train")` as s…
SalmanDeveloperz Jan 26, 2025
450b0cc
Added this sentence after the introductory line:
SalmanDeveloperz Jan 26, 2025
49d8e66
Changes made:-
SalmanDeveloperz Jan 26, 2025
1ce463e
Added links to the scikit-learn documentation pages for all reference…
SalmanDeveloperz Jan 26, 2025
fb04db0
Updated distance-to-similarity conversion to normalize distances and …
SalmanDeveloperz Jan 26, 2025
0c3435b
Removed the references section as it was not cited in the notebook, p…
SalmanDeveloperz Jan 26, 2025
b033484
Added a reference to the new notebook (sklearn_clustering_with_aeon_d…
SalmanDeveloperz Jan 26, 2025
187e64f
Updated the aeon distances API reference link to a relative link for …
SalmanDeveloperz Feb 2, 2025
60a1950
Removed duplicate "Hierarchical Clustering" header to improve clarity…
SalmanDeveloperz Feb 2, 2025
83862dc
Added a reference to the new notebook in the Clustering-with-sklearn.…
SalmanDeveloperz Feb 2, 2025
f015b46
Merge branch 'main' into add-sklearn-clustering-example
SebastianSchmidl Feb 4, 2025
4e5adde
Automatic `pre-commit` fixes
SebastianSchmidl Feb 4, 2025
db35469
Added a reference to the new sklearn clustering notebook in the Clust…
SalmanDeveloperz Feb 4, 2025
d4e57a4
Fix: Corrected DTW metric in aeon pairwise_distance to resolve CI job…
SalmanDeveloperz Feb 18, 2025
b86069b
changing matric to method for dtw, trying to resolved the CI Jobs issue.
SalmanDeveloperz Feb 18, 2025
da602a4
Automatic `pre-commit` fixes
SalmanDeveloperz Feb 18, 2025
df7287d
Fix AgglomerativeClustering error by replacing 'affinity' with 'metri…
SalmanDeveloperz Feb 18, 2025
3dc8397
Merge branch 'add-sklearn-clustering-example' of https://github.com/S…
SalmanDeveloperz Feb 18, 2025
566d15d
Automatic `pre-commit` fixes
SalmanDeveloperz Feb 18, 2025
bf35035
Added sklearn clustering example image
SalmanDeveloperz Feb 19, 2025
ed052ee
Added sklearn clustering example image
SalmanDeveloperz Feb 19, 2025
fb45d29
Automatic `pre-commit` fixes
SalmanDeveloperz Feb 19, 2025
145d64f
Display notebook outputs and improve clustering accuracy
SalmanDeveloperz Mar 9, 2025
6cb7d79
Merge branch 'add-sklearn-clustering-example' of https://github.com/S…
SalmanDeveloperz Mar 9, 2025
68860e6
resolve conflict
SalmanDeveloperz Mar 9, 2025
50ca0e8
git commit -m "Fix IndentationError in sklearn_clustering_with_aeon_d…
SalmanDeveloperz Mar 9, 2025
bc8d54b
Re-applied Spectral clustering on time series data using Aeon
SalmanDeveloperz Apr 11, 2025
274 changes: 274 additions & 0 deletions examples/clustering/sklearn_clustering_with_aeon_distances.ipynb
@@ -0,0 +1,274 @@
{
Member

@MatthewMiddlehurst MatthewMiddlehurst Mar 15, 2025
X = np.vstack(
   (
       np.random.normal(loc=[2, 2], scale=0.5, size=(50, 2)),
       np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
   )
)

Use time series data like you did previously. As mentioned there are unique challenges when it comes to visualising time series, we are not that interested in regular clustering here.


Author

Thanks for the feedback. I have updated the example to use time series data instead of the randomly generated clusters. Please check.

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Using aeon Distances with scikit-learn Clusterers**\n",
"\n",
"This notebook demonstrates how to integrate aeon’s distance metrics with hierarchical, density-based, and spectral clustering methods from scikit-learn. While aeon primarily supports partition-based clustering algorithms, such as $k$-means and $k$-medoids, its robust distance measures can be leveraged to enable other clustering techniques using scikit-learn.\n",
"\n",
"Broadly, clustering algorithms can be categorized into partition-based, hierarchical, density-based, and spectral methods. In this notebook, we focus on using aeon’s distance metrics with:\n",
"1. **Hierarchical clustering**: `AgglomerativeClustering` with `metric=\"precomputed\"`.\n",
"2. **Density-based clustering**: `DBSCAN` and `OPTICS` with `metric=\"precomputed\"`.\n",
"3. **Spectral clustering**: `SpectralClustering` with `affinity=\"precomputed\"` and the inverse of the distance matrix.\n",
"\n",
"To measure similarity between time series and enable clustering, we use aeon’s precomputed distance matrices. For details about distance metrics, see the [distance examples](../distances/distances.ipynb).\n",
"\n",
"## **Contents**\n",
"1. **Introduction**: Overview of clustering methods and motivation for this notebook.\n",
"2. **Loading Data**: Using the `load_unit_test` dataset from aeon.\n",
"3. **Computing Distance Matrices with aeon**: Precomputing distance matrices with aeon’s distance metrics.\n",
"\n",
"4. **Hierarchical Clustering**\n",
" 4.1 sklearn.cluster.AgglomerativeClustering with metric=\"precomputed\"\n",
"\n",
"5. **Density-Based Clustering**\n",
" 5.1 sklearn.cluster.DBSCAN with metric=\"precomputed\"\n",
" 5.2 sklearn.cluster.OPTICS with metric=\"precomputed\"\n",
"\n",
"6. **Spectral Clustering**\n",
" 6.1 sklearn.cluster.SpectralClustering with affinity=\"precomputed\"\n",
" 6.2 Using the Inverse of the Distance Matrix\n",
"\n",
"## **Introduction**\n",
"\n",
"While aeon primarily focuses on partition-based clustering methods, it's possible to extend its capabilities by integrating its distance metrics with scikit-learn's clustering algorithms. This approach allows us to perform hierarchical, density-based, and spectral clustering on time series data using aeon's rich set of distance measures.\n",
"\n",
"## **Loading Data**\n",
"\n",
"We'll begin by loading a sample dataset. For this demonstration, we'll use the `load_unit_test` dataset from aeon."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import & load data\n",
"from aeon.datasets import load_unit_test\n",
"X_train, y_train = load_unit_test(split=\"train\")\n",
"X_test, y_test = load_unit_test(split=\"test\")\n",
"\n",
"# For simplicity, we'll work with the training data\n",
"X = X_train\n",
"y = y_train\n",
"\n",
"print(f\"Data shape: {X.shape}\")\n",
"print(f\"Labels shape: {y.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Computing Distance Matrices with aeon**\n",
"aeon provides a variety of distance measures suitable for time series data. We'll compute the distance matrix using the Dynamic Time Warping (DTW) distance as an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from aeon.distances import pairwise_distance\n",
"\n",
"# Compute the pairwise distance matrix using DTW\n",
"distance_matrix = pairwise_distance(X, method=\"dtw\")\n",
"\n",
"print(f\"Distance matrix shape: {distance_matrix.shape}\")\n"
]
},
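For intuition about what `pairwise_distance` computes here, the following is a minimal, unoptimized NumPy sketch of the classic DTW dynamic-programming recurrence. This is illustrative only: aeon's compiled implementation additionally supports windowing and other options, and the toy series below are not the notebook's data.

```python
import numpy as np


def dtw_distance(a, b):
    """Full-window DTW with squared pointwise cost (illustrative only)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Each cell extends the cheapest of the three admissible warping moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


# A pairwise distance matrix is then just this distance over all series pairs
series = [np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
D = np.array([[dtw_distance(a, b) for b in series] for a in series])
```

Note that DTW is not a true metric (it can violate the triangle inequality), which is worth bearing in mind when passing DTW matrices to algorithms that implicitly assume metric distances.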
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Hierarchical Clustering**\n",
"Hierarchical clustering builds a hierarchy of clusters either by progressively merging or splitting existing clusters. We'll use scikit-learn's AgglomerativeClustering with the precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import AgglomerativeClustering\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Perform Agglomerative Clustering\n",
"agg_clustering = AgglomerativeClustering(\n",
" n_clusters=2, metric=\"precomputed\", linkage=\"average\"\n",
")\n",
"labels = agg_clustering.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(labels):\n",
" plt.plot(X[labels == label].mean(axis=0), label=f\"Cluster {label}\")\n",
"plt.title(\"Hierarchical Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Density-Based Clustering**\n",
"Density-based clustering identifies clusters based on the density of data points in the feature space. We'll demonstrate this using scikit-learn's `DBSCAN` and `OPTICS` algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **DBSCAN**\n",
"\n",
"DBSCAN is a density-based clustering algorithm that groups data points based on their density connectivity. \n",
"We use the `DBSCAN` algorithm from scikit-learn with a precomputed distance matrix.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import DBSCAN\n",
"\n",
"# Perform DBSCAN clustering\n",
"dbscan = DBSCAN(eps=0.5, min_samples=5, metric=\"precomputed\")\n",
"dbscan_labels = dbscan.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(dbscan_labels):\n",
" if label == -1:\n",
" # Noise points\n",
" plt.plot(X[dbscan_labels == label].mean(axis=0), label=\"Noise\", linestyle=\"--\")\n",
" else:\n",
" plt.plot(X[dbscan_labels == label].mean(axis=0), label=f\"Cluster {label}\")\n",
"plt.title(\"DBSCAN Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
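With a precomputed matrix, `eps` is expressed in the units of the distance matrix itself, so it must be chosen on the scale of the (e.g. DTW) distances used. A hedged, self-contained sketch with toy 1-D points (not the notebook's data) illustrates this, including the `-1` noise label:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Toy points: two tight groups plus one isolated outlier
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [50.0]])
D = pairwise_distances(pts)  # stand-in for any symmetric distance matrix

# eps bounds the neighbourhood radius in distance-matrix units
db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
labels = db.fit_predict(D)
# The isolated point has no neighbours within eps and is labelled -1 (noise)
```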
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **OPTICS**\n",
"OPTICS is a density-based clustering algorithm similar to DBSCAN but provides better handling of varying \n",
"densities. We use the `OPTICS` algorithm from scikit-learn with a precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import OPTICS\n",
"\n",
"# Perform OPTICS clustering\n",
"optics = OPTICS(min_samples=5, metric=\"precomputed\")\n",
"optics_labels = optics.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(optics_labels):\n",
" if label == -1:\n",
" # Noise points\n",
" plt.plot(X[optics_labels == label].mean(axis=0), label=\"Noise\", linestyle=\"--\")\n",
" else:\n",
" plt.plot(X[optics_labels == label].mean(axis=0), label=f\"Cluster {label}\")\n",
"plt.title(\"OPTICS Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
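OPTICS first computes a reachability ordering and then extracts clusters from it; with `cluster_method="dbscan"` the extraction is equivalent to running DBSCAN at a fixed `eps`, while the ordering itself supports re-extraction at other thresholds. A minimal sketch on toy data (not the notebook's dataset) shows the precomputed-matrix usage:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import pairwise_distances

# Toy points forming two well-separated tight groups
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
D = pairwise_distances(pts)  # stand-in for a precomputed time series distance matrix

# DBSCAN-equivalent cluster extraction at eps=0.5 from the reachability ordering
opt = OPTICS(min_samples=2, metric="precomputed", cluster_method="dbscan", eps=0.5)
labels = opt.fit_predict(D)
```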
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Spectral Clustering**\n",
"Spectral clustering embeds the data in a low-dimensional space derived from a similarity matrix and then clusters in that space. Because it expects similarities rather than distances, we'll convert our distance matrix accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import SpectralClustering\n",
"import numpy as np\n",
"\n",
"# The distance matrix has zeros on the diagonal (and possibly elsewhere),\n",
"# so add a small constant before inverting to avoid division by zero\n",
"epsilon = 1e-10\n",
"inverse_distance_matrix = 1 / (distance_matrix + epsilon)\n",
"\n",
"# Perform Spectral Clustering with affinity=\"precomputed\"\n",
"spectral = SpectralClustering(\n",
" n_clusters=2, affinity=\"precomputed\", random_state=42\n",
")\n",
"spectral_labels = spectral.fit_predict(inverse_distance_matrix)\n",
"\n",
"# Visualising the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(spectral_labels):\n",
" plt.plot(X[spectral_labels == label].mean(axis=0), label=f\"Cluster {label}\")\n",
"plt.title(\"Spectral Clustering with Inverse Distance Matrix\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
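Besides inverting the distance matrix as above, a common alternative conversion is a Gaussian (RBF) kernel, which maps distance 0 to similarity 1 and decays smoothly. A sketch on toy data, with the bandwidth `sigma` set by a simple heuristic rather than a tuned value (assumed names and values, not from the notebook):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances

# Toy points: two well-separated tight groups
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
D = pairwise_distances(pts)  # stand-in for any precomputed distance matrix

# Gaussian kernel: similarity 1 at distance 0, smoothly decaying with distance;
# sigma taken from the mean off-diagonal distance (a heuristic)
sigma = np.mean(D[D > 0])
S = np.exp(-(D**2) / (2 * sigma**2))

spectral = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=42)
labels = spectral.fit_predict(S)
```

Because every entry of `S` is strictly positive, the similarity graph is fully connected, which avoids the disconnected-graph warnings spectral clustering can otherwise emit.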
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generated using [nbsphinx](https://nbsphinx.readthedocs.io/). The Jupyter notebook can be found [here](sklearn_clustering_with_aeon_distances.html)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}