
[DOC] Add example notebook for using aeon distances with sklearn clusterers #2511

Open
wants to merge 29 commits into base: main
Commits (29)
eccd899
Add example notebook for using aeon distances with sklearn clusterers
SalmanDeveloperz Jan 22, 2025
714be28
Resolved conflicts and updated notebook:
SalmanDeveloperz Jan 26, 2025
fd26c31
Simplified dataset loading using `load_unit_test(split="train")` as s…
SalmanDeveloperz Jan 26, 2025
450b0cc
Added this sentence after the introductory line:
SalmanDeveloperz Jan 26, 2025
49d8e66
Changes made:-
SalmanDeveloperz Jan 26, 2025
1ce463e
Added links to the scikit-learn documentation pages for all reference…
SalmanDeveloperz Jan 26, 2025
fb04db0
Updated distance-to-similarity conversion to normalize distances and …
SalmanDeveloperz Jan 26, 2025
0c3435b
Removed the references section as it was not cited in the notebook, p…
SalmanDeveloperz Jan 26, 2025
b033484
Added a reference to the new notebook (sklearn_clustering_with_aeon_d…
SalmanDeveloperz Jan 26, 2025
187e64f
Updated the aeon distances API reference link to a relative link for …
SalmanDeveloperz Feb 2, 2025
60a1950
Removed duplicate "Hierarchical Clustering" header to improve clarity…
SalmanDeveloperz Feb 2, 2025
83862dc
Added a reference to the new notebook in the Clustering-with-sklearn.…
SalmanDeveloperz Feb 2, 2025
f015b46
Merge branch 'main' into add-sklearn-clustering-example
SebastianSchmidl Feb 4, 2025
4e5adde
Automatic `pre-commit` fixes
SebastianSchmidl Feb 4, 2025
db35469
Added a reference to the new sklearn clustering notebook in the Clust…
SalmanDeveloperz Feb 4, 2025
d4e57a4
Fix: Corrected DTW metric in aeon pairwise_distance to resolve CI job…
SalmanDeveloperz Feb 18, 2025
b86069b
changing matric to method for dtw, trying to resolved the CI Jobs issue.
SalmanDeveloperz Feb 18, 2025
da602a4
Automatic `pre-commit` fixes
SalmanDeveloperz Feb 18, 2025
df7287d
Fix AgglomerativeClustering error by replacing 'affinity' with 'metri…
SalmanDeveloperz Feb 18, 2025
3dc8397
Merge branch 'add-sklearn-clustering-example' of https://github.com/S…
SalmanDeveloperz Feb 18, 2025
566d15d
Automatic `pre-commit` fixes
SalmanDeveloperz Feb 18, 2025
bf35035
Added sklearn clustering example image
SalmanDeveloperz Feb 19, 2025
ed052ee
Added sklearn clustering example image
SalmanDeveloperz Feb 19, 2025
fb45d29
Automatic `pre-commit` fixes
SalmanDeveloperz Feb 19, 2025
145d64f
Display notebook outputs and improve clustering accuracy
SalmanDeveloperz Mar 9, 2025
6cb7d79
Merge branch 'add-sklearn-clustering-example' of https://github.com/S…
SalmanDeveloperz Mar 9, 2025
68860e6
resolve conflict
SalmanDeveloperz Mar 9, 2025
50ca0e8
git commit -m "Fix IndentationError in sklearn_clustering_with_aeon_d…
SalmanDeveloperz Mar 9, 2025
bc8d54b
Re-applied Spectral clustering on time series data using Aeon
SalmanDeveloperz Apr 11, 2025
11 changes: 11 additions & 0 deletions docs/examples.md
Expand Up @@ -165,6 +165,17 @@ Partitional TSCL

:::

:::{grid-item-card}
:img-top: examples/clustering/img/sklearn_clustering.png
:class-img-top: aeon-card-image-m
:link: /examples/clustering/sklearn_clustering_with_aeon_distances.ipynb
:link-type: ref
:text-align: center

Using aeon Distances with sklearn Clusterers

:::

::::

## Transformation
Expand Down
26 changes: 14 additions & 12 deletions examples/clustering/clustering.ipynb
Expand Up @@ -2,6 +2,9 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# Time Series Clustering\n",
"\n",
Expand All @@ -23,13 +26,13 @@
"erative [16], Feature K-means [17], Feature K-medoids [17], U-shapelets [18],\n",
"USSL [19], RSFS [20], NDFS [21], Deep learning and dimensionality reduction\n",
"approaches see [22]"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Clustering notebooks\n",
"\n",
Expand All @@ -41,6 +44,8 @@
"these can be used in conjunction with `aeon` elastic distances. See the [sklearn and\n",
"aeon distances](../distances/sklearn_distances.ipynb) notebook.\n",
"\n",
"- For more detailed examples of using `aeon` distances with sklearn clusterers, refer to the [sklearn clustering with aeon distances](sklearn_clustering_with_aeon_distances.ipynb) notebook.\n",
"\n",
"- Deep learning based TSCL is a very popular topic, and we are working on bringing\n",
"deep learning functionality to `aeon`, first algorithms for [Deep learning] are\n",
"COMING SOON\n",
Expand All @@ -55,13 +60,13 @@
"\n",
"<img src=\"img/clst_cd.png\" width=\"600\" alt=\"cd_diag\">\n",
"\n"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## References\n",
"\n",
Expand Down Expand Up @@ -141,10 +146,7 @@
"[22] B. Lafabregue, J. Weber, P. Gancarski, and G. Forestier. End-to-end deep\n",
"representation learning for time series clustering: a comparative study. Data Mining\n",
"and Knowledge Discovery, 36:29—-81, 2022\n"
],
"metadata": {
"collapsed": false
}
]
}
],
"metadata": {
Expand Down
249 changes: 249 additions & 0 deletions examples/clustering/sklearn_clustering_with_aeon_distances.ipynb
@@ -0,0 +1,249 @@
{
@MatthewMiddlehurst (Member) commented on Mar 15, 2025:

X = np.vstack(
   (
       np.random.normal(loc=[2, 2], scale=0.5, size=(50, 2)),
       np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
   )
)

Use time series data like you did previously. As mentioned there are unique challenges when it comes to visualising time series, we are not that interested in regular clustering here.



The PR author replied:

Thanks for the feedback. I have updated the example to use time series data instead of the randomly generated clusters. Please check.
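The reviewer's point (cluster time series with an elastic distance rather than random 2-D blobs) can be sketched with a toy DTW on synthetic series. Everything below is an editor's illustration in plain NumPy, not code from the PR: the `dtw` helper and the sine/noise classes are made up for the example.

```python
import numpy as np


def dtw(a, b):
    """Minimal O(n*m) dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])


rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 30)
# Two classes of series: slightly phase-shifted sines vs. flat noise
X = np.array(
    [np.sin(t + rng.normal(0, 0.1)) for _ in range(5)]
    + [rng.normal(0, 0.1, t.size) for _ in range(5)]
)

D = np.array([[dtw(x, y) for y in X] for x in X])  # pairwise DTW matrix
within = D[:5, :5].mean()
between = D[:5, 5:].mean()
print(within < between)  # within-class DTW distances are smaller
```

A precomputed matrix like `D` is exactly what the sklearn clusterers in the notebook below consume via `metric="precomputed"`.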

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Using aeon Distances with scikit-learn Clusterers**\n",
"\n",
"This notebook demonstrates how to use aeon distances with hierarchical, density-based, and spectral clustering methods from scikit-learn. While aeon itself focuses on partition-based clustering algorithms such as $k$-means and $k$-medoids, its distance functions can drive any scikit-learn clusterer that accepts a precomputed distance or affinity matrix.\n",
"\n",
"To measure similarity between time series and enable clustering, we use aeon’s precomputed distance matrices. For details about distance metrics, see the [distance examples](../distances/distances.ipynb).\n",
"\n",
"## **Contents**\n",
"1. **Example Dataset**: Using the `load_unit_test` dataset from aeon.\n",
"2. **Computing Distance Matrices with aeon**: Precomputing distance matrices with aeon’s distance metrics.\n",
"3. **Hierarchical Clustering**\n",
"4. **Density-Based Clustering**\n",
"5. **Spectral Clustering**\n",
"\n",
"## **Example Dataset**\n",
"\n",
"We'll begin by loading a sample dataset. For this demonstration, we'll use the `load_unit_test` dataset from aeon.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import & load data\n",
"from aeon.datasets import load_unit_test\n",
"\n",
"X, y = load_unit_test(split=\"train\")\n",
"\n",
"print(f\"Data shape: {X.shape}\")\n",
"print(f\"Labels shape: {y.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Computing Distance Matrices with aeon**\n",
"\n",
"`aeon` provides a variety of distance measures suitable for time series data. As an example, we'll compute the pairwise distance matrix using Dynamic Time Warping (DTW).\n",
"\n",
"For a comprehensive overview of all available distance metrics in aeon, see the [aeon distances API reference](../api_reference/distances.html).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from aeon.distances import pairwise_distance\n",
"\n",
"# Compute the pairwise distance matrix using DTW\n",
"distance_matrix = pairwise_distance(X, method=\"dtw\")\n",
"\n",
"print(f\"Distance matrix shape: {distance_matrix.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Hierarchical Clustering**\n",
"\n",
"[AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) is, as the name suggests, an agglomerative approach that works by merging clusters bottom-up. \n",
" \n",
"\n",
"Hierarchical clustering builds a hierarchy of clusters either by progressively merging or splitting existing clusters. We'll use scikit-learn's AgglomerativeClustering with the precomputed distance matrix.\n",
"\n",
"Not all linkage methods can be used with a precomputed distance matrix. The following linkage methods work with aeon distances:\n",
"- `single`\n",
"- `complete`\n",
"- `average`\n",
"- `weighted`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"\n",
"# Perform Agglomerative Clustering\n",
"agg_clustering = AgglomerativeClustering(\n",
" n_clusters=2, metric=\"precomputed\", linkage=\"average\"\n",
")\n",
"labels = agg_clustering.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(labels):\n",
"    plt.plot(\n",
"        np.mean(X[labels == label], axis=0).ravel(), label=f\"Cluster {label}\"\n",
"    )  # flatten the (1, n_timepoints) mean series for plotting\n",
"plt.title(\"Hierarchical Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()"
]
},
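To make the bottom-up merging concrete, here is a minimal single-linkage agglomerative clusterer that operates directly on a precomputed distance matrix. This is an editor's sketch in plain NumPy, not aeon or scikit-learn code; the blob data is illustrative.

```python
import numpy as np


def single_linkage(D, n_clusters):
    """Toy agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest (single linkage)."""
    n = D.shape[0]
    clusters = [{i} for i in range(n)]
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(D[a, b] for a in clusters[i] for b in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        for m in members:
            labels[m] = k
    return labels


# Two well-separated 1-D "blobs" and their absolute-difference distance matrix
x = np.concatenate([np.linspace(0.0, 0.2, 5), np.linspace(5.0, 5.2, 5)])
D = np.abs(x[:, None] - x[None, :])
labels = single_linkage(D, n_clusters=2)
print(labels[:5], labels[5:])  # first five share one label, last five the other
```

sklearn's `AgglomerativeClustering(metric="precomputed")` used above does the same kind of merging, with optimized linkage updates instead of this brute-force search.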
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Density-Based Clustering**\n",
"Density-based clustering identifies clusters based on the density of data points in the feature space. We'll demonstrate this using scikit-learn's `DBSCAN` and `OPTICS` algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **DBSCAN**\n",
"\n",
"[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) is a density-based clustering algorithm that groups data points based on their density connectivity. \n",
"We use the `DBSCAN` algorithm from scikit-learn with a precomputed distance matrix.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import DBSCAN\n",
"\n",
"# Perform DBSCAN clustering\n",
"dbscan = DBSCAN(eps=0.5, min_samples=5, metric=\"precomputed\")\n",
"dbscan_labels = dbscan.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(dbscan_labels):\n",
" if label == -1:\n",
" # Noise points\n",
" plt.plot(X[dbscan_labels == label].mean(axis=0).ravel(), label=\"Noise\", linestyle=\"--\")\n",
" else:\n",
" plt.plot(X[dbscan_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"DBSCAN Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **OPTICS**\n",
"[OPTICS](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) is a density-based clustering algorithm similar to DBSCAN, but it handles clusters of varying density better. We use the `OPTICS` algorithm from scikit-learn with a precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import OPTICS\n",
"\n",
"# Perform OPTICS clustering\n",
"optics = OPTICS(min_samples=5, metric=\"precomputed\")\n",
"optics_labels = optics.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(optics_labels):\n",
" if label == -1:\n",
" # Noise points\n",
" plt.plot(X[optics_labels == label].mean(axis=0).ravel(), label=\"Noise\", linestyle=\"--\")\n",
" else:\n",
" plt.plot(X[optics_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"OPTICS Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()"
]
},
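The density-based idea behind both DBSCAN and OPTICS can be sketched as a toy DBSCAN over a precomputed distance matrix. This is an editor's illustration in plain NumPy, not the notebook's code; the `eps` and `min_samples` values are chosen for the synthetic data.

```python
from collections import deque

import numpy as np


def dbscan_precomputed(D, eps, min_samples):
    """Toy DBSCAN on a precomputed distance matrix: core points have at
    least min_samples neighbours (self included) within eps; clusters
    grow outward from core points, everything unreachable is noise."""
    n = D.shape[0]
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_samples for nb in neighbours])
    labels = np.full(n, -1)  # -1 marks noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:  # only core points extend the cluster
                    queue.extend(neighbours[j])
        cluster += 1
    return labels


# Two dense 1-D blobs plus one far-away outlier
x = np.concatenate([np.linspace(0, 0.4, 6), np.linspace(10, 10.4, 6), [100.0]])
D = np.abs(x[:, None] - x[None, :])
labels = dbscan_precomputed(D, eps=0.2, min_samples=3)
print(labels)  # two clusters plus one noise point (-1) for the outlier
```

Passing `metric="precomputed"` to sklearn's `DBSCAN` or `OPTICS`, as the notebook does, substitutes an aeon distance matrix for the Euclidean distances these algorithms would otherwise compute.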
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Spectral Clustering**\n",
"[SpectralClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html) performs dimensionality reduction on the data before clustering in fewer dimensions. It requires a similarity matrix, so we'll convert our distance matrix accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"from sklearn.cluster import SpectralClustering\n",
"\n",
"# Convert distances to similarities: normalize to [0, 1] and invert,\n",
"# so identical series (distance 0) map to similarity 1\n",
"inverse_distance_matrix = 1 - (distance_matrix / distance_matrix.max())\n",
"\n",
"# Perform Spectral Clustering with affinity=\"precomputed\"\n",
"spectral = SpectralClustering(n_clusters=2, affinity=\"precomputed\", random_state=42)\n",
"spectral_labels = spectral.fit_predict(inverse_distance_matrix)\n",
"\n",
"# Visualising the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(spectral_labels):\n",
" plt.plot(X[spectral_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"Spectral Clustering with Normalized Similarity Matrix\")\n",
"plt.legend()\n",
"plt.show()"
]
},
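The distance-to-similarity step admits more than one recipe. Below is an editor's sketch of two common conversions: the max-normalised linear form used in the notebook, and a Gaussian (RBF) kernel with a median-distance bandwidth. The bandwidth choice and the random points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
pts = rng.normal(size=(6, 2))
# Symmetric distance matrix with a zero diagonal (Euclidean here)
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# 1) Max-normalised linear similarity, as in the notebook: diagonal is 1
S_linear = 1.0 - D / D.max()

# 2) Gaussian (RBF) kernel: exp(-d^2 / (2*sigma^2)), sigma = median distance
sigma = np.median(D[D > 0])
S_rbf = np.exp(-(D**2) / (2 * sigma**2))
```

Either matrix can be passed to `SpectralClustering(affinity="precomputed")`; the RBF form decays smoothly with distance, which often behaves better when a few very large distances would otherwise compress the linear similarities toward zero.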
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
3 changes: 2 additions & 1 deletion examples/distances/sklearn_distances.ipynb
Expand Up @@ -763,7 +763,8 @@
"collapsed": false
},
"source": [
"## Clustering with sklearn.cluster"
"## Clustering with sklearn.cluster\n",
"[Using aeon Distances with sklearn Clusterers](../clustering/sklearn_clustering_with_aeon_distances.ipynb)"
]
},
{
Expand Down