[DOC] Add example notebook for using aeon distances with sklearn clusterers #2511
@@ -0,0 +1,274 @@
{
> X = np.vstack(
>     (
>         np.random.normal(loc=[2, 2], scale=0.5, size=(50, 2)),
>         np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
>     )
> )

SebastianSchmidl (review): Use time series data like you did previously. As mentioned, there are unique challenges when it comes to visualising time series; we are not that interested in regular clustering here.

Author reply: Thanks for the feedback. I have updated the example to use time series data instead of the randomly generated clusters. Please check.
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# **Using aeon Distances with scikit-learn Clusterers**\n", | ||
"\n", | ||
"This notebook demonstrates how to integrate aeon’s distance metrics with hierarchical, density-based, and spectral clustering methods from scikit-learn. While aeon primarily supports partition-based clustering algorithms, such as $k$-means and $k$-medoids, its robust distance measures can be leveraged to enable other clustering techniques using scikit-learn.\n", | ||
"\n", | ||
"Broadly, clustering algorithms can be categorized into partition-based, hierarchical, density-based, and spectral methods. In this notebook, we focus on using aeon’s distance metrics with:\n", | ||
"1. **Hierarchical clustering**: `AgglomerativeClustering` with `metric=\"precomputed\"`.\n", | ||
"2. **Density-based clustering**: `DBSCAN` and `OPTICS` with `metric=\"precomputed\"`.\n", | ||
"3. **Spectral clustering**: `SpectralClustering` with `affinity=\"precomputed\"` and the inverse of the distance matrix.\n", | ||
"\n", | ||
"To measure similarity between time series and enable clustering, we use aeon’s precomputed distance matrices. For details about distance metrics, see the [distance examples](../distances/distances.ipynb).\n", | ||
"\n", | ||
"## **Contents**\n", | ||
"1. **Introduction**: Overview of clustering methods and motivation for this notebook.\n", | ||
"2. **Loading Data**: Using the `load_unit_test` dataset from aeon.\n", | ||
"3. **Computing Distance Matrices with aeon**: Precomputing distance matrices with aeon’s distance metrics.\n", | ||
"\n", | ||
"4. **Hierarchical Clustering**\n", | ||
" 4.1 sklearn.cluster.AgglomerativeClustering with metric=\"precomputed\"\n", | ||
"\n", | ||
"5. **Density-Based Clustering**\n", | ||
" 5.1 sklearn.cluster.DBSCAN with metric=\"precomputed\"\n", | ||
" 5.2 sklearn.cluster.OPTICS with metric=\"precomputed\"\n", | ||
"\n", | ||
"6. **Spectral Clustering**\n", | ||
" 6.1 sklearn.cluster.SpectralClustering with affinity=\"precomputed\"\n", | ||
" 6.2 Using the Inverse of the Distance Matrix\n", | ||
"\n", | ||
"## **Introduction**\n", | ||
"\n", | ||
"While aeon primarily focuses on partition-based clustering methods, it's possible to extend its capabilities by integrating its distance metrics with scikit-learn's clustering algorithms. This approach allows us to perform hierarchical, density-based, and spectral clustering on time series data using aeon's rich set of distance measures.\n", | ||
"\n", | ||
"## **Loading Data**\n", | ||
"\n", | ||
"We'll begin by loading a sample dataset. For this demonstration, we'll use the `load_unit_test` dataset from aeon." | ||
] | ||
}, | ||
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import & load data\n",
"from aeon.datasets import load_unit_test\n",
"X_train, y_train = load_unit_test(split=\"train\")\n",
"X_test, y_test = load_unit_test(split=\"test\")\n",
"\n",
"# For simplicity, we'll work with the training data\n",
"X = X_train\n",
"y = y_train\n",
"\n",
"print(f\"Data shape: {X.shape}\")\n",
"print(f\"Labels shape: {y.shape}\")"
]
},
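{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before clustering, it can help to look at the raw series. The quick sketch below is an addition, not part of the original workflow: it plots every training series coloured by its class label, so we can see the structure the clusterers should recover."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Assign one colour per class label\n",
"classes = np.unique(y)\n",
"colors = {c: f\"C{i}\" for i, c in enumerate(classes)}\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"for xi, yi in zip(X, y):\n",
"    # each case has shape (n_channels, n_timepoints); ravel to 1D for plotting\n",
"    plt.plot(xi.ravel(), color=colors[yi], alpha=0.5)\n",
"plt.title(\"unit_test training series coloured by class label\")\n",
"plt.show()"
]
},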
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Computing Distance Matrices with aeon**\n",
"aeon provides a variety of distance measures suitable for time series data. We'll compute the distance matrix using the Dynamic Time Warping (DTW) distance as an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from aeon.distances import pairwise_distance\n",
"\n",
"# Compute the pairwise distance matrix using DTW\n",
"distance_matrix = pairwise_distance(X, metric=\"dtw\")\n",
"\n",
"print(f\"Distance matrix shape: {distance_matrix.shape}\")\n"
]
},
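{
"cell_type": "markdown",
"metadata": {},
"source": [
"DTW is only one option. Any distance implemented in aeon can be swapped in through the same call; the short sketch below is an illustrative addition that computes a Move-Split-Merge (MSM) distance matrix instead, reusing the `metric` argument from above. The rest of the workflow is unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Any other aeon distance can be substituted via the metric argument,\n",
"# e.g. Move-Split-Merge (MSM)\n",
"msm_distance_matrix = pairwise_distance(X, metric=\"msm\")\n",
"\n",
"print(f\"MSM distance matrix shape: {msm_distance_matrix.shape}\")"
]
},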
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Hierarchical Clustering**\n",
"Hierarchical clustering builds a hierarchy of clusters, either by progressively merging or by splitting existing clusters. We'll use scikit-learn's `AgglomerativeClustering` with the precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import AgglomerativeClustering\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Perform agglomerative clustering; metric=\"precomputed\" tells sklearn the\n",
"# input is a distance matrix rather than raw feature vectors\n",
"agg_clustering = AgglomerativeClustering(\n",
"    n_clusters=2, metric=\"precomputed\", linkage=\"average\"\n",
")\n",
"labels = agg_clustering.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results by plotting each cluster's mean series\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(labels):\n",
"    # ravel the (n_channels, n_timepoints) mean down to 1D for plotting\n",
"    plt.plot(X[labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"Hierarchical Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
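{
"cell_type": "markdown",
"metadata": {},
"source": [
"The merge hierarchy itself can also be inspected. The sketch below is an addition (not from the original notebook): it converts the square DTW matrix to the condensed form scipy expects and draws a dendrogram using the same average linkage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.cluster.hierarchy import dendrogram, linkage\n",
"from scipy.spatial.distance import squareform\n",
"\n",
"# scipy's linkage expects a condensed distance matrix; checks=False\n",
"# tolerates small numerical asymmetries in the DTW matrix\n",
"condensed = squareform(distance_matrix, checks=False)\n",
"linkage_matrix = linkage(condensed, method=\"average\")\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"dendrogram(linkage_matrix)\n",
"plt.title(\"Dendrogram (average linkage, DTW distance)\")\n",
"plt.show()"
]
},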
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Density-Based Clustering**\n",
"Density-based clustering identifies clusters based on the density of data points in the feature space. We'll demonstrate this using scikit-learn's `DBSCAN` and `OPTICS` algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **DBSCAN**\n",
"\n",
"DBSCAN is a density-based clustering algorithm that groups data points based on their density connectivity, labelling points in low-density regions as noise (cluster `-1`).\n",
"We use the `DBSCAN` algorithm from scikit-learn with a precomputed distance matrix. Note that `eps` is measured on the scale of the DTW distances, so it usually needs tuning; a heuristic for choosing it is sketched after the code below.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import DBSCAN\n",
"\n",
"# Perform DBSCAN clustering; eps=0.5 is illustrative and should be tuned\n",
"# to the scale of the precomputed distances\n",
"dbscan = DBSCAN(eps=0.5, min_samples=5, metric=\"precomputed\")\n",
"dbscan_labels = dbscan.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(dbscan_labels):\n",
"    if label == -1:\n",
"        # Noise points\n",
"        plt.plot(X[dbscan_labels == label].mean(axis=0).ravel(), label=\"Noise\", linestyle=\"--\")\n",
"    else:\n",
"        plt.plot(X[dbscan_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"DBSCAN Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
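{
"cell_type": "markdown",
"metadata": {},
"source": [
"With an elastic distance such as DTW, `eps=0.5` may be far from the data's scale, leaving every point labelled as noise. A common heuristic, added here as a sketch, is to look at each series' distance to its $k$-th nearest neighbour and pick `eps` near the knee of the sorted curve."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# k-distance heuristic for eps: distance from each series to its k-th\n",
"# nearest neighbour (column 0 of each sorted row is the self-distance of 0)\n",
"k = 5\n",
"kth_distances = np.sort(distance_matrix, axis=1)[:, k]\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"plt.plot(np.sort(kth_distances))\n",
"plt.title(f\"Sorted {k}-NN DTW distances (look for a knee to choose eps)\")\n",
"plt.ylabel(\"Distance\")\n",
"plt.show()\n",
"\n",
"print(f\"Median {k}-NN distance: {np.median(kth_distances):.3f}\")"
]
},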
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **OPTICS**\n",
"OPTICS is a density-based clustering algorithm similar to DBSCAN, but it handles clusters of varying density better because it orders points by reachability rather than relying on a single global `eps`.\n",
"We use the `OPTICS` algorithm from scikit-learn with a precomputed distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import OPTICS\n",
"\n",
"# Perform OPTICS clustering\n",
"optics = OPTICS(min_samples=5, metric=\"precomputed\")\n",
"optics_labels = optics.fit_predict(distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(optics_labels):\n",
"    if label == -1:\n",
"        # Noise points\n",
"        plt.plot(X[optics_labels == label].mean(axis=0).ravel(), label=\"Noise\", linestyle=\"--\")\n",
"    else:\n",
"        plt.plot(X[optics_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"OPTICS Clustering with DTW Distance\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
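{
"cell_type": "markdown",
"metadata": {},
"source": [
"OPTICS also exposes the reachability ordering it computes, which is often more informative than the flat labels. The sketch below is an addition that uses the fitted `reachability_` and `ordering_` attributes; valleys in the plot correspond to clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reachability plot: points are shown in the OPTICS ordering, and each\n",
"# valley of low reachability distance corresponds to a cluster\n",
"reachability = optics.reachability_[optics.ordering_]\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"plt.plot(reachability)\n",
"plt.title(\"OPTICS Reachability Plot (DTW distance)\")\n",
"plt.ylabel(\"Reachability distance\")\n",
"plt.show()"
]
},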
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Spectral Clustering**\n",
"Spectral clustering embeds the data in a low-dimensional space derived from a similarity (affinity) matrix, then clusters in that space. Because it requires similarities rather than distances, we first convert our distance matrix accordingly: here we take the element-wise inverse, so small distances become large affinities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import SpectralClustering\n",
"import numpy as np\n",
"\n",
"# The distance matrix has zeros on the diagonal (and possibly elsewhere),\n",
"# so add a small constant to avoid division by zero when inverting\n",
"epsilon = 1e-10\n",
"inverse_distance_matrix = 1 / (distance_matrix + epsilon)\n",
"\n",
"# Perform spectral clustering with affinity=\"precomputed\"\n",
"spectral = SpectralClustering(\n",
"    n_clusters=2, affinity=\"precomputed\", random_state=42\n",
")\n",
"spectral_labels = spectral.fit_predict(inverse_distance_matrix)\n",
"\n",
"# Visualize the clustering results\n",
"plt.figure(figsize=(10, 6))\n",
"for label in np.unique(spectral_labels):\n",
"    plt.plot(X[spectral_labels == label].mean(axis=0).ravel(), label=f\"Cluster {label}\")\n",
"plt.title(\"Spectral Clustering with Inverse Distance Matrix\")\n",
"plt.legend()\n",
"plt.show()\n"
]
},
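{
"cell_type": "markdown",
"metadata": {},
"source": [
"The element-wise inverse is not the only way to turn distances into affinities. A Gaussian (RBF) kernel is a common alternative; the sketch below is an addition that assumes the median pairwise distance as a rough bandwidth, and keeps affinities bounded in $(0, 1]$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative affinity: Gaussian (RBF) kernel on the DTW distances.\n",
"# sigma is a bandwidth parameter; the median pairwise distance is a\n",
"# common, if rough, default\n",
"sigma = np.median(distance_matrix)\n",
"affinity_matrix = np.exp(-(distance_matrix**2) / (2 * sigma**2))\n",
"\n",
"spectral_rbf = SpectralClustering(\n",
"    n_clusters=2, affinity=\"precomputed\", random_state=42\n",
")\n",
"rbf_labels = spectral_rbf.fit_predict(affinity_matrix)\n",
"\n",
"print(f\"Cluster sizes: {np.bincount(rbf_labels)}\")"
]
},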
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **References**\n", | ||
"\n", | ||
"[1] Christopher Holder, Matthew Middlehurst, and Anthony Bagnall. A Review and Evaluation of Elastic Distance Functions for Time Series Clustering, Knowledge and Information Systems. In Press (2023).\n", | ||
"\n", | ||
"[2] Christopher Holder, David Guijo-Rubio, and Anthony Bagnall. Barycentre averaging for the move-split-merge time series distance measure. 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (2023).\n", | ||
"\n", | ||
"[3] Kaufman, Leonard & Rousseeuw, Peter. (1986). Clustering Large Data Sets. 10.1016/B978-0-444-87877-9.50039-X.\n", | ||
"\n", | ||
"[4] R. T. Ng and Jiawei Han. \"CLARANS: a method for clustering objects spatial data mining.\" IEEE Transactions on Knowledge and Data Engineering vol. 14, no. 5, pp. 1003-1016, Sept.-Oct. 2002, doi: 10.1109/TKDE.2002.1033770.\n", | ||
"\n", | ||
"[5] Paparrizos, John, and Luis Gravano. \"Fast and Accurate Time-Series Clustering.\" ACM Transactions on Database Systems 42, no. 2 (2017): 8:1-8:49.\n", | ||
"\n", | ||
"[6] F. Petitjean, A. Ketterlin and P. Gancarski. “A global averaging method for dynamic time warping, with applications to clustering,” Pattern Recognition, vol. 44, pp. 678-693, 2011.\n", | ||
"\n", | ||
"Generated using [nbsphinx](https://nbsphinx.readthedocs.io/). The Jupyter notebook can be found here[here](sklearn_clustering_with_aeon_distances.html)." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": ".venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |