- [KDTree](https://github.com/gyrdym/ml_algo/blob/master/lib/src/retrieval/kd_tree/kd_tree.dart) An algorithm for efficient data retrieval.

For more information on the library's API, please visit the [API reference](https://pub.dev/documentation/ml_algo/latest/ml_algo/ml_algo-library.html)
````

</details>

### Decision tree-based classification
Let's try to classify data from the well-known [Iris](https://www.kaggle.com/datasets/uciml/iris) dataset using a non-linear algorithm - [decision trees](https://en.wikipedia.org/wiki/Decision_tree).
### KDTree-based data retrieval

Let's take a look at another field of machine learning - data retrieval. This field is represented by a family of algorithms, one of which, `KDTree`, is exposed by the library.

`KDTree` is an algorithm that partitions the whole search space into a binary tree, which makes data retrieval efficient.
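To make the partitioning idea concrete, here is a minimal, self-contained sketch of how a k-d tree is typically built: points are split at the median along one coordinate axis, with the axis alternating at each level of depth. This is only an illustration of the general technique, not ml_algo's actual `KDTree` implementation, whose internals may differ:

```dart
// Toy k-d partitioning sketch - NOT ml_algo's implementation.
class Node {
  final List<double> point;
  final Node? left;
  final Node? right;

  Node(this.point, this.left, this.right);
}

Node? build(List<List<double>> points, int depth) {
  if (points.isEmpty) return null;

  // Cycle through the axes as we descend the tree.
  final axis = depth % points.first.length;

  // Split at the median along the current axis.
  points.sort((a, b) => a[axis].compareTo(b[axis]));
  final median = points.length ~/ 2;

  return Node(
    points[median],
    build(points.sublist(0, median), depth + 1),
    build(points.sublist(median + 1), depth + 1),
  );
}

void main() {
  final root = build([
    [6.7, 3.1], [5.1, 3.5], [6.3, 3.3], [5.8, 2.7], [6.5, 3.0],
  ], 0);

  print(root?.point); // prints [6.3, 3.3] - the median along the first axis
}
```

Because every split halves the remaining points, a nearest-neighbour query can discard entire subtrees whose bounding region lies farther away than the best match found so far, which is what makes the retrieval efficient.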
Let's retrieve some data points through a kd-tree built on the [Iris](https://www.kaggle.com/datasets/uciml/iris) dataset.

First, we need to prepare the data by loading the dataset. For this purpose, we may use the [loadIrisDataset](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/loadIrisDataset.html) function from [ml_dataframe](https://pub.dev/packages/ml_dataframe). The function returns a DataFrame instance prefilled with the Iris data:
```dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final originalData = await loadIrisDataset();
}
```
Since the dataset contains an `Id` column, which carries no useful information, and a `Species` column, which contains text data, we need to drop these columns:
```dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final originalData = await loadIrisDataset();
  final data = originalData.dropSeries(names: ['Id', 'Species']);
}
```

Next, we can build the tree:

```dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final originalData = await loadIrisDataset();
  final data = originalData.dropSeries(names: ['Id', 'Species']);
  final tree = KDTree(data);
}
```
And query the nearest neighbours for an arbitrary point. Let's say we want to find the 5 nearest neighbours of the point `[6.5, 3.01, 4.5, 1.5]`:
```dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_linalg/vector.dart';

void main() async {
  final originalData = await loadIrisDataset();
  final data = originalData.dropSeries(names: ['Id', 'Species']);
  final tree = KDTree(data);
  final neighbourCount = 5;
  final point = Vector.fromList([6.5, 3.01, 4.5, 1.5]);
  final neighbours = tree.query(point, neighbourCount);

  print(neighbours);
}
```
The nearest point has index 75 in the original data. Let's check the record at that index:
```dart
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final originalData = await loadIrisDataset();

  print(originalData.rows.elementAt(75));
}
```
It prints the following:

```
(76, 6.6, 3.0, 4.4, 1.4, Iris-versicolor)
```
Remember, we dropped the `Id` and `Species` columns, which are the very first and the very last elements in the output. The remaining elements, `6.6, 3.0, 4.4, 1.4`, look quite similar to our target point, `6.5, 3.01, 4.5, 1.5`, so the query result makes sense.
If you want to use `KDTree` outside the ml_algo ecosystem, meaning you don't want to use the [ml_linalg](https://pub.dev/packages/ml_linalg) and [ml_dataframe](https://pub.dev/packages/ml_dataframe) packages in your application, you may import only the `KDTree` library and use the `fromIterable` constructor and the `queryIterable` method to perform the query:
```dart
import 'package:ml_algo/kd_tree.dart';

void main() async {
  final tree = KDTree.fromIterable([
    // some data here
  ]);
  final neighbourCount = 5;
  final neighbours = tree.queryIterable([/* some point here */], neighbourCount);

  print(neighbours);
}
```
## Models retraining
Someday, our previously shining model may degrade in terms of prediction accuracy; in that case, we can retrain it.