Download the dataset from [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database).
#### For a desktop application:
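To read the downloaded file, one can use the [ml_dataframe](https://pub.dev/packages/ml_dataframe) package. The snippet
below is a minimal sketch for a plain Dart console application; the CSV file name is an assumption - use the actual path
to your downloaded file:

```dart
import 'dart:io';

import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  // read the raw CSV content of the downloaded dataset
  // (the file name is an assumption - use your actual path)
  final rawCsvContent = await File('pima-indians-diabetes.csv').readAsString();

  // parse the raw content into a DataFrame with named columns
  final samples = DataFrame.fromRawCsv(rawCsvContent);

  print(samples.header);
}
```
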
### Prepare datasets for training and testing

Data in this file is represented by 768 records and 8 features. The 9th column is a label column; it contains either 0 or 1
on each row. This column is our target - we should predict a class label for each observation. The column's name is
`class variable (0 or 1)`. Let's store it:

````dart
final targetColumnName = 'class variable (0 or 1)';
````

Now it's time to prepare the data splits. Since we have a smallish dataset (only 768 records), we can't afford to
split the data into just train and test sets and evaluate the model on them; the best approach in our case is
cross-validation. According to this, let's split the data in the following way using the library's
[splitData](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/split_data.dart) function:

```dart
// 70% of the data goes to the validation set, the remaining 30% to the test set
final splits = splitData(samples, [0.7]);
final validationData = splits[0];
final testData = splits[1];
```

`splitData` accepts a `DataFrame` instance as the first argument and a ratio list as the second one. Now we have 70% of our
data as a validation set and 30% as a test set for evaluating the generalization error.
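
Note that the ratio argument is a list; a hypothetical three-way split, assuming `n` ratios produce `n + 1` subsets
(this behaviour is an assumption, not stated in the text above):

```dart
// a 60% / 20% / 20% split (illustrative ratios)
final threeWaySplits = splitData(samples, [0.6, 0.2]);
```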
### Set up a model selection algorithm

Then we may create an instance of the `CrossValidator` class to fit the [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
of our model. We should pass the validation data (our `validationData` variable) and the number of folds into the
`CrossValidator` constructor.

````dart
final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
````

Let's create a factory for the classifier with the desired hyperparameters. After the cross-validation, we have to decide
whether the selected hyperparameters are good enough:

```dart
final createClassifier = (DataFrame samples) =>
    LogisticRegressor(
      samples,
      targetColumnName,
      optimizerType: LinearOptimizerType.gradient,
      // the concrete values below are illustrative - tune them for your data
      iterationsLimit: 100,
      learningRateType: LearningRateType.timeBased,
      batchSize: samples.rows.length,
      probabilityThreshold: 0.7,
      // collect the cost value per iteration (see the note after the list below)
      collectLearningData: true,
    );
```

Let's describe our hyperparameters:

- `optimizerType` - the type of optimization algorithm that will be used to learn the coefficients of our model; this time we decided to use the vanilla gradient ascent algorithm
- `iterationsLimit` - the number of learning iterations. The selected optimization algorithm (gradient ascent in our case) will be run this number of times
- `learningRateType` - a strategy for learning rate updates. In our case, the learning rate will decrease after every iteration
- `batchSize` - the size of data (in rows) that will be used per iteration. As we have a really small dataset, we may use full-batch gradient ascent, which is why we used `samples.rows.length` here - the total amount of data
- `probabilityThreshold` - the lower bound for the positive label probability

The `collectLearningData` argument activates collecting the cost value on each optimization iteration, and you can see
the cost values right after the model creation.
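
For instance (a sketch, assuming the trained model exposes the collected values through a `costPerIteration` property):

```dart
final model = createClassifier(validationData);

// one cost value per optimization iteration
print(model.costPerIteration);
```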

### Evaluate the performance of the model

Assume we chose really good hyperparameters. In order to validate this hypothesis, let's use the `CrossValidator`
instance created before:

````dart
final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
````

Since the `CrossValidator` instance returns a [Vector](https://github.com/gyrdym/ml_linalg/blob/master/lib/vector.dart) of scores as a result of our predictor evaluation, we may choose
any way to reduce all the collected scores to a single number; for instance, we may use Vector's `mean` method:

```dart
final accuracy = scores.mean();

// the message format below is assumed from the sample output
print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');
```

We can see something like this:

````
accuracy on k fold validation: 0.65
````

Let's assess our hyperparameters on the test set in order to evaluate the model's generalization error. A minimal
sketch of this final check (assuming the classifier exposes an `assess` method for scoring a model against a metric):
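
```dart
// train the classifier on the whole validation set with the chosen
// hyperparameters, then score it on the held-out test set
final model = createClassifier(validationData);
final finalScore = model.assess(testData, MetricType.accuracy);

print(finalScore.toStringAsFixed(2));
```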

## Linear regression

Data in this file is represented by 505 records and 13 features. The 14th column is a target. Since we use autoheader, the
target's name is autogenerated and it is `col_13`. Let's store it in a variable:

````dart
final targetName = 'col_13';
````

Then, let's split the data into train and test subsets:

```dart
final splits = splitData(samples, [0.8]);
final trainData = splits[0];
final testData = splits[1];
```

`splitData` accepts a `DataFrame` instance as the first argument and a ratio list as the second one. Now we have 80% of our
data as a train set and 20% as a test set.
Let's train the model:

```dart
final model = LinearRegressor(trainData, targetName);
```

By default, `LinearRegressor` uses a closed-form solution to train the model. One can also use a different solution
type, e.g. the stochastic gradient descent algorithm:

```dart
// a sketch, assuming the library's SGD factory constructor
final model = LinearRegressor.SGD(trainData, targetName);
```


## Decision tree-based classification

Let's try to classify data from the well-known [Iris](https://www.kaggle.com/datasets/uciml/iris) dataset using a non-linear algorithm - [decision trees](https://en.wikipedia.org/wiki/Decision_tree).

First, you need to download the data and place it in a proper location in your file system. To do so, you should follow
the instructions given in the [Logistic regression](#logistic-regression) section.

After loading the data, we need to preprocess it. We should drop the `Id` column since it doesn't carry any useful
information. Also, we need to encode the `Species` column - originally it contains 3 repeated string labels, and to feed
it to the classifier we need to convert the labels into numbers:

```dart
// a sketch of the preprocessing step: the exact helper names below are
// assumptions based on the ml_dataframe and ml_preprocessing packages
final cleaned = samples.dropSeries(names: ['Id']);

// encode the string labels of the Species column into integers
final encoder = Encoder.label(cleaned, columnNames: ['Species']);
final processed = encoder.process(cleaned);
```

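With the data prepared, we can create the classifier itself. The snippet below is a sketch with illustrative
hyperparameter values; the three named parameters are the ones discussed right after it:

```dart
final model = DecisionTreeClassifier(
  processed,
  'Species',
  minError: 0.3,      // stop splitting a node if its error is small enough
  minSamplesCount: 5, // stop splitting a node with too few samples
  maxDepth: 4,        // the maximum depth of the tree
);
```
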
All the parameters serve as stopping criteria for the tree building algorithm.

Now we have a ready-to-use model. As usual, we can save the model to a JSON file:

```dart
await model.saveAsJson('path/to/json/file.json');
```

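Later on, the saved model can be restored (a sketch, assuming the library provides a `fromJson` factory):

```dart
final json = await File('path/to/json/file.json').readAsString();
final restoredModel = DecisionTreeClassifier.fromJson(json);
```
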
Unlike other models, in the case of a decision tree we can visualise the algorithm's result - we can save the model as
an SVG file:

```dart
await model.saveAsSvg('path/to/svg/file.svg');
```

Once we saved it, we can open the file with any image viewer, e.g. a web browser. An example of the resulting SVG image: