@@ -155,8 +155,10 @@ cdef class BaseCriterion:
 
         This method computes the improvement in impurity when a split occurs.
         The weighted impurity improvement equation is the following:
+
             N_t / N * (impurity - N_t_R / N_t * right_impurity
                                 - N_t_L / N_t * left_impurity)
+
         where N is the total number of samples, N_t is the number of samples
         at the current node, N_t_L is the number of samples in the left child,
         and N_t_R is the number of samples in the right child,
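As a hypothetical illustration (plain Python, not the Cython code in this diff), the weighted impurity improvement formula quoted in the hunk above can be sketched with made-up numbers:

```python
# Sketch of the docstring's formula; all argument values below are illustrative.
def impurity_improvement(n, n_t, n_t_l, n_t_r,
                         impurity_parent, impurity_left, impurity_right):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity
                           - N_t_L / N_t * left_impurity)"""
    return (n_t / n) * (impurity_parent
                        - (n_t_r / n_t) * impurity_right
                        - (n_t_l / n_t) * impurity_left)

# A node holding half the samples, split into two pure children:
gain = impurity_improvement(n=100, n_t=50, n_t_l=30, n_t_r=20,
                            impurity_parent=0.48,
                            impurity_left=0.0, impurity_right=0.0)
print(gain)  # 0.5 * 0.48 = 0.24
```

The `N_t / N` weighting makes improvements at small nodes count proportionally less than the same impurity drop at large nodes.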
@@ -165,8 +167,10 @@ cdef class BaseCriterion:
         ----------
         impurity_parent : double
             The initial impurity of the parent node before the split
+
         impurity_left : double
             The impurity of the left child
+
         impurity_right : double
             The impurity of the right child
 
@@ -611,10 +615,13 @@ cdef class Entropy(ClassificationCriterion):
     This handles cases where the target is a classification taking values
     0, 1, ... K-2, K-1. If node m represents a region Rm with Nm observations,
     then let
+
         count_k = 1 / Nm \sum_{x_i in Rm} I(yi = k)
+
     be the proportion of class k observations in node m.
 
     The cross-entropy is then defined as
+
         cross-entropy = -\sum_{k=0}^{K-1} count_k log(count_k)
     """
 
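A hypothetical plain-Python sketch of the cross-entropy formula in the docstring above, using the usual convention that `0 * log(0)` contributes zero:

```python
from math import log

def cross_entropy(counts):
    """cross-entropy = -\\sum_{k=0}^{K-1} count_k log(count_k)."""
    n_m = sum(counts)
    proportions = [c / n_m for c in counts]
    # Terms with count_k == 0 are skipped (0 * log(0) := 0).
    return -sum(p * log(p) for p in proportions if p > 0.0)

print(cross_entropy([10, 0]))  # pure node: 0.0
print(cross_entropy([5, 5]))   # 50/50 node: log(2) ~= 0.693
```

A pure node scores zero, and the maximally mixed two-class node scores `log(2)`, matching the criterion's role as a splitting objective to minimize.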
@@ -1058,10 +1065,14 @@ cdef class MSE(RegressionCriterion):
 
         The absolute impurity improvement is only computed by the
         impurity_improvement method once the best split has been found.
+
         The MSE proxy is derived from
+
             sum_{i left}(y_i - y_pred_L)^2 + sum_{i right}(y_i - y_pred_R)^2
             = sum(y_i^2) - n_L * mean_{i left}(y_i)^2 - n_R * mean_{i right}(y_i)^2
+
         Neglecting constant terms, this gives:
+
             - 1/n_L * sum_{i left}(y_i)^2 - 1/n_R * sum_{i right}(y_i)^2
         """
         cdef SIZE_t k
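The derivation in the hunk above says the proxy differs from the full within-node squared error only by the constant `sum(y_i^2)`, so both rank candidate splits identically. A hypothetical plain-Python check (the data here is made up):

```python
def mse_proxy(y_left, y_right):
    # - 1/n_L * sum_{i left}(y_i)^2 - 1/n_R * sum_{i right}(y_i)^2
    return (- sum(y_left) ** 2 / len(y_left)
            - sum(y_right) ** 2 / len(y_right))

def within_node_sse(y_left, y_right):
    # Full criterion: sum of squared deviations from each child's mean.
    sse = 0.0
    for part in (y_left, y_right):
        mean = sum(part) / len(part)
        sse += sum((y - mean) ** 2 for y in part)
    return sse

y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
const = sum(v * v for v in y)  # the neglected constant sum(y_i^2)
for pos in (2, 3, 4):
    left, right = y[:pos], y[pos:]
    # SSE = sum(y_i^2) + proxy, for every split position.
    print(pos, within_node_sse(left, right) - mse_proxy(left, right))
```

Since the two objectives differ by a split-independent constant, minimizing one minimizes the other, which is why the cheaper proxy suffices during the split search.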
@@ -1139,6 +1150,7 @@ cdef class MAE(RegressionCriterion):
         ----------
         n_outputs : SIZE_t
             The number of targets to be predicted
+
         n_samples : SIZE_t
             The total number of samples to fit on
         """
@@ -1429,6 +1441,7 @@ cdef class FriedmanMSE(MSE):
     """Mean squared error impurity criterion with improvement score by Friedman.
 
     Uses the formula (35) in Friedman's original Gradient Boosting paper:
+
         diff = mean_left - mean_right
         improvement = n_left * n_right * diff^2 / (n_left + n_right)
     """
@@ -1483,6 +1496,7 @@ cdef class Poisson(RegressionCriterion):
     """Half Poisson deviance as impurity criterion.
 
     Poisson deviance = 2/n * sum(y_true * log(y_true/y_pred) + y_pred - y_true)
+
     Note that the deviance is >= 0, and since we have `y_pred = mean(y_true)`
     at the leaves, one always has `sum(y_pred - y_true) = 0`. It remains the
     implemented impurity (factor 2 is skipped):
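A hypothetical plain-Python sketch of the deviance formula quoted above (with the skipped factor 2 and the convention that `y * log(y/mu)` is zero when `y == 0`); the sample values are made up:

```python
from math import log

def half_poisson_deviance(y_true, y_pred):
    """1/n * sum(y * log(y / mu) + mu - y), i.e. the deviance without the 2."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        total += (y * log(y / mu) if y > 0 else 0.0) + mu - y
    return total / len(y_true)

y = [0.0, 1.0, 2.0, 5.0]
mu = sum(y) / len(y)  # leaf prediction: y_pred = mean(y_true)
# With this prediction, the (mu - y) terms cancel: sum(y_pred - y_true) = 0.
print(half_poisson_deviance(y, [mu] * len(y)))
```

When every prediction equals its target, every term vanishes and the deviance is exactly zero, consistent with it being a nonnegative impurity.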
@@ -1519,12 +1533,16 @@ cdef class Poisson(RegressionCriterion):
 
         The absolute impurity improvement is only computed by the
         impurity_improvement method once the best split has been found.
+
         The Poisson proxy is derived from:
+
               sum_{i left }(y_i * log(y_i / y_pred_L))
             + sum_{i right}(y_i * log(y_i / y_pred_R))
             = sum(y_i * log(y_i) - n_L * mean_{i left}(y_i) * log(mean_{i left}(y_i))
                                  - n_R * mean_{i right}(y_i) * log(mean_{i right}(y_i))
+
         Neglecting constant terms, this gives
+
             - sum{i left }(y_i) * log(mean{i left}(y_i))
             - sum{i right}(y_i) * log(mean{i right}(y_i))
         """