[MRG] Extended explanation of using class_weight in RandomForestClassifier (Issue scikit-learn#6646) (scikit-learn#8838)

Lera · jnothman · commit 8717a159ab6a · 2017-05-09T18:21:12.000+10:00
* Extended explanation of using class_weight in RandomForestClassifier

* Extended explanation of using class_weight in DecisionTreeClassifier,ExtraTreesClassifier and compute_sample_weight()

* Rephrased description.

* Rephrased description (remove "indicator")
diff --git a/sklearn/ensemble/forest.py b/sklearn/ensemble/forest.py
@@ -863,6 +863,12 @@ class RandomForestClassifier(ForestClassifier):
         multi-output problems, a list of dicts can be provided in the same
         order as the columns of y.
 
+        Note that for multioutput (including multilabel) weights should be
+        defined for each class of every column in its own dict. For example,
+        for four-class multilabel classification weights should be
+        [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
+        [{1:1}, {2:5}, {3:1}, {4:1}].
+
         The "balanced" mode uses the values of y to automatically adjust
         weights inversely proportional to class frequencies in the input data
         as ``n_samples / (n_classes * np.bincount(y))``
@@ -1306,6 +1312,12 @@ class ExtraTreesClassifier(ForestClassifier):
         multi-output problems, a list of dicts can be provided in the same
         order as the columns of y.
 
+        Note that for multioutput (including multilabel) weights should be
+        defined for each class of every column in its own dict. For example,
+        for four-class multilabel classification weights should be
+        [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
+        [{1:1}, {2:5}, {3:1}, {4:1}].
+
         The "balanced" mode uses the values of y to automatically adjust
         weights inversely proportional to class frequencies in the input data
         as ``n_samples / (n_classes * np.bincount(y))``
diff --git a/sklearn/tree/tree.py b/sklearn/tree/tree.py
@@ -587,6 +587,12 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin):
         multi-output problems, a list of dicts can be provided in the same
         order as the columns of y.
 
+        Note that for multioutput (including multilabel) weights should be
+        defined for each class of every column in its own dict. For example,
+        for four-class multilabel classification weights should be
+        [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
+        [{1:1}, {2:5}, {3:1}, {4:1}].
+
         The "balanced" mode uses the values of y to automatically adjust
         weights inversely proportional to class frequencies in the input data
         as ``n_samples / (n_classes * np.bincount(y))``
diff --git a/sklearn/utils/class_weight.py b/sklearn/utils/class_weight.py
@@ -84,6 +84,12 @@ def compute_sample_weight(class_weight, y, indices=None):
         multi-output problems, a list of dicts can be provided in the same
         order as the columns of y.
 
+        Note that for multioutput (including multilabel) weights should be
+        defined for each class of every column in its own dict. For example,
+        for four-class multilabel classification weights should be
+        [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
+        [{1:1}, {2:5}, {3:1}, {4:1}].
+
         The "balanced" mode uses the values of y to automatically adjust
         weights inversely proportional to class frequencies in the input data:
         ``n_samples / (n_classes * np.bincount(y))``.