Commit 2c0cdd4

DOC fix typo and some refinement in plot_permutation_importance example (scikit-learn#30939)
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Parent commit: 5f50a87


examples/inspection/plot_permutation_importance.py

Lines changed: 15 additions & 14 deletions
@@ -95,11 +95,15 @@
 # %%
 # Accuracy of the Model
 # ---------------------
-# Prior to inspecting the feature importances, it is important to check that
-# the model predictive performance is high enough. Indeed there would be little
-# interest of inspecting the important features of a non-predictive model.
-#
-# Here one can observe that the train accuracy is very high (the forest model
+# Before inspecting the feature importances, it is important to check that
+# the model predictive performance is high enough. Indeed, there would be little
+# interest in inspecting the important features of a non-predictive model.
+
+print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
+print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")
+
+# %%
+# Here, one can observe that the train accuracy is very high (the forest model
 # has enough capacity to completely memorize the training set) but it can still
 # generalize well enough to the test set thanks to the built-in bagging of
 # random forests.
@@ -110,12 +114,9 @@
 # ``min_samples_leaf=10``) so as to limit overfitting while not introducing too
 # much underfitting.
 #
-# However let's keep our high capacity random forest model for now so as to
-# illustrate some pitfalls with feature importance on variables with many
+# However, let us keep our high capacity random forest model for now so that we can
+# illustrate some pitfalls about feature importance on variables with many
 # unique values.
-print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
-print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")
-

 # %%
 # Tree's Feature Importance from Mean Decrease in Impurity (MDI)
@@ -135,7 +136,7 @@
 #
 # The bias towards high cardinality features explains why the `random_num` has
 # a really large importance in comparison with `random_cat` while we would
-# expect both random features to have a null importance.
+# expect that both random features have a null importance.
 #
 # The fact that we use training set statistics explains why both the
 # `random_num` and `random_cat` features have a non-null importance.
@@ -155,11 +156,11 @@
 # %%
 # As an alternative, the permutation importances of ``rf`` are computed on a
 # held out test set. This shows that the low cardinality categorical feature,
-# `sex` and `pclass` are the most important feature. Indeed, permuting the
-# values of these features will lead to most decrease in accuracy score of the
+# `sex` and `pclass` are the most important features. Indeed, permuting the
+# values of these features will lead to the most decrease in accuracy score of the
 # model on the test set.
 #
-# Also note that both random features have very low importances (close to 0) as
+# Also, note that both random features have very low importances (close to 0) as
 # expected.
 from sklearn.inspection import permutation_importance

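For readers who want to try the computation the updated text refers to, the snippet below is a minimal sketch of computing permutation importances on the held-out test set. It assumes `rf`, `X_test`, and `y_test` are the fitted pipeline and test split defined earlier in the example, and that `X_test` is a pandas DataFrame; the `n_repeats`, `random_state`, and `n_jobs` values are illustrative, not taken from this diff.

# Minimal sketch (not part of this commit): permutation importance on the
# held-out test set, assuming `rf`, `X_test`, and `y_test` exist as in the example.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf,
    X_test,
    y_test,
    n_repeats=10,      # illustrative: repeat the shuffling to reduce variance
    random_state=42,   # illustrative seed for reproducible permutations
    n_jobs=2,          # illustrative: parallelize across features
)

# Each feature's importance is the mean drop in test accuracy observed when
# its values are shuffled, averaged over the repeats.
for name, mean, std in sorted(
    zip(X_test.columns, result.importances_mean, result.importances_std),
    key=lambda item: item[1],
    reverse=True,
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")

With a setup like this, one would expect `sex` and `pclass` to rank highest and the two random features to sit near zero, as the revised text in the diff describes.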