@@ -1,16 +1,16 @@
 """
-==========================================
-Target Encoder's Internal Cross Validation
-==========================================
+=======================================
+Target Encoder's Internal Cross fitting
+=======================================
 
 .. currentmodule:: sklearn.preprocessing
 
 The :class:`TargetEncoder` replaces each category of a categorical feature with
 the mean of the target variable for that category. This method is useful
 in cases where there is a strong relationship between the categorical feature
 and the target. To prevent overfitting, :meth:`TargetEncoder.fit_transform` uses
-interval cross validation to encode the training data to be used by a downstream
-model. In this example, we demonstrate the importance of the cross validation
+an internal cross fitting scheme to encode the training data to be used by a
+downstream model. In this example, we demonstrate the importance of the cross fitting
 procedure to prevent overfitting.
 """
 
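
Not part of the diff, but useful context for the rename: a minimal, self-contained sketch of what TargetEncoder computes and why fit_transform (with the internal cross fitting) differs from fit followed by transform. The toy column, target, and variable names below are invented for illustration and do not appear in the example file.

    import numpy as np
    from sklearn.preprocessing import TargetEncoder

    # Toy data (invented for illustration): category "a" tends to have a higher target.
    rng = np.random.RandomState(0)
    categories = rng.choice(["a", "b", "c"], size=50)
    X = categories.reshape(-1, 1)
    y = rng.normal(size=50) + (categories == "a")

    enc = TargetEncoder(random_state=0)
    # fit_transform: each row is encoded by an encoder fitted on the folds that do
    # not contain that row, so a row's own target never leaks into its encoding.
    X_cross_fitted = enc.fit_transform(X, y)
    # transform after the encoder is fitted: smoothed per-category target means
    # computed on the full data, so the two results generally differ.
    X_full_fit = enc.transform(X)
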
@@ -49,11 +49,11 @@
 
 # %%
 # The uninformative feature with high cardinality is generated so that it is independent of
-# the target variable. We will show that target encoding without cross validation will
+# the target variable. We will show that target encoding without cross fitting will
 # cause catastrophic overfitting for the downstream regressor. These high cardinality
 # features are basically unique identifiers for samples which should generally be
-# removed from machine learning datasets. In this example, we generate them to show how
-# :class:`TargetEncoder`'s default cross validation behavior mitigates the overfitting
+# removed from machine learning datasets. In this example, we generate them to show how
+# :class:`TargetEncoder`'s default cross fitting behavior mitigates the overfitting
 # issue automatically.
 X_near_unique_categories = rng.choice(
     int(0.9 * n_samples), size=n_samples, replace=True
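
The names rng and n_samples come from the data-generation section of the example, which is outside this diff. A self-contained sketch of the same idea, with assumed placeholder values for those names, looks like this:

    import numpy as np

    # Assumed placeholders: the example defines its own RandomState and sample count.
    rng = np.random.RandomState(0)
    n_samples = 50_000

    # Draw integer codes from a pool of 0.9 * n_samples values, with replacement:
    # almost every sample ends up with its own (near-unique) category, and the
    # draw is independent of the target, so the feature carries no real signal.
    X_near_unique_categories = rng.choice(
        int(0.9 * n_samples), size=n_samples, replace=True
    )
    print(np.unique(X_near_unique_categories).size, "distinct values among", n_samples, "samples")
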
@@ -79,7 +79,7 @@
 # ==========================
 # In this section, we train a ridge regressor on the dataset with and without
 # encoding and explore the influence of target encoder with and without the
-# interval cross validation. First, we see the Ridge model trained on the
+# internal cross fitting. First, we see the Ridge model trained on the
 # raw features will have low performance, because the order of the informative
 # feature is not informative:
 import sklearn
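
A rough, self-contained sketch of this baseline step. The real example's X_train, X_test, y_train and y_test are built from the features generated earlier, so the single noisy high-cardinality column below is only an assumed stand-in:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Assumed stand-in data: raw integer category codes and a noise target.
    rng = np.random.RandomState(0)
    X = rng.choice(1800, size=(2000, 1))
    y = rng.normal(size=2000)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Ridge on the raw codes: their numeric order says nothing about y,
    # so both scores stay close to zero.
    raw_model = Ridge()
    raw_model.fit(X_train, y_train)
    print("train R^2:", raw_model.score(X_train, y_train))
    print("test R^2:", raw_model.score(X_test, y_test))
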
@@ -96,7 +96,7 @@
 
 # %%
 # Next, we create a pipeline with the target encoder and ridge model. The pipeline
-# uses :meth:`TargetEncoder.fit_transform` which uses cross validation. We see that
+# uses :meth:`TargetEncoder.fit_transform` which uses cross fitting. We see that
 # the model fits the data well and generalizes to the test set:
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import TargetEncoder
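
Again outside the diff, a self-contained sketch of this step with stand-in data. The key point is that Pipeline.fit calls the encoder's fit_transform, so the downstream ridge model is trained on cross-fitted encodings:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import TargetEncoder

    # Assumed stand-in data: a near-unique categorical column and a noise target.
    rng = np.random.RandomState(0)
    X = rng.choice(1800, size=(2000, 1)).astype(str)
    y = rng.normal(size=2000)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Pipeline.fit calls TargetEncoder.fit_transform, so the ridge model never
    # sees an encoding that was computed from a row's own target value.
    model_with_cf = make_pipeline(TargetEncoder(random_state=0), Ridge())
    model_with_cf.fit(X_train, y_train)
    print("train R^2:", model_with_cf.score(X_train, y_train))
    print("test R^2:", model_with_cf.score(X_test, y_test))
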
@@ -120,11 +120,11 @@
 _ = coefs_cv.plot(kind="barh")
 
 # %%
-# While :meth:`TargetEncoder.fit_transform` uses an interval cross validation,
-# :meth:`TargetEncoder.transform` itself does not perform any cross validation.
+# While :meth:`TargetEncoder.fit_transform` uses an internal cross fitting scheme,
+# :meth:`TargetEncoder.transform` itself does not perform any cross fitting.
 # It uses the aggregation of the complete training set to transform the categorical
 # features. Thus, we can use :meth:`TargetEncoder.fit` followed by
-# :meth:`TargetEncoder.transform` to disable the cross validation. This encoding
+# :meth:`TargetEncoder.transform` to disable the cross fitting. This encoding
 # is then passed to the ridge model.
 target_encoder = TargetEncoder(random_state=0)
 target_encoder.fit(X_train, y_train)
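
For completeness, a self-contained sketch of this "no cross fitting" variant with the same kind of stand-in data as above; the thing to look for is the gap between training and test scores when fit followed by transform is used instead of fit_transform:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import TargetEncoder

    # Assumed stand-in data: a near-unique categorical column and a noise target.
    rng = np.random.RandomState(0)
    X = rng.choice(1800, size=(2000, 1)).astype(str)
    y = rng.normal(size=2000)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # fit then transform: every training row is encoded with statistics that
    # include its own target value, so the target leaks into the features.
    target_encoder = TargetEncoder(random_state=0)
    target_encoder.fit(X_train, y_train)
    X_train_no_cf = target_encoder.transform(X_train)
    X_test_no_cf = target_encoder.transform(X_test)

    no_cf_model = Ridge().fit(X_train_no_cf, y_train)
    print("train R^2:", no_cf_model.score(X_train_no_cf, y_train))  # benefits from the leak
    print("test R^2:", no_cf_model.score(X_test_no_cf, y_test))     # the feature is pure noise
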
@@ -154,8 +154,8 @@
 # %%
 # Conclusion
 # ==========
-# This example demonstrates the importance of :class:`TargetEncoder`'s interval cross
-# validation. It is important to use :meth:`TargetEncoder.fit_transform` to encode
+# This example demonstrates the importance of :class:`TargetEncoder`'s internal cross
+# fitting. It is important to use :meth:`TargetEncoder.fit_transform` to encode
 # training data before passing it to a machine learning model. When a
 # :class:`TargetEncoder` is a part of a :class:`~sklearn.pipeline.Pipeline` and the
 # pipeline is fitted, the pipeline will correctly call