Commit 8ccaf0d

StefanieSenger, ArturoAmorQ, adrinjalali, and ogrisel authored
DOC Add links to preprocessing examples in docstrings and userguide (scikit-learn#26877)
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
1 parent 4906ce5 commit 8ccaf0d

File tree

6 files changed: +78 −50 lines changed

doc/modules/preprocessing.rst
examples/preprocessing/plot_all_scaling.py
sklearn/preprocessing/_data.py
sklearn/preprocessing/_discretization.py
sklearn/preprocessing/_encoders.py
sklearn/preprocessing/_target_encoder.py

doc/modules/preprocessing.rst

Lines changed: 4 additions & 3 deletions
@@ -10,9 +10,10 @@ The ``sklearn.preprocessing`` package provides several common
 utility functions and transformer classes to change raw feature vectors
 into a representation that is more suitable for the downstream estimators.
 
-In general, learning algorithms benefit from standardization of the data set. If
-some outliers are present in the set, robust scalers or transformers are more
-appropriate. The behaviors of the different scalers, transformers, and
+In general, many learning algorithms such as linear models benefit from standardization of the data set
+(see :ref:`sphx_glr_auto_examples_preprocessing_plot_scaling_importance.py`).
+If some outliers are present in the set, robust scalers or other transformers can
+be more appropriate. The behaviors of the different scalers, transformers, and
 normalizers on a dataset containing marginal outliers is highlighted in
 :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.

examples/preprocessing/plot_all_scaling.py

Lines changed: 14 additions & 0 deletions
@@ -265,6 +265,8 @@ def make_plot(item_idx):
 make_plot(0)
 
 # %%
+# .. _plot_all_scaling_standard_scaler_section:
+#
 # StandardScaler
 # --------------
 #
@@ -285,6 +287,8 @@ def make_plot(item_idx):
 make_plot(1)
 
 # %%
+# .. _plot_all_scaling_minmax_scaler_section:
+#
 # MinMaxScaler
 # ------------
 #
@@ -301,6 +305,8 @@ def make_plot(item_idx):
 make_plot(2)
 
 # %%
+# .. _plot_all_scaling_max_abs_scaler_section:
+#
 # MaxAbsScaler
 # ------------
 #
@@ -318,6 +324,8 @@ def make_plot(item_idx):
 make_plot(3)
 
 # %%
+# .. _plot_all_scaling_robust_scaler_section:
+#
 # RobustScaler
 # ------------
 #
@@ -335,6 +343,8 @@ def make_plot(item_idx):
 make_plot(4)
 
 # %%
+# .. _plot_all_scaling_power_transformer_section:
+#
 # PowerTransformer
 # ----------------
 #
@@ -353,6 +363,8 @@ def make_plot(item_idx):
 make_plot(6)
 
 # %%
+# .. _plot_all_scaling_quantile_transformer_section:
+#
 # QuantileTransformer (uniform output)
 # ------------------------------------
 #
@@ -384,6 +396,8 @@ def make_plot(item_idx):
 make_plot(8)
 
 # %%
+# .. _plot_all_scaling_normalizer_section:
+#
 # Normalizer
 # ----------
 #
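Editor's note: the labels added above let docstrings deep-link into individual sections of this example. What those sections visualize can be reproduced in miniature; the toy column below, with one marginal outlier, is an illustrative assumption:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one marginal outlier
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))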

sklearn/preprocessing/_data.py

Lines changed: 41 additions & 46 deletions
@@ -191,8 +191,7 @@ def scale(X, *, axis=0, with_mean=True, with_std=True, copy=True):
 affect model performance.
 
 For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
+see: :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
 
 .. warning:: Risk of data leak
@@ -294,6 +293,12 @@ class MinMaxScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 This transformation is often used as an alternative to zero mean,
 unit variance scaling.
 
+`MinMaxScaler` doesn't reduce the effect of outliers, but it linearly
+scales them down into a fixed range, where the largest occurring data point
+corresponds to the maximum value and the smallest one corresponds to the
+minimum value. For an example visualization, refer to :ref:`Compare
+MinMaxScaler with other scalers <plot_all_scaling_minmax_scaler_section>`.
+
 Read more in the :ref:`User Guide <preprocessing_scaler>`.
 
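Editor's note: a minimal sketch of the behavior the new paragraph describes — the outlier becomes the maximum (1.0) and the inliers are squeezed toward 0. The toy column is an illustrative assumption:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [1000.0]])
print(MinMaxScaler().fit_transform(X).ravel().round(3))
# [0.    0.001 0.002 1.   ]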
Parameters
@@ -367,10 +372,6 @@ class MinMaxScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 NaNs are treated as missing values: disregarded in fit, and maintained in
 transform.
 
-For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
-
 Examples
 --------
 >>> from sklearn.preprocessing import MinMaxScaler
@@ -641,8 +642,7 @@ def minmax_scale(X, feature_range=(0, 1), *, axis=0, copy=True):
 Notes
 -----
 For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
+see: :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
 """
 # Unlike the scaler object, this function allows 1d input.
 # If copy is required, it will be done inside the scaler object.
@@ -695,6 +695,11 @@ class StandardScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 than others, it might dominate the objective function and make the
 estimator unable to learn from other features correctly as expected.
 
+`StandardScaler` is sensitive to outliers, and the features may scale
+differently from each other in the presence of outliers. For an example
+visualization, refer to :ref:`Compare StandardScaler with other scalers
+<plot_all_scaling_standard_scaler_section>`.
+
 This scaler can also be applied to sparse CSR or CSC matrices by passing
 `with_mean=False` to avoid breaking the sparsity structure of the data.
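Editor's note: a sketch of the outlier sensitivity the new paragraph warns about — a single extreme value drags both the fitted mean and the scale. The toy column is an illustrative assumption:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [1000.0]])
scaler = StandardScaler().fit(X)
print(scaler.mean_, scaler.scale_)            # both dominated by 1000.0
print(scaler.transform(X).ravel().round(2))   # inliers end up compressed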
@@ -776,10 +781,6 @@ class StandardScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 `numpy.std(x, ddof=0)`. Note that the choice of `ddof` is unlikely to
 affect model performance.
 
-For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
-
 Examples
 --------
 >>> from sklearn.preprocessing import StandardScaler
@@ -1093,6 +1094,10 @@ class MaxAbsScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 
 This scaler can also be applied to sparse CSR or CSC matrices.
 
+`MaxAbsScaler` doesn't reduce the effect of outliers; it only linearly
+scales them down. For an example visualization, refer to :ref:`Compare
+MaxAbsScaler with other scalers <plot_all_scaling_max_abs_scaler_section>`.
+
 .. versionadded:: 0.17
 
 Parameters
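Editor's note: a sketch of the new paragraph — each column is divided by its maximum absolute value, so outliers are only linearly scaled down and zeros stay zeros (sparsity is preserved). The toy matrix is an illustrative assumption:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[-2.0, 0.0], [1.0, 0.0], [4.0, 50.0]])
print(MaxAbsScaler().fit_transform(X))  # column 0 divided by 4, column 1 by 50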
@@ -1136,10 +1141,6 @@ class MaxAbsScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 NaNs are treated as missing values: disregarded in fit, and maintained in
 transform.
 
-For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
-
 Examples
 --------
 >>> from sklearn.preprocessing import MaxAbsScaler
@@ -1367,8 +1368,7 @@ def maxabs_scale(X, *, axis=0, copy=True):
 and maintained during the data transformation.
 
 For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
+see: :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
 """
 # Unlike the scaler object, this function allows 1d input.

@@ -1411,11 +1411,13 @@ class RobustScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 set. Median and interquartile range are then stored to be used on
 later data using the :meth:`transform` method.
 
-Standardization of a dataset is a common requirement for many
-machine learning estimators. Typically this is done by removing the mean
-and scaling to unit variance. However, outliers can often influence the
-sample mean / variance in a negative way. In such cases, the median and
-the interquartile range often give better results.
+Standardization of a dataset is a common preprocessing step for many machine
+learning estimators. Typically this is done by removing the mean and
+scaling to unit variance. However, outliers can often influence the sample
+mean / variance in a negative way. In such cases, using the median and the
+interquartile range often gives better results. For an example visualization
+and comparison to other scalers, refer to :ref:`Compare RobustScaler with
+other scalers <plot_all_scaling_robust_scaler_section>`.
 
 .. versionadded:: 0.17
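Editor's note: a sketch of the rewritten paragraph — centering on the median and scaling by the interquartile range keeps the inliers on a usable scale even with an extreme value present. The data and the side-by-side comparison are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
print("robust:  ", RobustScaler().fit_transform(X).ravel().round(2))
print("standard:", StandardScaler().fit_transform(X).ravel().round(2))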
@@ -1486,9 +1488,6 @@ class RobustScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 
 Notes
 -----
-For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
 
 https://en.wikipedia.org/wiki/Median
 https://en.wikipedia.org/wiki/Interquartile_range
@@ -1751,8 +1750,7 @@ def robust_scale(
 To avoid memory copy the caller should pass a CSR matrix.
 
 For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
+see: :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
 
 .. warning:: Risk of data leak
@@ -1853,8 +1851,7 @@ def normalize(X, norm="l2", *, axis=1, copy=True, return_norm=False):
 Notes
 -----
 For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
+see: :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
 """
 if axis == 0:
     sparse_format = "csc"
@@ -1924,6 +1921,9 @@ class Normalizer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 of the vectors and is the base similarity metric for the Vector
 Space Model commonly used by the Information Retrieval community.
 
+For an example visualization, refer to :ref:`Compare Normalizer with other
+scalers <plot_all_scaling_normalizer_section>`.
+
 Read more in the :ref:`User Guide <preprocessing_normalization>`.
 
 Parameters
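Editor's note: unlike the scalers above, `Normalizer` works per sample (row), not per feature, which is why its `fit` is stateless. A minimal sketch with an assumed toy matrix:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0]])
print(Normalizer(norm="l2").fit_transform(X))  # each row rescaled to unit norm
# [[0.6 0.8]
#  [1.  0. ]]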
@@ -1962,10 +1962,6 @@ class Normalizer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 :meth:`transform`, as parameter validation is only performed in
 :meth:`fit`.
 
-For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
-
 Examples
 --------
 >>> from sklearn.preprocessing import Normalizer
@@ -2459,6 +2455,9 @@ class QuantileTransformer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 correlations between variables measured at the same scale but renders
 variables measured at different scales more directly comparable.
 
+For example visualizations, refer to :ref:`Compare QuantileTransformer with
+other scalers <plot_all_scaling_quantile_transformer_section>`.
+
 Read more in the :ref:`User Guide <preprocessing_transformer>`.
 
 .. versionadded:: 0.19
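Editor's note: a sketch of the non-linear mapping this docstring describes — values go through the empirical CDF, so ranks are preserved while the marginal distribution becomes uniform. Sample size, seed, and quantile count are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # heavily right-skewed input
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
Xt = qt.fit_transform(X)
print(Xt.min(), Xt.max())          # output spread over [0, 1]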
@@ -2536,10 +2535,6 @@ class QuantileTransformer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 NaNs are treated as missing values: disregarded in fit, and maintained in
 transform.
 
-For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
-
 Examples
 --------
 >>> import numpy as np
@@ -2988,8 +2983,7 @@ def quantile_transform(
 LogisticRegression())`.
 
 For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
+see: :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
 
 Examples
 --------
@@ -3033,6 +3027,12 @@ class PowerTransformer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 By default, zero-mean, unit-variance normalization is applied to the
 transformed data.
 
+For an example visualization, refer to :ref:`Compare PowerTransformer with
+other scalers <plot_all_scaling_power_transformer_section>`. To see the
+effect of Box-Cox and Yeo-Johnson transformations on different
+distributions, see:
+:ref:`sphx_glr_auto_examples_preprocessing_plot_map_data_to_normal.py`.
+
 Read more in the :ref:`User Guide <preprocessing_transformer>`.
 
 .. versionadded:: 0.20
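Editor's note: a sketch contrasting the two methods named in the new text — Box-Cox requires strictly positive input while Yeo-Johnson does not; both fit a per-feature lambda. The data is an illustrative assumption:

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # strictly positive, skewed
for method in ("box-cox", "yeo-johnson"):
    pt = PowerTransformer(method=method).fit(X)
    print(method, "fitted lambda:", pt.lambdas_.round(3))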
@@ -3080,10 +3080,6 @@ class PowerTransformer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
 NaNs are treated as missing values: disregarded in ``fit``, and maintained
 in ``transform``.
 
-For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
-
 References
 ----------
@@ -3500,8 +3496,7 @@ def power_transform(X, method="yeo-johnson", *, standardize=True, copy=True):
 in ``transform``.
 
 For a comparison of the different scalers, transformers, and normalizers,
-see :ref:`examples/preprocessing/plot_all_scaling.py
-<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
+see: :ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
 
 References
 ----------

sklearn/preprocessing/_discretization.py

Lines changed: 9 additions & 0 deletions
@@ -55,6 +55,9 @@ class KBinsDiscretizer(TransformerMixin, BaseEstimator):
 - 'kmeans': Values in each bin have the same nearest center of a 1D
   k-means cluster.
 
+For an example of the different strategies, see:
+:ref:`sphx_glr_auto_examples_preprocessing_plot_discretization_strategies.py`.
+
 dtype : {np.float32, np.float64}, default=None
     The desired data-type for the output. If None, output dtype is
     consistent with input dtype. Only np.float32 and np.float64 are
@@ -117,6 +120,12 @@ class KBinsDiscretizer(TransformerMixin, BaseEstimator):
 
 Notes
 -----
+
+For a visualization of discretization on different datasets, refer to
+:ref:`sphx_glr_auto_examples_preprocessing_plot_discretization_classification.py`.
+For the effect of discretization on linear models, see:
+:ref:`sphx_glr_auto_examples_preprocessing_plot_discretization.py`.
+
 In bin edges for feature ``i``, the first and last values are used only for
 ``inverse_transform``. During transform, bin edges are extended to::
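Editor's note: a sketch of the three binning strategies the docstring lists; the bin count and data are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3.0], [-1.0], [0.0], [0.5], [4.0]])
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, est.fit_transform(X).ravel())  # bin index per sample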

sklearn/preprocessing/_encoders.py

Lines changed: 4 additions & 0 deletions
@@ -463,6 +463,8 @@ class OneHotEncoder(_BaseEncoder):
 instead.
 
 Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
+For a comparison of different encoders, refer to:
+:ref:`sphx_glr_auto_examples_preprocessing_plot_target_encoder.py`.
 
 Parameters
 ----------
@@ -1243,6 +1245,8 @@ class OrdinalEncoder(OneToOneFeatureMixin, _BaseEncoder):
 a single column of integers (0 to n_categories - 1) per feature.
 
 Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
+For a comparison of different encoders, refer to:
+:ref:`sphx_glr_auto_examples_preprocessing_plot_target_encoder.py`.
 
 .. versionadded:: 0.20
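Editor's note: a sketch of the two encoders that now link to the comparison example; the toy categories are illustrative assumptions, and `sparse_output` assumes scikit-learn >= 1.2:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [["red"], ["green"], ["blue"], ["green"]]
print(OrdinalEncoder().fit_transform(X).ravel())             # one integer per category
print(OneHotEncoder(sparse_output=False).fit_transform(X))   # one column per category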

sklearn/preprocessing/_target_encoder.py

Lines changed: 6 additions & 1 deletion
@@ -23,7 +23,12 @@ class TargetEncoder(OneToOneFeatureMixin, _BaseEncoder):
 that are not seen during :meth:`fit` are encoded with the target mean, i.e.
 `target_mean_`.
 
-Read more in the :ref:`User Guide <target_encoder>`.
+For a demo of the importance of the `TargetEncoder` internal cross-fitting,
+see
+:ref:`sphx_glr_auto_examples_preprocessing_plot_target_encoder_cross_val.py`.
+For a comparison of different encoders, refer to
+:ref:`sphx_glr_auto_examples_preprocessing_plot_target_encoder.py`. Read
+more in the :ref:`User Guide <target_encoder>`.
 
 .. note::
     `fit(X, y).transform(X)` does not equal `fit_transform(X, y)` because a
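Editor's note: a sketch of the cross-fitting point made in both the new link and the note above — `fit_transform(X, y)` encodes each row with statistics from the other folds, so it differs from `fit(X, y).transform(X)`. Data, seed, and sizes are illustrative assumptions (requires scikit-learn >= 1.3 for `TargetEncoder`):

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.RandomState(0)
X = rng.choice(["a", "b", "c"], size=(100, 1))
y = rng.normal(size=100)

enc = TargetEncoder(random_state=0)
X_cross_fitted = enc.fit_transform(X, y)     # per-fold (cross-fitted) encodings
X_refit = enc.fit(X, y).transform(X)         # full-data encodings
print(np.allclose(X_cross_fitted, X_refit))  # False, by design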
