Commit 0b6803f

Merge main
Signed-off-by: Adam Li <adam2392@gmail.com>
2 parents e2fee00 + 0d701e8

File tree

76 files changed (+2176 / -2183 lines)


README.rst

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@
 .. |Nightly wheels| image:: https://github.com/scikit-learn/scikit-learn/workflows/Wheel%20builder/badge.svg?event=schedule
 .. _`Nightly wheels`: https://github.com/scikit-learn/scikit-learn/actions?query=workflow%3A%22Wheel+builder%22+event%3Aschedule

-.. |PythonVersion| image:: https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10-blue
+.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/scikit-learn.svg
 .. _PythonVersion: https://pypi.org/project/scikit-learn/

 .. |PyPi| image:: https://img.shields.io/pypi/v/scikit-learn

doc/conftest.py

Lines changed: 35 additions & 1 deletion

@@ -3,12 +3,15 @@
 from os import environ
 from os.path import exists, join

+import pytest
+from _pytest.doctest import DoctestItem
+
 from sklearn.datasets import get_data_home
 from sklearn.datasets._base import _pkl_filepath
 from sklearn.datasets._twenty_newsgroups import CACHE_NAME
 from sklearn.utils import IS_PYPY
 from sklearn.utils._testing import SkipTest, check_skip_network
-from sklearn.utils.fixes import parse_version
+from sklearn.utils.fixes import np_base_version, parse_version


 def setup_labeled_faces():
@@ -172,3 +175,34 @@ def pytest_configure(config):
         matplotlib.use("agg")
     except ImportError:
         pass
+
+
+def pytest_collection_modifyitems(config, items):
+    """Called after collect is completed.
+
+    Parameters
+    ----------
+    config : pytest config
+    items : list of collected items
+    """
+    skip_doctests = False
+    if np_base_version >= parse_version("2"):
+        # Skip doctests when using numpy 2 for now. See the following discussion
+        # to decide what to do in the longer term:
+        # https://github.com/scikit-learn/scikit-learn/issues/27339
+        reason = "Due to NEP 51 numpy scalar repr has changed in numpy 2"
+        skip_doctests = True
+
+    # Normally doctest has the entire module's scope. Here we set globs to an empty dict
+    # to remove the module's scope:
+    # https://docs.python.org/3/library/doctest.html#what-s-the-execution-context
+    for item in items:
+        if isinstance(item, DoctestItem):
+            item.dtest.globs = {}
+
+    if skip_doctests:
+        skip_marker = pytest.mark.skip(reason=reason)
+
+        for item in items:
+            if isinstance(item, DoctestItem):
+                item.add_marker(skip_marker)
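For context, the hook above skips doctests on NumPy 2 because NEP 51 changed how NumPy scalars print. A minimal sketch of the repr difference the doctests trip over; this uses `packaging.version` directly as a stand-in for scikit-learn's `sklearn.utils.fixes` helpers, which is an assumption about their behavior, not the library code itself:

```python
import numpy as np
from packaging.version import parse as parse_version

# Rough stand-in for np_base_version from sklearn.utils.fixes: the installed
# numpy version with any pre-release suffix stripped ("2.0.0rc1" -> "2.0.0").
np_base_version = parse_version(parse_version(np.__version__).base_version)

x = np.float64(1.0)
if np_base_version >= parse_version("2"):
    # NEP 51: scalar reprs now include the type, which breaks doctests
    # written against the old plain-number output.
    assert repr(x) == "np.float64(1.0)"
else:
    assert repr(x) == "1.0"
```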

doc/glossary.rst

Lines changed: 1 addition & 1 deletion

@@ -1731,7 +1731,7 @@ functions or non-estimator constructors.
 For these models, the number of iterations, reported via
 ``len(estimators_)`` or ``n_iter_``, corresponds the total number of
 estimators/iterations learnt since the initialization of the model.
-Thus, if a model was already initialized with `N`` estimators, and `fit`
+Thus, if a model was already initialized with `N` estimators, and `fit`
 is called with ``n_estimators`` or ``max_iter`` set to `M`, the model
 will train `M - N` new estimators.
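The corrected glossary sentence describes `warm_start` semantics. A small sketch of that behavior with hypothetical values (N = 5, then M = 8), using the standard scikit-learn API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Initialize and fit with N = 5 estimators.
clf = GradientBoostingClassifier(n_estimators=5, warm_start=True, random_state=0)
clf.fit(X, y)
assert len(clf.estimators_) == 5

# Refit with n_estimators set to M = 8: only M - N = 3 new estimators
# are trained; the first 5 are kept as-is.
clf.set_params(n_estimators=8)
clf.fit(X, y)
assert len(clf.estimators_) == 8
```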

doc/modules/density.rst

Lines changed: 6 additions & 0 deletions

@@ -113,6 +113,10 @@ forms, which are shown in the following figure:

 .. centered:: |kde_kernels|

+|details-start|
+**kernels' mathematical expressions**
+|details-split|
+
 The form of these kernels is as follows:

 * Gaussian kernel (``kernel = 'gaussian'``)
@@ -139,6 +143,8 @@ The form of these kernels is as follows:

 :math:`K(x; h) \propto \cos(\frac{\pi x}{2h})` if :math:`x < h`

+|details-end|
+
 The kernel density estimator can be used with any of the valid distance
 metrics (see :class:`~sklearn.metrics.DistanceMetric` for a list of
 available metrics), though the results are properly normalized only
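The collapsible block above wraps the kernel formulas. A quick sketch of fitting `KernelDensity` with one of the listed kernels; the toy data and bandwidth are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Toy 1-D sample; 'gaussian' is one of the kernels listed in the section.
X = np.array([[0.0], [0.5], [1.0]])
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns log-density estimates at the query points.
log_density = kde.score_samples(np.array([[0.25], [2.0]]))
assert log_density.shape == (2,)
assert log_density[0] > log_density[1]  # 0.25 lies amid the data; 2.0 does not
```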

doc/modules/feature_extraction.rst

Lines changed: 35 additions & 9 deletions

@@ -206,8 +206,9 @@ Note the use of a generator comprehension,
 which introduces laziness into the feature extraction:
 tokens are only processed on demand from the hasher.

-Implementation details
-----------------------
+|details-start|
+**Implementation details**
+|details-split|

 :class:`FeatureHasher` uses the signed 32-bit variant of MurmurHash3.
 As a result (and because of limitations in ``scipy.sparse``),
@@ -223,16 +224,18 @@ Since a simple modulo is used to transform the hash function to a column index,
 it is advisable to use a power of two as the ``n_features`` parameter;
 otherwise the features will not be mapped evenly to the columns.

+.. topic:: References:
+
+  * `MurmurHash3 <https://github.com/aappleby/smhasher>`_.
+
+|details-end|

 .. topic:: References:

  * Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and
    Josh Attenberg (2009). `Feature hashing for large scale multitask learning
    <https://alex.smola.org/papers/2009/Weinbergeretal09.pdf>`_. Proc. ICML.

-* `MurmurHash3 <https://github.com/aappleby/smhasher>`_.
-
-
 .. _text_feature_extraction:

 Text feature extraction
@@ -395,8 +398,9 @@ last document::

 .. _stop_words:

-Using stop words
-................
+|details-start|
+**Using stop words**
+|details-split|

 Stop words are words like "and", "the", "him", which are presumed to be
 uninformative in representing the content of a text, and which may be
@@ -426,6 +430,9 @@ identify and warn about some kinds of inconsistencies.
 <https://aclweb.org/anthology/W18-2502>`__.
 In *Proc. Workshop for NLP Open Source Software*.

+
+|details-end|
+
 .. _tfidf:

 Tf–idf term weighting
@@ -490,6 +497,10 @@ class::
 Again please see the :ref:`reference documentation
 <text_feature_extraction_ref>` for the details on all the parameters.

+|details-start|
+**Numeric example of a tf-idf matrix**
+|details-split|
+
 Let's take an example with the following counts. The first term is present
 100% of the time hence not very interesting. The two other features only
 in less than 50% of the time hence probably more representative of the
@@ -609,6 +620,7 @@ feature extractor with a classifier:

 * :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`

+|details-end|

 Decoding text files
 -------------------
@@ -637,6 +649,10 @@ or ``"replace"``. See the documentation for the Python function
 ``bytes.decode`` for more details
 (type ``help(bytes.decode)`` at the Python prompt).

+|details-start|
+**Troubleshooting decoding text**
+|details-split|
+
 If you are having trouble decoding text, here are some things to try:

 - Find out what the actual encoding of the text is. The file might come
@@ -690,6 +706,7 @@ About Unicode <https://www.joelonsoftware.com/articles/Unicode.html>`_.

 .. _`ftfy`: https://github.com/LuminosoInsight/python-ftfy

+|details-end|

 Applications and examples
 -------------------------
@@ -870,8 +887,9 @@ The :class:`HashingVectorizer` also comes with the following limitations:
 model. A :class:`TfidfTransformer` can be appended to it in a pipeline if
 required.

-Performing out-of-core scaling with HashingVectorizer
-------------------------------------------------------
+|details-start|
+**Performing out-of-core scaling with HashingVectorizer**
+|details-split|

 An interesting development of using a :class:`HashingVectorizer` is the ability
 to perform `out-of-core`_ scaling. This means that we can learn from data that
@@ -890,6 +908,8 @@ time is often limited by the CPU time one wants to spend on the task.
 For a full-fledged example of out-of-core scaling in a text classification
 task see :ref:`sphx_glr_auto_examples_applications_plot_out_of_core_classification.py`.

+|details-end|
+
 Customizing the vectorizer classes
 ----------------------------------

@@ -928,6 +948,10 @@ parameters it is possible to derive from the class and override the
 ``build_preprocessor``, ``build_tokenizer`` and ``build_analyzer``
 factory methods instead of passing custom functions.

+|details-start|
+**Tips and tricks**
+|details-split|
+
 Some tips and tricks:

 * If documents are pre-tokenized by an external package, then store them in
@@ -982,6 +1006,8 @@ Some tips and tricks:
 Customizing the vectorizer can also be useful when handling Asian languages
 that do not use an explicit word separator such as whitespace.

+|details-end|
+
 .. _image_feature_extraction:

 Image feature extraction
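Among the text moved into collapsible blocks above is the advice to pick a power of two for `FeatureHasher`'s `n_features` so the hash-modulo spreads features evenly across columns. A minimal sketch of that advice with toy tokens and an assumed (illustrative) table size:

```python
from sklearn.feature_extraction import FeatureHasher

# Power of two for n_features, as the section advises, so the simple modulo
# used to map hashes to column indices distributes features evenly.
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform([["cat", "dog", "dog"], ["fish"]])

assert X.shape == (2, 1024)
assert X.getnnz() > 0  # sparse output with only the hashed tokens set
```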
