Commit 1b2a928

edrogers authored and jnothman committed
[MRG] Documenting char_wb padding functionality (Issue scikit-learn#8694) (scikit-learn#8803)
* Documenting char_wb padding functionality (Issue scikit-learn#8694)
* Small fix: change of wording.
* 's/passed with space/padded with space/g'
1 parent 719afba commit 1b2a928

1 file changed: +4 -3 lines changed
sklearn/feature_extraction/text.py

Lines changed: 4 additions & 3 deletions
@@ -159,7 +159,8 @@ def _char_wb_ngrams(self, text_document):
         """Whitespace sensitive char-n-gram tokenization.

         Tokenize text_document into a sequence of character n-grams
-        excluding any whitespace (operating only inside word boundaries)"""
+        operating only inside word boundaries. n-grams at the edges
+        of words are padded with space."""
         # normalize white spaces
         text_document = self._white_spaces.sub(" ", text_document)

@@ -354,7 +355,7 @@ class HashingVectorizer(BaseEstimator, VectorizerMixin):
     analyzer : string, {'word', 'char', 'char_wb'} or callable
         Whether the feature should be made of word or character n-grams.
         Option 'char_wb' creates character n-grams only from text inside
-        word boundaries.
+        word boundaries; n-grams at the edges of words are padded with space.

         If a callable is passed it is used to extract the sequence of features
         out of the raw, unprocessed input.
@@ -553,7 +554,7 @@ class CountVectorizer(BaseEstimator, VectorizerMixin):
     analyzer : string, {'word', 'char', 'char_wb'} or callable
         Whether the feature should be made of word or character n-grams.
         Option 'char_wb' creates character n-grams only from text inside
-        word boundaries.
+        word boundaries; n-grams at the edges of words are padded with space.

         If a callable is passed it is used to extract the sequence of features
         out of the raw, unprocessed input.
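
To illustrate the padding behaviour the updated docstrings describe, here is a minimal standalone sketch. It is not the code in sklearn/feature_extraction/text.py: the function name `char_wb_ngrams_sketch` is hypothetical, it handles a single n-gram size `n` rather than a range, and it assumes words are separated by whitespace.

```python
import re


def char_wb_ngrams_sketch(text, n):
    """Hypothetical helper: character n-grams of size n, built word by word,
    with each word padded by one space on either side."""
    # Collapse runs of whitespace, similar to the "normalize white spaces" step.
    text = re.sub(r"\s\s+", " ", text)
    ngrams = []
    for word in text.split():
        padded = " " + word + " "
        # Slide a window of width n over the padded word; a padded word
        # shorter than n yields the whole padded word exactly once.
        last_start = max(len(padded) - n, 0)
        for start in range(last_start + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams_sketch("hello world", 3))
```

In this sketch the n-grams at the edges of each word carry a leading or trailing space, and no n-gram spans two words, which is the behaviour the reworded docstrings are meant to convey.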
