
Commit 605b261

piroor authored and Ch4s3 committed
Separate tokenizer from hasher (#162)
* Separate whitespace tokenizer from hasher
* Separate stopword filter from hasher
* Run tests in deep directories
* Separate stemmer from hasher
* Separate tests for stopword and tokenizer from hasher's one
* Reintroduce method to get hash from clean words
* Fix usage of Stopword filter
* Add tests for Tokenizer::Token
* Add test for TokenFilter::Stemmer
* Remove needless conversion
* Unite stemmer and stopword filter to whitespace tokenizer
* Fix indent
* Insert seaparator blank lines between meaningful blocks
* Revert "Insert seaparator blank lines between meaningful blocks"
  This reverts commit 07cf360. Rollback.
* Revert "Fix indent"
  This reverts commit 07e6807. Rollback.
* Revert "Unite stemmer and stopword filter to whitespace tokenizer"
  This reverts commit f256337. They should be used separatedly.
* Fix indent
* Use meaningful variable name
* Describe new modules and classes
* Give tokenizer and token filters from outside of hasher
* Uniform coding style
* Apply enable_stemmer option correctly
* Fix invalid URI
* Don't give needless parameters
* Load required modules
* Define default token filters for hasher
* Fix path to modules
* Add description how to use custom tokenizer
* Define token filter to remove symbol only tokens
* Fix path to required module
* Remove needless parameter
* Use langauge option only for stopwords filter
* Add test for TokenFilter::Symbol
* Remove needless "s"
* Add how to use custom token filters
* Reject cat token based on regexp
* Add tests to custom tokenizer and token filters
* Fix usage of custom tokenizer
* Add note for custom tokenizer
* Describe spec of custom tokenizer at first
* Accept lambda as custom token filter and tokenizer
* Fix mismatched descriptions about method
* Add more tests for custom tokenizer and filters
1 parent 2db156b commit 605b261

18 files changed: +474 -90 lines changed


Rakefile

Lines changed: 2 additions & 2 deletions
@@ -21,15 +21,15 @@ task default: [:test]
 desc 'Run all unit tests'
 Rake::TestTask.new(:test) do |t|
   t.libs << 'lib'
-  t.pattern = 'test/*/*_test.rb'
+  t.pattern = 'test/**/*_test.rb'
   t.verbose = true
 end
 
 # Run benchmarks
 desc 'Run all benchmarks'
 Rake::TestTask.new(:bench) do |t|
   t.libs << 'lib'
-  t.pattern = 'test/*/*_benchmark.rb'
+  t.pattern = 'test/**/*_benchmark.rb'
   t.verbose = true
 end

docs/bayes.md

Lines changed: 71 additions & 0 deletions
@@ -135,6 +135,77 @@ classifier.train("Cat", "I can has cat")
 classifier.train("Dog", "I don't always bark at night")
 ```

+## Custom Tokenizer
+
+By default the classifier tokenizes given inputs as white-space separated terms.
+If you want to use a different tokenizer, pass it via the `:tokenizer` option.
+The tokenizer must be an object with a method named `call`, or a lambda.
+The `call` method must return tokens as instances of `ClassifierReborn::Tokenizer::Token`.
+
+```ruby
+require 'classifier-reborn'
+
+module BigramTokenizer
+  module_function
+
+  def call(str)
+    str.each_char
+       .each_cons(2)
+       .map { |chars| ClassifierReborn::Tokenizer::Token.new(chars.join) }
+  end
+end
+
+classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer
+```
+
+or
+
+```ruby
+require 'classifier-reborn'
+
+bigram_tokenizer = lambda do |str|
+  str.each_char
+     .each_cons(2)
+     .map { |chars| ClassifierReborn::Tokenizer::Token.new(chars.join) }
+end
+
+classifier = ClassifierReborn::Bayes.new tokenizer: bigram_tokenizer
+```
+
+## Custom Token Filters
+
+By default the classifier rejects stopwords from the tokens.
+This behavior is implemented as a filter applied to the token list.
+If you want to use more token filters, pass them via the `:token_filters` option.
+Each filter must be an object with a method named `call`, or a lambda.
+
+```ruby
+require 'classifier-reborn'
+
+module CatFilter
+  module_function
+
+  def call(tokens)
+    tokens.reject do |token|
+      /cat/i === token
+    end
+  end
+end
+
+white_filter = lambda do |tokens|
+  tokens.reject do |token|
+    /white/i === token
+  end
+end
+
+filters = [
+  CatFilter,
+  white_filter,
+  # If you want to reject stopwords too, you need to include the stopword
+  # filter in the list of token filters manually.
+  ClassifierReborn::TokenFilter::Stopword,
+]
+classifier = ClassifierReborn::Bayes.new token_filters: filters
+```
+
 ## Custom Stopwords
 
 The library ships with stopword files in various languages.

lib/classifier-reborn/bayes.rb

Lines changed: 18 additions & 4 deletions
@@ -4,6 +4,9 @@
 
 require 'set'
 
+require_relative 'extensions/tokenizer/whitespace'
+require_relative 'extensions/token_filter/stopword'
+require_relative 'extensions/token_filter/stemmer'
 require_relative 'category_namer'
 require_relative 'backends/bayes_memory_backend'
 require_relative 'backends/bayes_redis_backend'
@@ -50,6 +53,14 @@ def initialize(*args)
       @threshold = options[:threshold]
       @enable_stemmer = options[:enable_stemmer]
       @backend = options[:backend]
+      @tokenizer = options[:tokenizer] || Tokenizer::Whitespace
+      @token_filters = options[:token_filters] || [TokenFilter::Stopword]
+      if @enable_stemmer && !@token_filters.include?(TokenFilter::Stemmer)
+        @token_filters << TokenFilter::Stemmer
+      end
+      if @token_filters.include?(TokenFilter::Stopword)
+        TokenFilter::Stopword.language = @language
+      end
 
       populate_initial_categories
 
@@ -65,7 +76,8 @@ def initialize(*args)
     #     b.train "that", "That text"
     #     b.train "The other", "The other text"
     def train(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)
 
@@ -95,7 +107,8 @@ def train(category, text)
     #     b.train :this, "This text"
     #     b.untrain :this, "This text"
     def untrain(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)
       word_hash.each do |word, count|
@@ -120,7 +133,8 @@ def untrain(category, text)
     # The largest of these scores (the one closest to 0) is the one picked out by #classify
     def classifications(text)
       score = {}
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       if word_hash.empty?
         category_keys.each do |category|
           score[category.to_s] = Float::INFINITY
@@ -266,7 +280,7 @@ def custom_stopwords(stopwords)
          return # Do not overwrite the default
        end
      end
-      Hasher::STOPWORDS[@language] = Set.new stopwords
+      TokenFilter::Stopword::STOPWORDS[@language] = Set.new stopwords
    end
  end
end
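
A minimal usage sketch of the new constructor options, mirroring the defaults chosen above; the category names, training text, and classification result are illustrative only:

```ruby
require 'classifier-reborn'

# Passing the new options explicitly; these values match the defaults in bayes.rb,
# so omitting them gives the same behavior.
classifier = ClassifierReborn::Bayes.new(
  'Interesting', 'Uninteresting',
  tokenizer:     ClassifierReborn::Tokenizer::Whitespace,
  token_filters: [ClassifierReborn::TokenFilter::Stopword]
)
classifier.train('Interesting', 'Here are some interesting things to read')
classifier.train('Uninteresting', 'The sky is blue and water is wet')
classifier.classify('Reading is interesting')  # => "Interesting" (illustrative)
```
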

lib/classifier-reborn/extensions/hasher.rb

Lines changed: 18 additions & 44 deletions
@@ -5,63 +5,37 @@
 
 require 'set'
 
+require_relative 'tokenizer/whitespace'
+require_relative 'token_filter/stopword'
+require_relative 'token_filter/stemmer'
+
 module ClassifierReborn
   module Hasher
-    STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]
-
     module_function
 
     # Return a Hash of strings => ints. Each word in the string is stemmed,
     # interned, and indexes to its frequency in the document.
-    def word_hash(str, language = 'en', enable_stemmer = true)
-      cleaned_word_hash = clean_word_hash(str, language, enable_stemmer)
-      symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/))
-      cleaned_word_hash.merge(symbol_hash)
-    end
-
-    # Return a word hash without extra punctuation or short symbols, just stemmed words
-    def clean_word_hash(str, language = 'en', enable_stemmer = true)
-      word_hash_for_words(str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer)
-    end
-
-    def word_hash_for_words(words, language = 'en', enable_stemmer = true)
-      d = Hash.new(0)
-      words.each do |word|
-        next unless word.length > 2 && !STOPWORDS[language].include?(word)
-        if enable_stemmer
-          d[word.stem.intern] += 1
-        else
-          d[word.intern] += 1
+    def word_hash(str, enable_stemmer = true,
+                  tokenizer: Tokenizer::Whitespace,
+                  token_filters: [TokenFilter::Stopword])
+      if token_filters.include?(TokenFilter::Stemmer)
+        unless enable_stemmer
+          token_filters.reject! do |token_filter|
+            token_filter == TokenFilter::Stemmer
+          end
         end
+      else
+        token_filters << TokenFilter::Stemmer if enable_stemmer
+      end
+      words = tokenizer.call(str)
+      token_filters.each do |token_filter|
+        words = token_filter.call(words)
      end
-      d
-    end
-
-    # Add custom path to a new stopword file created by user
-    def add_custom_stopword_path(path)
-      STOPWORDS_PATH.unshift(path)
-    end
-
-    def word_hash_for_symbols(words)
       d = Hash.new(0)
       words.each do |word|
         d[word.intern] += 1
       end
       d
     end
-
-    # Create a lazily-loaded hash of stopword data
-    STOPWORDS = Hash.new do |hash, language|
-      hash[language] = []
-
-      STOPWORDS_PATH.each do |path|
-        if File.exist?(File.join(path, language))
-          hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
-          break
-        end
-      end
-
-      hash[language]
-    end
  end
end
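
For reference, a minimal sketch of calling the reworked `Hasher.word_hash` directly with the new keyword arguments; the input string and resulting counts are illustrative:

```ruby
require 'classifier-reborn'

# Tokenize with the default whitespace tokenizer, drop stopwords, stem (enable_stemmer
# is true), then intern and count the surviving tokens.
ClassifierReborn::Hasher.word_hash(
  'Hello, hello world!',
  true,                                                   # enable_stemmer
  tokenizer:     ClassifierReborn::Tokenizer::Whitespace,
  token_filters: [ClassifierReborn::TokenFilter::Stopword]
)
# => roughly { hello: 2, world: 1, ",": 1, "!": 1 }
```
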
lib/classifier-reborn/extensions/token_filter/stemmer.rb

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
+# encoding: utf-8
+# Author:: Lucas Carlson (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License:: LGPL
+
+module ClassifierReborn
+  module TokenFilter
+    # This filter converts given tokens to their stemmed versions.
+    module Stemmer
+      module_function
+
+      def call(tokens)
+        tokens.collect do |token|
+          if token.stemmable?
+            token.stem
+          else
+            token
+          end
+        end
+      end
+    end
+  end
+end
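
A minimal sketch of the stemmer filter in isolation; the stem shown assumes the Porter stemming provided by the library's fast-stemmer dependency:

```ruby
require 'classifier-reborn'

tokens = [
  ClassifierReborn::Tokenizer::Token.new('running'),
  ClassifierReborn::Tokenizer::Token.new('!', stemmable: false, maybe_stopword: false),
]
ClassifierReborn::TokenFilter::Stemmer.call(tokens)
# => ["run", "!"]  -- only the stemmable token is replaced by its stem
```
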
lib/classifier-reborn/extensions/token_filter/stopword.rb

Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
+# encoding: utf-8
+# Author:: Lucas Carlson (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License:: LGPL
+
+module ClassifierReborn
+  module TokenFilter
+    # This filter removes stopwords in the configured language from given tokens.
+    module Stopword
+      STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')]
+      @language = 'en'
+
+      module_function
+
+      def call(tokens)
+        tokens.reject do |token|
+          token.maybe_stopword? &&
+            (token.length <= 2 || STOPWORDS[@language].include?(token))
+        end
+      end
+
+      # Add custom path to a new stopword file created by user
+      def add_custom_stopword_path(path)
+        STOPWORDS_PATH.unshift(path)
+      end
+
+      # Create a lazily-loaded hash of stopword data
+      STOPWORDS = Hash.new do |hash, language|
+        hash[language] = []
+
+        STOPWORDS_PATH.each do |path|
+          if File.exist?(File.join(path, language))
+            hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
+            break
+          end
+        end
+
+        hash[language]
+      end
+
+      # Changes the language of stopwords
+      def language=(language)
+        @language = language
+      end
+    end
+  end
+end
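
A minimal sketch of the stopword filter on its own; the token values are illustrative:

```ruby
require 'classifier-reborn'

ClassifierReborn::TokenFilter::Stopword.language = 'en'

tokens = [
  ClassifierReborn::Tokenizer::Token.new('the'),
  ClassifierReborn::Tokenizer::Token.new('classifier'),
  ClassifierReborn::Tokenizer::Token.new('!', stemmable: false, maybe_stopword: false),
]
ClassifierReborn::TokenFilter::Stopword.call(tokens)
# => ["classifier", "!"]
# "the" is dropped as an English stopword; the "!" token survives because it was
# created with maybe_stopword: false.
```
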
lib/classifier-reborn/extensions/token_filter/symbol.rb

Lines changed: 19 additions & 0 deletions

@@ -0,0 +1,19 @@
+# encoding: utf-8
+# Author:: Lucas Carlson (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License:: LGPL
+
+module ClassifierReborn
+  module TokenFilter
+    # This filter removes symbol-only terms from given tokens.
+    module Symbol
+      module_function
+
+      def call(tokens)
+        tokens.reject do |token|
+          /[^\s\p{WORD}]/ === token
+        end
+      end
+    end
+  end
+end
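
A minimal sketch of the symbol filter, which is not part of the default filter chain but can be added via `:token_filters`:

```ruby
require 'classifier-reborn'

tokens = [
  ClassifierReborn::Tokenizer::Token.new('word'),
  ClassifierReborn::Tokenizer::Token.new('!', stemmable: false, maybe_stopword: false),
]
ClassifierReborn::TokenFilter::Symbol.call(tokens)
# => ["word"]  -- the "!" token matches /[^\s\p{WORD}]/ and is rejected
```
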
lib/classifier-reborn/extensions/tokenizer/token.rb

Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
+# encoding: utf-8
+# Author:: Lucas Carlson (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License:: LGPL
+
+module ClassifierReborn
+  module Tokenizer
+    class Token < String
+      # The class can be created with one token string and extra attributes, e.g.:
+      #   t = ClassifierReborn::Tokenizer::Token.new 'Tokenize', stemmable: true, maybe_stopword: false
+      #
+      # Available attributes are:
+      #   stemmable: true       Whether the token can be stemmed. Set this to false for un-stemmable terms; otherwise it should be true.
+      #   maybe_stopword: true  Whether the token may be a stopword. Set this to false for terms that can never be stopwords; otherwise it should be true.
+      def initialize(string, stemmable: true, maybe_stopword: true)
+        super(string)
+        @stemmable = stemmable
+        @maybe_stopword = maybe_stopword
+      end
+
+      def stemmable?
+        @stemmable
+      end
+
+      def maybe_stopword?
+        @maybe_stopword
+      end
+
+      def stem
+        stemmed = super
+        self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword)
+      end
+    end
+  end
+end
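
A minimal sketch of how a `Token` carries its flags through stemming; the stem shown assumes fast-stemmer's Porter behavior:

```ruby
require 'classifier-reborn'

word   = ClassifierReborn::Tokenizer::Token.new('running')   # stemmable, maybe a stopword
symbol = ClassifierReborn::Tokenizer::Token.new('!', stemmable: false, maybe_stopword: false)

word.stemmable?         # => true
word.stem               # => "run", returned as a new Token with the same flags
symbol.maybe_stopword?  # => false
word == 'running'       # => true; a Token still compares like a plain String
```
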
lib/classifier-reborn/extensions/tokenizer/whitespace.rb

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+# encoding: utf-8
+# Author:: Lucas Carlson (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License:: LGPL
+
+require_relative 'token'
+
+module ClassifierReborn
+  module Tokenizer
+    # This tokenizes given input as white-space separated terms.
+    # It mainly aims to tokenize sentences written with a space between words, like English, French, and others.
+    module Whitespace
+      module_function
+
+      def call(str)
+        tokens = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.collect do |word|
+          Token.new(word, stemmable: true, maybe_stopword: true)
+        end
+        symbol_tokens = str.scan(/[^\s\p{WORD}]/).collect do |word|
+          Token.new(word, stemmable: false, maybe_stopword: false)
+        end
+        tokens += symbol_tokens
+        tokens
+      end
+    end
+  end
+end
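
A minimal sketch of the whitespace tokenizer's output; the input sentence is illustrative:

```ruby
require 'classifier-reborn'

ClassifierReborn::Tokenizer::Whitespace.call('Hello, world!')
# => ["hello", "world", ",", "!"]
# Word tokens come first (downcased, stemmable, maybe_stopword); the trailing
# symbol-only tokens are flagged as neither stemmable nor possible stopwords.
```
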
