In Chapter 3 we construct a spam filter based on the data in the folder:
ML_for_Hackers/03-Classification/data/spam
In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with html at the top:
head(spam.df[with(spam.df, order(-occurrence)),])
| | term | frequency | density | occurrence |
|---|---|---|---|---|
| 2122 | html | 377 | 0.005665595 | 0.338 |
| 538 | body | 324 | 0.004869105 | 0.298 |
| 4313 | table | 1182 | 0.017763217 | 0.284 |
| 1435 | email | 661 | 0.009933576 | 0.262 |
| 1736 | font | 867 | 0.013029365 | 0.262 |
| 1942 | head | 254 | 0.003817138 | 0.246 |
When I run the code directly, my output does not match the book. It has email at the top:
| | term | frequency | density | occurrence |
|---|---|---|---|---|
| 7781 | email | 813 | 0.005853680 | 0.566 |
| 18809 | please | 425 | 0.003060042 | 0.508 |
| 14720 | list | 409 | 0.002944840 | 0.444 |
| 27309 | will | 828 | 0.005961681 | 0.422 |
| 3060 | body | 379 | 0.002728837 | 0.408 |
| 9457 | free | 539 | 0.003880853 | 0.390 |
This seems to be explained by how the document vectors are processed with the `removePunctuation` setting. When punctuation is removed, terms that were separated only by punctuation are merged into a single new term. For example, `<html><head>` becomes `htmlhead`. The result is that instead of html being listed as a common term in many of the emails, we get lots of low-frequency combinations of html with other HTML tag keywords.
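
To illustrate the merging effect, here is a minimal sketch in base R. It does not call the tm package: `strip_punct` is a hypothetical stand-in that I assume mirrors the default `removePunctuation` behaviour of deleting punctuation outright, and `space_punct` and `tokenize` are likewise made-up helpers. Replacing punctuation with a space is only one possible workaround, not necessarily what the book's authors did.

```r
# Stand-in for tm's removePunctuation (assumption: punctuation is deleted
# outright, with nothing inserted in its place).
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)

# Hypothetical alternative: replace each run of punctuation with a space,
# so that neighbouring tokens stay separate.
space_punct <- function(x) gsub("[[:punct:]]+", " ", x)

# Split a string on whitespace and drop empty tokens.
tokenize <- function(x) {
  tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]
}

raw <- "<html><head>"

merged_terms <- tokenize(strip_punct(raw))
split_terms  <- tokenize(space_punct(raw))

print(merged_terms)  # "htmlhead"
print(split_terms)   # "html" "head"
```

With the first approach html never appears as its own term, which would be consistent with it dropping out of the top of the occurrence ranking.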