Chapter 3: Contents of spam.df don't match output in book #35

@ChrisHowlin

In Chapter 3 we construct a spam filter based on the data in the folder:

ML_for_Hackers/03-Classification/data/spam

In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with html at the top:

head(spam.df[with(spam.df, order(-occurrence)),])

      term frequency     density occurrence
2122  html       377 0.005665595      0.338
538   body       324 0.004869105      0.298
4313 table      1182 0.017763217      0.284
1435 email       661 0.009933576      0.262
1736  font       867 0.013029365      0.262
1942  head       254 0.003817138      0.246

When I run the code directly, my output does not match; email is at the top instead:

        term frequency     density occurrence
7781   email       813 0.005853680      0.566
18809 please       425 0.003060042      0.508
14720   list       409 0.002944840      0.444
27309   will       828 0.005961681      0.422
3060    body       379 0.002728837      0.408
9457    free       539 0.003880853      0.390

This seems to be explained by how the document vectors are processed with the removePunctuation setting. Punctuation is removed without leaving a separator, so terms that were separated only by punctuation are fused into a new term. For example, <html><head> becomes htmlhead. The result is that instead of html being counted as a common term across many of the emails, we get many low-frequency combinations of html with other HTML tag names.
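
The effect can be reproduced in base R without the tm package. This is only a rough sketch: it assumes that removePunctuation simply deletes punctuation characters, which I approximate here with gsub. The sample message and the variable names are made up for illustration and are not taken from the data set.

```r
# Hypothetical snippet of an HTML email body (not from the spam data).
msg <- "<html><head><title>Free offer</title></head><body>Click here</body></html>"

# Assumption: removing punctuation deletes the characters outright,
# so adjacent tags fuse into one long token.
fused <- gsub("[[:punct:]]+", "", msg)
fused.terms <- strsplit(fused, "[[:space:]]+")[[1]]
print(fused.terms)

# Replacing punctuation with a space instead keeps the tag names apart.
spaced <- gsub("[[:punct:]]+", " ", msg)
spaced.terms <- strsplit(trimws(spaced), "[[:space:]]+")[[1]]
print(spaced.terms)

# html survives as its own term only in the second version.
print("html" %in% fused.terms)
print("html" %in% spaced.terms)
```

If this is what is happening, replacing punctuation with spaces before building the term document matrix might bring the counts closer to the book's table, though I have not verified that against the full data set.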
