Chapter 3: Contents of spam.df don't match output in book

In Chapter 3 we construct a spam filter based on the data in the folder:

`ML_for_Hackers/03-Classification/data/spam`

In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with **html** at the top:

`head(spam.df[with(spam.df, order(-occurrence)),])`

|  | term | frequency | density | occurrence |
| --- | --- | --- | --- | --- |
| 2122 | html | 377 | 0.005665595 | 0.338 |
| 538 | body | 324 | 0.004869105 | 0.298 |
| 4313 | table | 1182 | 0.017763217 | 0.284 |
| 1435 | email | 661 | 0.009933576 | 0.262 |
| 1736 | font | 867 | 0.013029365 | 0.262 |
| 1942 | head | 254 | 0.003817138 | 0.246 |

When I run the code directly, my output does not match the book; **email** is at the top instead:

|  | term | frequency | density | occurrence |
| --- | --- | --- | --- | --- |
| 7781 | email | 813 | 0.005853680 | 0.566 |
| 18809 | please | 425 | 0.003060042 | 0.508 |
| 14720 | list | 409 | 0.002944840 | 0.444 |
| 27309 | will | 828 | 0.005961681 | 0.422 |
| 3060 | body | 379 | 0.002728837 | 0.408 |
| 9457 | free | 539 | 0.003880853 | 0.390 |

This seems to be explained by how the document vectors are processed with the `removePunctuation` setting. Punctuation is removed without inserting a space, so terms that were separated only by punctuation are merged into a single new term. For example, **<html><head>** becomes **htmlhead**. As a result, instead of **html** being listed as a common term across many of the emails, we get lots of low-frequency combinations of **html** with other HTML tag keywords.
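
To illustrate the merging effect, here is a minimal base-R sketch. It is not the book's code: the sample string `raw` is made up, and the `gsub` call with `[[:punct:]]+` is only my assumption of roughly what `removePunctuation` does to a character string. The second half shows one possible workaround, replacing punctuation with a space before splitting into terms, which I have not checked against the book's numbers.

```r
# Hypothetical fragment of an HTML email body (not taken from the data set).
raw <- "<html><head><title>Free offer</title></head><body>"

# Assumed approximation of punctuation removal: delete punctuation outright.
stripped <- gsub("[[:punct:]]+", "", raw)
stripped_terms <- strsplit(trimws(stripped), "\\s+")[[1]]
print(stripped_terms)
# "htmlheadtitleFree"  "offertitleheadbody"

# Possible workaround: turn punctuation into a space so tags stay separate terms.
spaced <- gsub("[[:punct:]]+", " ", raw)
spaced_terms <- tolower(strsplit(trimws(spaced), "\\s+")[[1]])
print(spaced_terms)
# "html" "head" "title" "free" "offer" "title" "head" "body"

# "html" only survives as its own term in the second version.
print("html" %in% stripped_terms)  # FALSE
print("html" %in% spaced_terms)    # TRUE
```

If the same substitution were applied to the message text before building the term-document matrix, **html** would presumably be counted as its own term again, which might account for the difference from the book's table.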

