Commit 756c81e

Update README.md
1 parent b65494b commit 756c81e

1 file changed: 3 additions and 3 deletions


examples/amazon_reviews_dataset/README.md

Lines changed: 3 additions & 3 deletions
@@ -3,21 +3,21 @@
 We released pre-prepared version of the Amazon5core reviews dataset.
 Download it here: https://github.com/cgnorthcutt/label-errors/releases/tag/amazon-reviews-dataset
 
-From the Amazon 5core dataset (40+ million examples), select only the data that adheres to:
+From the Amazon 5core dataset (40+ million examples), we select only the data that adheres to:
 1. non-empty reviews.
 2. label must be 1 star, 3 stars, or 5 stars. (2 and 4 star reviews are removed)
 3. Only consider reviews with more than upvotes than downvotes (and at least one upvote).
 
 You should have about 10 million examples left-over. These are higher quality, which will allow us to have more control over noise in the labels (instead of just general noise in the text itself).
 
-Pre-process the data for reading by fast text. Here are the first two lines of my formatted training data file:
+The dataset has been formatted in [fastext format](https://fasttext.cc/docs/en/supervised-tutorial.html#getting-and-preparing-the-data) for you. Here are the first two lines of my formatted training data file:
 
 ```
 __label__5 I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!
 __label__4 This work bears deep connections to themes first explored in Tad Williams' original breakout novel, about a brave young cat who travels to an underground netherworld to face an ancient evil. As the owner of two cats myself, after the second read-through, I realized that this novel has much to teach about the critical importance of dealing with fur and dust.I could only give four stars, though, because the cats do not agree, and indeed wish I had not made this purchase.
 ```
 
-Pre-process the training data as follows:
+When training, we pre-process the training data as follows:
 
 ```bash
 cat amazon5core.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > amazon5core.preprocessed.txt
