Replies: 7 comments 6 replies
-
For reference, below is my quick-hack script.
-
Actually, the dup sentences are a puzzle. I would expect them to be labeled correctly all the time or incorrectly all the time, but instead they seem to be inconsistent. This might point to bugs in my first analysis script (which spits out predictions per token), or my second (some R to count things). Or maybe the tagger has some state I don't understand? This is all quick hacking.
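A minimal sketch of the kind of consistency check I have in mind, assuming a per-token predictions file with hypothetical columns `sentence_id`, `token_index`, `sentence`, and `pred_label` (not necessarily what my script actually writes):

```python
# Check whether duplicated sentences receive consistent predictions.
# Assumes a per-token predictions csv with hypothetical columns:
# sentence_id, token_index, sentence, pred_label.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")

# Collapse each sentence occurrence into a single string of predicted labels.
per_occurrence = (
    tokens.sort_values(["sentence_id", "token_index"])
    .groupby(["sentence_id", "sentence"])["pred_label"]
    .agg(" ".join)
    .reset_index(name="pred_sequence")
)

# For sentences that appear more than once, count distinct prediction sequences.
consistency = (
    per_occurrence.groupby("sentence")["pred_sequence"]
    .agg(occurrences="size", distinct_predictions="nunique")
)
duplicates = consistency[consistency["occurrences"] > 1]

# Rows with more than one distinct prediction sequence are duplicated
# sentences that the tagger labelled differently on different occurrences.
inconsistent = duplicates[duplicates["distinct_predictions"] > 1]
print(inconsistent.sort_values("occurrences", ascending=False).head(20))
```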
-
Hi @boxydog, thanks for the kind words. I'm pleased that you've been able to dig into the data and make some sense of it. Any suggestions you have for improvements are definitely welcome.

It looks like you've found the major problem with the NYTimes data: the labelling of sentences is extremely inconsistent and poor. My guess is that it was originally done by a number of different people who didn't have clear guidelines for what was needed. A lot of the work I've put into this project has been cleaning up the first 30,000 sentences to make the labelling consistent. The original NYTimes dataset I used didn't have the PREP label, which is why nothing in that dataset after the 30,000th sentence has a PREP label, and why that snapshot of tokens in your first post is full of tokens that should have the PREP label.

There is also a subtle problem with the training data which affects the accuracy: due to the way the sentences are labelled in the csv files, if a token appears in a sentence more than once with different labels, there is no way to know which of the possible labels is correct.
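A toy illustration of that ambiguity (this is not the library's actual labelling code; the sentence and field values are made up):

```python
# Toy example: the csv stores the sentence plus the phrase belonging to each
# field, and token labels have to be recovered by matching tokens back to
# those phrases.
sentence = "2 cups flour, plus more flour for dusting"
fields = {
    "QTY": "2",
    "UNIT": "cups",
    "NAME": "flour",
    "COMMENT": "plus more flour for dusting",
}

tokens = sentence.replace(",", "").split()
for token in tokens:
    # Every field whose phrase contains this token is a candidate label.
    candidates = [label for label, phrase in fields.items() if token in phrase.split()]
    print(f"{token!r}: {candidates}")

# 'flour' matches both NAME and COMMENT, so from the csv alone there is no
# way to tell which occurrence of 'flour' should carry which label.
```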
-
I don't understand this. I thought recipes themselves could not be copyrighted, but I'm not a lawyer, and the Internet seems vague on the issue. So are you saying only the TASTEset could be added?
-
Another related question: if we spidered websites for recipes, could we use them on the theory that they cannot be copyrighted? Perhaps that's in a grey area you don't want to explore.
-
Note for myself: I've also thought about digging into the data used for labeling to see which cuisines are represented. I don't know what this would look like exactly, but presumably it's related to ingredients. Cuisine could represent another axis of dataset diversity.
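A very rough sketch of what that might look like: count a few hand-picked, cuisine-suggestive ingredient words among the tokens labelled as ingredient names. The keyword lists, the column names, and even the use of a NAME label are all guesses here, not anything from the library.

```python
# Crude cuisine check: count cuisine-suggestive ingredient words among the
# tokens labelled NAME. Keyword lists and column names are hypothetical.
import pandas as pd

CUISINE_HINTS = {
    "mexican": {"tortilla", "cilantro", "jalapeno", "tomatillo"},
    "italian": {"parmesan", "basil", "prosciutto", "orzo"},
    "east_asian": {"soy", "miso", "gochujang", "nori"},
    "south_asian": {"turmeric", "ghee", "paneer", "cardamom"},
}

tokens = pd.read_csv("token_predictions.csv")
names = tokens.loc[tokens["true_label"] == "NAME", "token"].str.lower()

counts = {cuisine: int(names.isin(hints).sum()) for cuisine, hints in CUISINE_HINTS.items()}
print(counts)
```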



-
Thanks for your work on this library. I don't have a particular app in mind yet, but for fun I've been looking around at various ingredient parsing libraries, and this seems like the most thoughtfully engineered to me.
So, also for fun, I kicked the tires a bit to try to figure out how accurate I think it is, and if I can characterize where it might be less accurate. I would be interested in your thoughts.
At a high level, my tentative findings so far:
I'm not going to put my evidence for all of these claims in this post; it would take too long. I'll put my evidence for the first claim below, just to start the conversation, and tease the rest.
Below is my re-run of training, which shows 97% token accuracy and 92% sentence accuracy:
Below is the output of a separate simplistic script I wrote to measure token and sentence accuracy on the datasets, using the same sort of "first # lines" behavior as the training. Since I don't know which rows went into the training set versus the test set, it's computed on the whole set (training + test):
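A stripped-down sketch of that computation, assuming a per-token predictions csv with hypothetical columns `sentence_id`, `true_label`, and `pred_label` (this is the shape of the script, not the exact code):

```python
# Token accuracy: fraction of tokens whose predicted label matches the true label.
# Sentence accuracy: fraction of sentences where every token is predicted correctly.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")
tokens["correct"] = tokens["pred_label"] == tokens["true_label"]

token_accuracy = tokens["correct"].mean()
sentence_accuracy = tokens.groupby("sentence_id")["correct"].all().mean()

print(f"token accuracy:    {token_accuracy:.1%}")
print(f"sentence accuracy: {sentence_accuracy:.1%}")
```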
This seems reasonable: the predictor ought to be pretty good on the training set, so these results on training + test (98% token accuracy, 93% sentence accuracy) are similar-ish to the training output.
Below is the result of running the same script on the whole dataset (i.e. using the model trained on the first 30,000 lines of each file, but predicting on the entire files):
You can see a big drop: 80% token accuracy, 58% sentence accuracy.
I'm not sure exactly why the rest of the nyt file is less accurate. Perhaps that file is not randomly ordered, i.e. there are different types of entries in different regions of the file.
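One way to check that hunch would be to compare the true-label distribution in the first 30,000 nyt sentences against the rest. A sketch, again with hypothetical column names (`source`, `sentence_id`, `true_label`) and assuming `sentence_id` reflects row order in the file:

```python
# Compare the label distribution of the nyt rows used for training (first
# 30,000) against the remainder of the file.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")
nyt = tokens[tokens["source"] == "nyt"]

in_training = nyt["sentence_id"] <= 30000
dist = pd.DataFrame({
    "first_30k": nyt.loc[in_training, "true_label"].value_counts(normalize=True),
    "remainder": nyt.loc[~in_training, "true_label"].value_counts(normalize=True),
}).fillna(0)

# Labels that are common in one half but rare or absent in the other would
# support the idea that the file is not homogeneous.
print(dist.sort_values("remainder", ascending=False))
```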
Another possibility is that the test set leaks into training. In a cursory review of the code, I see no evidence of this.
Below are a few quick observations, perhaps without enough context or explanation, as a teaser for future conversation.
A table of accuracy by label:
Broken down per source, it looks like nyt is the trouble spot (because it has data that isn't in the training set?):
A snapshot of the top 20 tokens with <= 50% accuracy, sorted descending by number of occurrences:
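Roughly how tables like the above could be produced from a per-token predictions csv (hypothetical columns `source`, `token`, `true_label`, `pred_label`); this is a sketch, not the exact script I used:

```python
# Per-label accuracy, per-source accuracy, and frequently-wrong tokens.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")
tokens["correct"] = tokens["pred_label"] == tokens["true_label"]

# Accuracy by true label.
by_label = tokens.groupby("true_label")["correct"].agg(["mean", "size"])
print(by_label.sort_values("mean"))

# Accuracy by source dataset.
by_source = tokens.groupby("source")["correct"].agg(["mean", "size"])
print(by_source)

# Tokens that are wrong at least half the time, most frequent first.
by_token = tokens.groupby("token")["correct"].agg(["mean", "size"])
trouble = by_token[by_token["mean"] <= 0.5]
print(trouble.sort_values("size", ascending=False).head(20))
```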
A snapshot of 20 examples of sentences containing tokens with <= 50% accuracy where the token is wrong, and not labeled COMMENT or OTHER:
Below is a snapshot of common dup sentences. Better get salt and pepper and olive oil right!
=====
I'll stop here.
Thoughts?