Replies: 7 comments 6 replies
-
For reference, below is my quick-hack script.
-
Actually, the dup sentences are a puzzle. I would expect them to be labeled correctly all the time or incorrectly all the time, but instead they seem to be inconsistent. This might point to bugs in my first analysis script (which spits out predictions per token), or my second (some R to count things). Or maybe the tagger has some state I don't understand? This is all quick hacking.
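A minimal sketch of the kind of consistency check I have in mind, assuming a per-token predictions file with hypothetical columns `sentence_id`, `token_index`, `sentence`, and `pred_label` (not necessarily what my script actually writes):

```python
# Check whether duplicated sentences receive consistent predictions.
# Assumes a per-token predictions csv with hypothetical columns:
# sentence_id, token_index, sentence, pred_label.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")

# Collapse each sentence occurrence into a single string of predicted labels.
per_occurrence = (
    tokens.sort_values(["sentence_id", "token_index"])
    .groupby(["sentence_id", "sentence"])["pred_label"]
    .agg(" ".join)
    .reset_index(name="pred_sequence")
)

# For sentences that appear more than once, count distinct prediction sequences.
consistency = (
    per_occurrence.groupby("sentence")["pred_sequence"]
    .agg(occurrences="size", distinct_predictions="nunique")
)
duplicates = consistency[consistency["occurrences"] > 1]

# Rows with more than one distinct prediction sequence are duplicated
# sentences that the tagger labelled differently on different occurrences.
inconsistent = duplicates[duplicates["distinct_predictions"] > 1]
print(inconsistent.sort_values("occurrences", ascending=False).head(20))
```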
-
Hi @boxydog, thanks for the kind words. I'm pleased that you've been able to dig into the data and make some sense of it. Any suggestions you have for improvements are definitely welcome.

It looks like you've found the major problem with the NYTimes data: the labelling of sentences is extremely inconsistent and poor. My guess is that it was originally done by a number of different people who didn't have clear guidelines for what was needed. A lot of the work I've put into this project has been cleaning up the first 30,000 sentences to make the labelling consistent. The original NYTimes dataset I used didn't have the PREP label, which is why nothing in that dataset after the 30,000th sentence has a PREP label, and why that snapshot of tokens in your first post is full of tokens that should have the PREP label.

There is also a subtle problem with the training data which affects the accuracy: due to the way the sentences are labelled in the csv files, if a token appears in a sentence more than once with different labels, there is no way to know which of the possible labels is correct.
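A toy illustration of that ambiguity (this is not the library's actual labelling code; the sentence and field values are made up):

```python
# Toy example: the csv stores the sentence plus the phrase belonging to each
# field, and token labels have to be recovered by matching tokens back to
# those phrases.
sentence = "2 cups flour, plus more flour for dusting"
fields = {
    "QTY": "2",
    "UNIT": "cups",
    "NAME": "flour",
    "COMMENT": "plus more flour for dusting",
}

tokens = sentence.replace(",", "").split()
for token in tokens:
    # Every field whose phrase contains this token is a candidate label.
    candidates = [label for label, phrase in fields.items() if token in phrase.split()]
    print(f"{token!r}: {candidates}")

# 'flour' matches both NAME and COMMENT, so from the csv alone there is no
# way to tell which occurrence of 'flour' should carry which label.
```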
-
I don't understand this. I thought recipes themselves could not be copyrighted, but I'm not a lawyer, and the Internet seems vague on the issue. So are you saying only the TASTEset could be added?
-
Another related question: if we spidered websites for recipes, could we use them on the theory that they cannot be copyrighted? Perhaps that's in a grey area you don't want to explore.
-
Note for myself: I've also thought about digging into the data used for labeling to see which cuisines are represented. I don't know what this would look like exactly, but presumably it's related to ingredients. Cuisine could represent another axis of dataset diversity.
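A very rough sketch of what that might look like: count a few hand-picked, cuisine-suggestive ingredient words among the tokens labelled as ingredient names. The keyword lists, the column names, and even the use of a NAME label are all guesses here, not anything from the library.

```python
# Crude cuisine check: count cuisine-suggestive ingredient words among the
# tokens labelled NAME. Keyword lists and column names are hypothetical.
import pandas as pd

CUISINE_HINTS = {
    "mexican": {"tortilla", "cilantro", "jalapeno", "tomatillo"},
    "italian": {"parmesan", "basil", "prosciutto", "orzo"},
    "east_asian": {"soy", "miso", "gochujang", "nori"},
    "south_asian": {"turmeric", "ghee", "paneer", "cardamom"},
}

tokens = pd.read_csv("token_predictions.csv")
names = tokens.loc[tokens["true_label"] == "NAME", "token"].str.lower()

counts = {cuisine: int(names.isin(hints).sum()) for cuisine, hints in CUISINE_HINTS.items()}
print(counts)
```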



-
Thanks for your work on this library. I don't have a particular app in mind yet, but for fun I've been looking around at various ingredient parsing libraries, and this seems like the most thoughtfully engineered to me.
So, also for fun, I kicked the tires a bit to try to figure out how accurate I think it is, and if I can characterize where it might be less accurate. I would be interested in your thoughts.
At a high level, my tentative findings so far:
I'm not going to put my evidence for all of these claims in this post; it would take too long. I'll put my evidence for the first claim below, just to start the conversation, and tease the rest.
Below is my re-run of training, which shows 97% token accuracy and 92% sentence accuracy:
Below is the output of a separate simplistic script I wrote to measure token and sentence accuracy on the datasets, using the same sort of "first # lines" behavior as the training. Since I don't know which rows went into the training set versus the test set, it's computed on the whole set (training + test):
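A stripped-down sketch of that computation, assuming a per-token predictions csv with hypothetical columns `sentence_id`, `true_label`, and `pred_label` (this is the shape of the script, not the exact code):

```python
# Token accuracy: fraction of tokens whose predicted label matches the true label.
# Sentence accuracy: fraction of sentences where every token is predicted correctly.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")
tokens["correct"] = tokens["pred_label"] == tokens["true_label"]

token_accuracy = tokens["correct"].mean()
sentence_accuracy = tokens.groupby("sentence_id")["correct"].all().mean()

print(f"token accuracy:    {token_accuracy:.1%}")
print(f"sentence accuracy: {sentence_accuracy:.1%}")
```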
This seems reasonable: the predictor ought to be pretty good on the training set, so these results on training + test (98% token accuracy, 93% sentence accuracy) are similar-ish to the training output.
Below is the result of running the same script on the whole dataset (i.e. using the model trained on the first 30,000 lines of each file, but predicting on the entire files):
You can see a big drop: 80% token accuracy, 58% sentence accuracy.
I'm not sure exactly why the rest of the nyt file is less accurate. Perhaps that file is not randomly ordered, i.e. there are different types of entries in different regions of the file.
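One way to check that hunch would be to compare the true-label distribution in the first 30,000 nyt sentences against the rest. A sketch, again with hypothetical column names (`source`, `sentence_id`, `true_label`) and assuming `sentence_id` reflects row order in the file:

```python
# Compare the label distribution of the nyt rows used for training (first
# 30,000) against the remainder of the file.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")
nyt = tokens[tokens["source"] == "nyt"]

in_training = nyt["sentence_id"] <= 30000
dist = pd.DataFrame({
    "first_30k": nyt.loc[in_training, "true_label"].value_counts(normalize=True),
    "remainder": nyt.loc[~in_training, "true_label"].value_counts(normalize=True),
}).fillna(0)

# Labels that are common in one half but rare or absent in the other would
# support the idea that the file is not homogeneous.
print(dist.sort_values("remainder", ascending=False))
```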
Another possibility is that the test set leaks into training. In a cursory review of the code, I see no evidence of this.
Below are a few quick observations, perhaps without enough context or explanation, as a teaser for future conversation.
A table of accuracy by label:
Broken down per source, it looks like nyt is the trouble spot (because it has data that isn't in the training set?):
A snapshot of the top 20 tokens with <= 50% accuracy, sorted descending by number of occurrences:
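Roughly how tables like the above could be produced from a per-token predictions csv (hypothetical columns `source`, `token`, `true_label`, `pred_label`); this is a sketch, not the exact script I used:

```python
# Per-label accuracy, per-source accuracy, and frequently-wrong tokens.
import pandas as pd

tokens = pd.read_csv("token_predictions.csv")
tokens["correct"] = tokens["pred_label"] == tokens["true_label"]

# Accuracy by true label.
by_label = tokens.groupby("true_label")["correct"].agg(["mean", "size"])
print(by_label.sort_values("mean"))

# Accuracy by source dataset.
by_source = tokens.groupby("source")["correct"].agg(["mean", "size"])
print(by_source)

# Tokens that are wrong at least half the time, most frequent first.
by_token = tokens.groupby("token")["correct"].agg(["mean", "size"])
trouble = by_token[by_token["mean"] <= 0.5]
print(trouble.sort_values("size", ascending=False).head(20))
```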
A snapshot of 20 examples of sentences containing tokens with <= 50% accuracy where the token is wrong, and not labeled COMMENT or OTHER:
Below is a snapshot of common dup sentences. Better get salt and pepper and olive oil right!
=====
I'll stop here.
Thoughts?