Spanish ingredient-parser #14

theripnono · 2024-04-01T06:36:16Z

theripnono
Apr 1, 2024

Hi, I have pulled the project to try to train a Spanish ingredient parser and contribute to your amazing work. However, I'm having some difficulty training the model because I don't understand how to do it.
I have made some changes in funcs.py, adding Spanish units and Spanish words for tokenization, and now the idea is to upload a CSV that I have scraped from internet sources.
Could you help me and explain the best way to train the model? Thank you!!

strangetom · 2024-04-01T16:11:34Z

strangetom
Apr 1, 2024
Maintainer

Hi @theripnono

The command to train a new model is

python train.py train --database train/data/training.sqlite3

For this to work well you will need to do a few things:

Create a database of training data, or modify the existing one. The database contains a table called training, which has the following fields:

id: a unique ID for each sentence (this isn't used for training)
source: where the sentence came from (this isn't used for training)
sentence: the ingredient sentence
tokens: the list of tokens from the sentence
labels: the list of labels for each token in the sentence

The hard part is making sure all the labels are correct and consistent.
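As a rough sketch of the table described above, the following creates a new SQLite database with a training table and inserts one labelled Spanish sentence. Storing tokens and labels as JSON text is an assumption made for illustration, as are the helper names (create_training_db, add_sentence) and the NAME and COMMENT labels; check the existing training.sqlite3 for the exact column types and label set before relying on this.

```python
import json
import sqlite3


def create_training_db(path: str) -> sqlite3.Connection:
    """Create a database with a `training` table (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS training (
            id INTEGER PRIMARY KEY,  -- unique ID, not used for training
            source TEXT,             -- where the sentence came from
            sentence TEXT,           -- the ingredient sentence
            tokens TEXT,             -- assumption: JSON-encoded list of tokens
            labels TEXT              -- assumption: JSON-encoded list of labels
        )
        """
    )
    return conn


def add_sentence(conn, source, sentence, tokens, labels):
    """Insert one labelled sentence, refusing mismatched token/label lists."""
    if len(tokens) != len(labels):
        raise ValueError("each token needs exactly one label")
    conn.execute(
        "INSERT INTO training (source, sentence, tokens, labels) "
        "VALUES (?, ?, ?, ?)",
        (
            source,
            sentence,
            json.dumps(tokens, ensure_ascii=False),
            json.dumps(labels, ensure_ascii=False),
        ),
    )
    conn.commit()


# In-memory database for demonstration; pass a file path for real use.
conn = create_training_db(":memory:")
add_sentence(
    conn,
    "example",
    "250 g de pechuga de pollo",
    ["250", "g", "de", "pechuga", "de", "pollo"],
    # Illustrative labels only; use the label set your training data follows.
    ["QTY", "UNIT", "COMMENT", "NAME", "NAME", "NAME"],
)
```

The length check in add_sentence catches one common labelling mistake early: a sentence whose tokens and labels have drifted out of step.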

Consider any modifications to the PreProcessor class. It sounds like you've made some modifications, but there may be more you need. The current PreProcessor is written for English sentences, so there might be some changes you need to make for Spanish. For example, the functions that convert fractions to decimals might need changing to use a decimal comma instead of a decimal point (1,5 instead of 1.5). You might also need to change how the tokenizer works so that the sentence is split correctly.
Consider any modifications to the PostProcessor class. A lot of the code here is focused on combining QTY and UNIT labels into amounts. This may not work well for Spanish, so I think a good starting point would be to change PostProcessor._postprocess_amounts() to

    def _postprocess_amounts(self) -> list[IngredientAmount]:
        funcs = [
            #self._sizable_unit_pattern,  # Comment this out
            #self._composite_amounts_pattern,  # Comment this out
            self._fallback_pattern,
        ]

        amounts = []
        for func in funcs:
            idx = self._unconsumed(list(range(len(self.tokens))))
            tokens = self._unconsumed(self.tokens)
            labels = self._unconsumed(self.labels)
            scores = self._unconsumed(self.scores)

            parsed_amounts = func(idx, tokens, labels, scores)
            amounts.extend(parsed_amounts)

        return sorted(amounts, key=lambda x: x._starting_index)

and see how well that performs.
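To illustrate the decimal comma point made for the PreProcessor, here is a minimal sketch of one way to normalise Spanish-style decimals before any number handling runs. The function normalise_decimal_comma is hypothetical and is not part of the PreProcessor class; it only rewrites a comma that sits directly between two digits.

```python
import re

# A comma with a digit immediately before and after it, e.g. the one in "1,5".
DECIMAL_COMMA = re.compile(r"(?<=\d),(?=\d)")


def normalise_decimal_comma(sentence: str) -> str:
    """Replace decimal commas with decimal points (hypothetical helper)."""
    return DECIMAL_COMMA.sub(".", sentence)


print(normalise_decimal_comma("1,5 kg de patatas, peladas"))
# prints "1.5 kg de patatas, peladas"
```

Commas followed by a space are left alone, so ordinary punctuation survives. A sentence that lists numbers without spaces (for example "1,2 o 3") would be converted incorrectly, so whether this rule is safe depends on how the scraped data is written.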

I hope this helps you get started. Feel free to ask more questions, and I hope you're able to make this work successfully.

1 reply

theripnono Apr 2, 2024
Author

Thank you for your quick reply!
Yes, I've made some changes to some functions to tokenise Spanish sentences. I will try your suggestions.

theripnono · 2024-04-10T11:18:19Z

theripnono
Apr 10, 2024
Author

Hi again, I'm working hard on the Spanish dataset (around 20k rows) for training the model. I'm labelling the tokenized sentences and two questions come to mind:

Should I remove duplicate rows? e.g.:

and again I have
Once I have finished labelling, do I train the existing model or do I have to train the model from scratch?

Thank you!

For now I'm labelling as shown in the picture below:

3 replies

strangetom Apr 10, 2024
Maintainer

The labelling looks good.

Should I remove duplicate rows?

Having a small number of duplicates will be fine. If you have lots, then it won't be helpful for training a model that works well for sentences it's not been trained on, but you might get performance statistics that make it look good.

When I added the bbc and cookstr datasets to the training data, I did try to make each sentence unique. However, about 20% of the data currently used to train the model is duplicated, which is something I should look into.
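A quick way to see how much exact duplication a dataset contains is to count repeated sentences. The sketch below is an illustration only: duplicate_fraction and drop_exact_duplicates are hypothetical helpers, and treating sentences as equal after stripping whitespace and lowercasing is an assumption you may want to change.

```python
from collections import Counter


def _key(sentence: str) -> str:
    # Assumption: case and surrounding whitespace do not make a sentence unique.
    return sentence.strip().lower()


def duplicate_fraction(sentences: list[str]) -> float:
    """Fraction of rows that repeat an earlier sentence exactly."""
    if not sentences:
        return 0.0
    counts = Counter(_key(s) for s in sentences)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(sentences)


def drop_exact_duplicates(sentences: list[str]) -> list[str]:
    """Keep the first occurrence of each sentence, preserving order."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = _key(sentence)
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


rows = [
    "250 g de pechuga de pollo",
    "500 g de pechuga de pollo",
    "250 g de pechuga de pollo",
    "1 cebolla",
]
print(duplicate_fraction(rows))          # prints 0.25
print(len(drop_exact_duplicates(rows)))  # prints 3
```

Sentences that differ only in quantity count as distinct here, so they are kept.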

Once I have finished labelling, do I train the existing model or do I have to train the model from scratch?

I think you'll be best training a new model. With 20,000 sentences, it should only take a few minutes. The command you want will be something like

$ python train.py train --database path/to/your/database --model model.es.crfsuite

You can add additional options like --html to output an html file with the evaluation sentences the model got wrong and --detailed to output files containing statistics about the tokens and labels in the evaluation data.

You will need to edit the load_datasets function in train/training_utils.py so the sql on line 102 is selecting the data from the correct table name.

If you're able to get it working and you document any changes you had to make to train the model, then I would be happy to make modifications here so it's easier to train models for other languages.

theripnono Apr 11, 2024
Author

Having a small number of duplicates will be fine. If you have lots, then it won't be helpful for training a model that works well for sentences it's not been trained on, but you might get performance statistics that make it look good.

Well, I have similar sentences because some recipes use the same ingredient name but a different quantity. For example:
[250,g,de,pechuga,de,pollo]
[500,g,de,pechuga,de,pollo]

If you're able to get it working and you document any changes you had to make to train the model, then I would be happy to make modifications here so it's easier to train models for other languages.

For now I've made some changes in the code to stem Spanish words. I've added a lang parameter:

def parse_ingredient(
    sentence: str,
    discard_isolated_stop_words: bool = True,
    lang: str = None,
) -> ParsedIngredient:

and in funcs.py I've added:

from nltk.stem.snowball import SnowballStemmer

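def stem(token: str, lang: str) -> str:
    # No language given: use the existing module-level STEMMER.
    if not lang:
        return STEMMER.stem(token)

    # Reject languages outside the list of supported Snowball stemmers.
    supported = ['arabic', 'danish', 'dutch', 'finnish', 'french', 'german',
                 'hungarian', 'italian', 'norwegian', 'portuguese',
                 'romanian', 'russian', 'spanish', 'swedish']
    if lang not in supported:
        raise ValueError(f"Unsupported stemmer language: {lang}")

    snowball_stemmer = SnowballStemmer(lang)
    return snowball_stemmer.stem(token)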

Also, as you can imagine, in _constants.py I've added Spanish units, words, etc. for better tokenization.

I will focus on the dataset. It will take me a long time, I guess...

strangetom Apr 11, 2024
Maintainer

Similar sentences like the ones you mentioned are fine and I think are useful to help reinforce the correct labelling in the model. It's only when there are lots of the exact same sentence that I think it might cause problems and be worth doing something about.

Thanks for those changes. I will have a look to see how I can incorporate them into the library. I don't have any specific plans to support multiple languages yet, but if I can make it easier to do so then I will.

I will focus on the dataset. It will take me a long time, I guess...

Unfortunately that is the case. You should be able to check that it's going to work with about 1000-2000 sentences. It won't be as good as the model trained on 20,000 sentences, but should give you confidence.
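One way to run that early check is to draw a reproducible random subset of the labelled rows and train on it first. This is only a sketch: sample_rows is a hypothetical helper, and the rows can be whatever structure you load your labelled data into.

```python
import random


def sample_rows(rows: list, n: int, seed: int = 0) -> list:
    """Return up to n rows chosen at random, reproducibly for a given seed."""
    if n >= len(rows):
        return list(rows)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(rows, n)


# Stand-in data; replace with your labelled sentences.
all_rows = [f"sentence {i}" for i in range(20000)]
subset = sample_rows(all_rows, 1500)
print(len(subset))  # prints 1500
```

Fixing the seed means the same subset is drawn each time, which makes it easier to compare results after changing the tokenizer or the labels.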

theripnono · 2024-05-03T15:58:18Z

theripnono
May 3, 2024
Author

Hi again! I have good news! I've built a small utility using GPT and langchain because I didn't want to spend too much time creating the dataset.
In my first test I created 5000 rows of data, so I expect to have the 20k-row Spanish dataset this week.

I hope to give you news ASAP :)

0 replies

theripnono · 2024-05-04T07:54:23Z

theripnono
May 4, 2024
Author

Hi again @strangetom, sorry to keep asking so many questions... I'm trying to train the model, but I don't know the steps I need to follow.
I have my labelled CSV, "train.csv", which has this info:

I'm unsure because your CSV files have a different format:

e.g.:

Do I need to create another "training.sqlite3" file with my data, or should I insert my data into the same database?

What steps should I take?
Thanks Tom!!

3 replies

strangetom May 4, 2024
Maintainer

I recommend creating a new database with your data in it. I had originally thought that a single database could have data for multiple languages but I don't think that will work well with git, so separate databases would be better.

The csv files that I have in the repository have the entire datasets for each of those sources. Only a subset of that data is properly labelled and put in the database for training the model.

theripnono May 4, 2024
Author

I recommend creating a new database with your data in it. I had originally thought that a single database could have data for multiple languages but I don't think that will work well with git, so separate databases would be better.

So I need to create a db with the name training_spa.sqlite3 for example and run the command as:

$ python train.py train --database train/data/training_spa.sqlite3

I can add a folder with the csv so other people can use it, can't I?

strangetom May 4, 2024
Maintainer

Yep, that should do it.

I can add a folder with the csv so other people can use it, can't I?

I think that would be helpful. It would also be good if you could add a README.md alongside it explaining where the data came from - you can look in the existing folders for examples.
