Commit 46016e0

committed

Docs: Update docs

- features: remove semantic features
- foundation: update details about foundation food matching

1 parent b4aac61 · commit 46016e0

File tree

3 files changed: +18 −54 lines changed


docs/source/explanation/features.rst

Lines changed: 1 addition & 35 deletions

@@ -107,11 +107,7 @@ Semantic features
 Semantic features are determined from the meaning of the token.
 In practice this means making use of word embeddings, which are a method to encode a word as a numeric vector in such a way that the vectors for words with similar meanings are clustered close together.

-An embeddings model has been trained using `floret <https://github.com/explosion/floret>`_ from the same data used to train the sequence tagging model.
-This model encodes words as 10-dimensional vectors (chosen to reduce the size of the model).
-For each token, the corresponding 10-dimensional vector can be calculated and used as a feature.
-
-Due to limitations of `python-crfsuite <https://github.com/scrapinghub/python-crfsuite>`_, which cannot make use of features that are lists, each dimension of the vector is turned into a separate feature.
+Currently semantic features are not used as features for the parser model, but this is being investigated.

 Example
 ^^^^^^^
@@ -136,16 +132,6 @@ Below is an example of the features generated for one of the tokens in an ingred
 'is_after_comma': False,
 'is_after_plus': False,
 'word_shape': 'xxx',
-'v0': -0.139836490154,
-'v1': 0.335813522339,
-'v2': 0.772642672062,
-'v3': -0.165960505605,
-'v4': 0.16534408927,
-'v5': -0.356404691935,
-'v6': 0.335878640413,
-'v7': -0.614531040192,
-'v8': 0.474092006683,
-'v9': -0.137665584683,
 'prev_stem': '!num',
 'prev_pos': 'CD+NN',
 'prev_is_capitalised': False,
@@ -156,16 +142,6 @@ Below is an example of the features generated for one of the tokens in an ingred
 'prev_is_after_comma': False,
 'prev_is_after_plus': False,
 'prev_word_shape': '!xxx',
-'prev_v0': -0.228524670005,
-'prev_v1': 0.118124544621,
-'prev_v2': 0.474654018879,
-'prev_v3': 0.006919545121,
-'prev_v4': 0.293126374483,
-'prev_v5': -0.280303806067,
-'prev_v6': 0.479749411345,
-'prev_v7': -0.370705068111,
-'prev_v8': -0.055196929723,
-'prev_v9': -0.28187289834,
 'next_stem': 'orang',
 'next_pos': 'NN+NN',
 'next_is_capitalised': False,
@@ -176,16 +152,6 @@ Below is an example of the features generated for one of the tokens in an ingred
 'next_is_after_comma': False,
 'next_is_after_plus': False,
 'next_word_shape': 'xxxxxx',
-'next_v0': -0.988151550293,
-'next_v1': 1.244541049004,
-'next_v2': -0.004523974378,
-'next_v3': 0.618911862373,
-'next_v4': 0.682275772095,
-'next_v5': 0.035868640989,
-'next_v6': -0.350227534771,
-'next_v7': -1.441177010536,
-'next_v8': -1.112710833549,
-'next_v9': 0.280764371157,
 'next2_stem': 'juic',
 'next2_pos': 'NN+NN+NN',
 'next2_is_capitalised': False,
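The clustering property of word embeddings described above can be illustrated with cosine similarity. This is a self-contained sketch, not the project's embeddings model: the vectors below are invented 3-dimensional values chosen purely for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 for
    similar directions, close to 0.0 for unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented embeddings; a real model would use many more dimensions.
embeddings = {
    "basil": [0.9, 0.1, 0.2],
    "oregano": [0.8, 0.2, 0.3],  # herb: close in direction to "basil"
    "spoon": [0.1, 0.9, -0.4],   # utensil: far from the herbs
}

# Words with similar meanings score higher than unrelated words.
herb_pair = cosine_similarity(embeddings["basil"], embeddings["oregano"])
mixed_pair = cosine_similarity(embeddings["basil"], embeddings["spoon"])
assert herb_pair > mixed_pair
```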

docs/source/explanation/foundation.rst

Lines changed: 15 additions & 17 deletions

@@ -44,15 +44,17 @@ the :func:`parse_ingredient <ingredient_parser.parsers.parse_ingredient>` functi
 comment=None,
 purpose=None,
 foundation_foods=[FoundationFood(text='Basil, fresh',
-                                 confidence=0.837206,
+                                 confidence=0.862222,
                                  fdc_id=172232,
                                  category='Spices and Herbs',
-                                 data_type='sr_legacy_food'),
-                  FoundationFood(text='Spices, basil, dried',
-                                 confidence=0.836425,
-                                 fdc_id=171317,
-                                 category='Spices and Herbs',
-                                 data_type='sr_legacy_food')],
+                                 data_type='sr_legacy_food',
+                                 url='https://fdc.nal.usda.gov/food-details/172232/nutrients'),
+                  FoundationFood(text='Spices, basil, dried',
+                                 confidence=0.856791,
+                                 fdc_id=171317,
+                                 category='Spices and Herbs',
+                                 data_type='sr_legacy_food',
+                                 url='https://fdc.nal.usda.gov/food-details/171317/nutrients')],
 sentence='24 fresh basil leaves or dried basil'
 )
@@ -69,7 +71,8 @@ In these cases, we still want to select the correct entry.

 The approach taken attempts to match ingredient names to :abbr:`FDC (Food Data Central)` entries based on semantic similarity, that is, selecting the entry that is closest in meaning to the ingredient name even where the words used are not identical.
 Two semantic matching techniques are used, based on [Ethayarajh]_ and [Morales-Garzón]_.
-Both techniques make use of the word embeddings model that is also used to provide semantic features for the parser model.
+Both techniques make use of a word embeddings model.
+A `GloVe <https://nlp.stanford.edu/projects/glove/>`_ embeddings model, trained on text from a large corpus of recipes, is used to provide the information for the semantic similarity techniques.

 Unsupervised Smooth Inverse Frequency
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -82,13 +85,8 @@ This approach is applied to the descriptions for each of the :abbr:`FDC (Food Da
 The best match is selected using the cosine similarity metric.

 In practice, this technique is generally pretty good at finding a reasonable matching :abbr:`FDC (Food Data Central)` entry.
-However, in some cases the best matching entry is not an appropriate match.
-There may be two causes for this:
-
-1. The quality of the embeddings is good enough for this technique to be more robust.
-
-2. Not removing the common component between vectors is causing worse performance than if it was removed.
-
+However, in some cases the match with the best score is not an appropriate match.
+This is likely due to limitations in the quality of the embeddings used.
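The weighting idea behind :abbr:`uSIF (Unsupervised Smooth Inverse Frequency)` can be sketched as a frequency-weighted average of word vectors. This is a simplified SIF-style illustration, not the library's implementation: the real uSIF method [Ethayarajh]_ also estimates the weighting parameter from the corpus and can subtract common components, both omitted here, and all vectors and probabilities below are invented.

```python
def sif_embedding(
    tokens: list[str],
    word_vectors: dict[str, list[float]],
    word_probs: dict[str, float],
    a: float = 1e-3,
) -> list[float]:
    """Weighted average of word vectors, where frequent words
    (high estimated probability p(w)) get a lower weight a / (a + p(w)),
    so rarer, more informative words dominate the sentence vector."""
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    for token in tokens:
        weight = a / (a + word_probs.get(token, a))
        vector = word_vectors.get(token, [0.0] * dims)
        total = [t + weight * v for t, v in zip(total, vector)]
    return [t / len(tokens) for t in total]

# Invented 2-dimensional vectors and word probabilities.
vectors = {"dried": [0.2, 0.5], "basil": [0.9, 0.1]}
probs = {"dried": 0.01, "basil": 0.001}  # "dried" is the more common word

sentence_vector = sif_embedding(["dried", "basil"], vectors, probs)
```

Because "basil" is rarer than "dried", it receives a larger weight and pulls the sentence vector towards its own direction; the best match is then found by comparing such sentence vectors with cosine similarity.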
 Fuzzy Document Distance
 ~~~~~~~~~~~~~~~~~~~~~~~

@@ -104,7 +102,7 @@ Combined

 The two techniques are combined to perform the matching of an ingredient name to an :abbr:`FDC (Food Data Central)` entry.

-First, :abbr:`uSIF (Unsupervised Smooth Inverse Frequency)` is used to down-select a list of candidate matches from the full set of :abbr:`FDC (Food Data Central)` entries.
+First, :abbr:`uSIF (Unsupervised Smooth Inverse Frequency)` is used to down-select a list of *n* candidate matches from the full set of :abbr:`FDC (Food Data Central)` entries.

 Second, the fuzzy document distance is calculated for the down-selected candidate matches.
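The two-stage combination described above might be sketched as follows. The scoring functions are passed in as callables; `toy_usif` and `toy_distance` are crude word-overlap stand-ins for illustration only, not the library's actual uSIF or fuzzy document distance implementations.

```python
def match_ingredient(name, fdc_entries, usif_score, fuzzy_distance, n=10):
    """Two-stage match: uSIF down-select, then fuzzy re-rank.

    usif_score(name, entry) -> similarity (higher is better)
    fuzzy_distance(name, entry) -> distance (lower is better)
    """
    # Stage 1: keep only the n entries with the highest uSIF similarity.
    candidates = sorted(
        fdc_entries, key=lambda e: usif_score(name, e), reverse=True
    )[:n]
    # Stage 2: among those candidates, return the entry with the
    # smallest fuzzy document distance.
    return min(candidates, key=lambda e: fuzzy_distance(name, e))

# Toy scorers based on shared words, for illustration only.
def toy_usif(name, entry):
    return len(set(name.split()) & set(entry.split()))

def toy_distance(name, entry):
    return len(set(entry.split()) - set(name.split()))

entries = ["spices basil dried", "basil fresh", "oil olive"]
best = match_ingredient("fresh basil", entries, toy_usif, toy_distance, n=2)
# best == "basil fresh"
```

Running the cheap down-select over the full entry set and the more expensive distance only over the *n* survivors keeps the combined matching tractable.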

@@ -120,7 +118,7 @@ The current implementation has some limitations.
 Work is ongoing to improve this, and suggestions and contributions are welcome.

 #. Enabling this functionality is much slower than when not enabled.
-   The foundation foods functionality is roughly 80x slower.
+   When enabled, parsing a sentence is roughly 75x slower than if disabled.

 References
 ^^^^^^^^^^

ingredient_parser/en/_foundationfoods.py

Lines changed: 2 additions & 2 deletions

@@ -295,12 +295,12 @@ def find_candidate_matches(
     tokens : list[str]
         List of tokens.
     n : int
-        Number of matches to return, sorted by score.
+        Number of matches to return, sorted by best score.

     Returns
     -------
     list[FDCIngredientMatch]
-        List of candidate matching FDC ingredients.
+        List of best n candidate matching FDC ingredients.
     """
     prepared_tokens = prepare_embeddings_tokens(tuple(tokens))
     input_token_vector = self._embed(prepared_tokens)
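The docstring above promises the best *n* candidates sorted by score. A self-contained sketch of that kind of top-n selection is shown below; `Match` is a simplified stand-in for the library's `FDCIngredientMatch`, and the vectors are invented, not real embeddings.

```python
import heapq
import math
from dataclasses import dataclass

@dataclass
class Match:
    """Simplified stand-in for FDCIngredientMatch."""
    description: str
    score: float

def find_top_n(
    input_vector: list[float],
    fdc_vectors: dict[str, list[float]],
    n: int,
) -> list[Match]:
    """Score every FDC entry by cosine similarity to the input vector
    and return the best n matches, highest score first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (
            math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        )

    scored = [Match(desc, cosine(input_vector, vec)) for desc, vec in fdc_vectors.items()]
    # heapq.nlargest avoids sorting the full entry list when n is small.
    return heapq.nlargest(n, scored, key=lambda m: m.score)

matches = find_top_n(
    [1.0, 0.0],
    {"basil": [0.9, 0.1], "olive oil": [0.1, 0.9], "oregano": [0.8, 0.3]},
    n=2,
)
# matches: "basil" then "oregano", the two entries closest to the input
```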
