docs/source/explanation/features.rst (1 addition, 35 deletions)
@@ -107,11 +107,7 @@ Semantic features

 Semantic features are determined from the meaning of the token.
 In practice this means making use of word embeddings, which are a method to encode a word as a numeric vector in such a way that the vectors for words with similar meanings are clustered close together.

-An embeddings model has been trained using `floret <https://github.com/explosion/floret>`_ from the same data used to train the sequence tagging model.
-This model encodes words as 10-dimensional vectors (chosen to reduce the size of the model).
-For each token, the corresponding 10-dimensional vector can be calculated and used as a feature.
-
-Due to limitations of the `python-crfsuite <https://github.com/scrapinghub/python-crfsuite>`_ library, which cannot make use of features that are lists, each dimension of the vector is turned into a separate feature.
+Currently, semantic features are not used by the parser model, but this is being investigated.
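For illustration, the removed paragraph above describes flattening an embedding vector into per-dimension scalar features. A minimal sketch of that idea, assuming a trained floret model loaded through the floret Python bindings (the model path, token values, and helper name are illustrative assumptions, not the project's actual code):

.. code-block:: python

    import floret  # Python bindings for the floret embeddings library

    # Hypothetical path; assumes a 10-dimensional model as described above.
    model = floret.load_model("embeddings.floret.bin")

    def vector_features(token: str, prefix: str = "") -> dict:
        """Flatten a token's embedding into one scalar feature per dimension,
        since python-crfsuite cannot accept list-valued features."""
        vector = model.get_word_vector(token)
        return {f"{prefix}v{i}": float(v) for i, v in enumerate(vector)}

    features = {"word_shape": "xxx", **vector_features("zest")}  # v0 ... v9
    features.update(vector_features("fresh", prefix="prev_"))    # prev_v0 ... prev_v9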

 Example
 ^^^^^^^
@@ -136,16 +132,6 @@ Below is an example of the features generated for one of the tokens in an ingredient sentence.

     'is_after_comma': False,
     'is_after_plus': False,
     'word_shape': 'xxx',
-    'v0': -0.139836490154,
-    'v1': 0.335813522339,
-    'v2': 0.772642672062,
-    'v3': -0.165960505605,
-    'v4': 0.16534408927,
-    'v5': -0.356404691935,
-    'v6': 0.335878640413,
-    'v7': -0.614531040192,
-    'v8': 0.474092006683,
-    'v9': -0.137665584683,
     'prev_stem': '!num',
     'prev_pos': 'CD+NN',
     'prev_is_capitalised': False,
@@ -156,16 +142,6 @@ Below is an example of the features generated for one of the tokens in an ingredient sentence.

     'prev_is_after_comma': False,
     'prev_is_after_plus': False,
     'prev_word_shape': '!xxx',
-    'prev_v0': -0.228524670005,
-    'prev_v1': 0.118124544621,
-    'prev_v2': 0.474654018879,
-    'prev_v3': 0.006919545121,
-    'prev_v4': 0.293126374483,
-    'prev_v5': -0.280303806067,
-    'prev_v6': 0.479749411345,
-    'prev_v7': -0.370705068111,
-    'prev_v8': -0.055196929723,
-    'prev_v9': -0.28187289834,
     'next_stem': 'orang',
     'next_pos': 'NN+NN',
     'next_is_capitalised': False,

@@ -176,16 +152,6 @@ Below is an example of the features generated for one of the tokens in an ingredient sentence.
@@ -69,7 +71,8 @@ In these cases, we still want to select the correct entry.

 The approach taken attempts to match ingredient names to :abbr:`FDC(Food Data Central)` entries based on semantic similarity, that is, selecting the entry that is closest in meaning to the ingredient name even where the words used are not identical.
 Two semantic matching techniques are used, based on [Ethayarajh]_ and [Morales-Garzón]_.
-Both techniques make use of the word embeddings model that is also used to provide semantic features for the parser model.
+Both techniques make use of a word embeddings model.
+A `GloVe <https://nlp.stanford.edu/projects/glove/>`_ embeddings model, trained on text from a large corpus of recipes, is used to provide the information for the semantic similarity techniques.
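As a rough sketch of what using such a model involves (not the package's actual loading code), GloVe's plain-text distribution format, one word followed by its vector per line, can be read into a word-to-vector mapping; the file name here is an assumption:

.. code-block:: python

    import numpy as np

    def load_glove(path: str) -> dict:
        """Read a GloVe plain-text file: a word then its vector on each line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                vectors[word] = np.asarray(values, dtype=np.float32)
        return vectors

    embeddings = load_glove("recipe_corpus.glove.txt")  # hypothetical file name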

 Unsupervised Smooth Inverse Frequency
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -82,13 +85,8 @@ This approach is applied to the descriptions for each of the :abbr:`FDC(Food Data Central)` entries.

 The best match is selected using the cosine similarity metric.

 In practice, this technique is generally pretty good at finding a reasonable matching :abbr:`FDC(Food Data Central)` entry.
-However, in some cases the best matching entry is not an appropriate match.
-There may be two causes for this:
-
-1. The quality of the embeddings is not good enough for this technique to be more robust.
-
-2. Not removing the common component between vectors is causing worse performance than if it was removed.
+However, in some cases the match with the best score is not an appropriate match.
+This is likely due to limitations in the quality of the embeddings used.
90
93
91
Fuzzy Document Distance
94
92
~~~~~~~~~~~~~~~~~~~~~~~
@@ -104,7 +102,7 @@ Combined

 The two techniques are combined to perform the matching of an ingredient name to an :abbr:`FDC(Food Data Central)` entry.

-First, :abbr:`uSIF(Unsupervised Smooth Inverse Frequency)` is used to down select a list of candidate matches from the full set of :abbr:`FDC(Food Data Central)` entries.
+First, :abbr:`uSIF(Unsupervised Smooth Inverse Frequency)` is used to down select a list of *n* candidate matches from the full set of :abbr:`FDC(Food Data Central)` entries.

 Second, the fuzzy document distance is calculated for the down selected candidate matches.
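A sketch of the two-stage combination described above, with ``similarity`` and ``distance`` standing in for the uSIF and fuzzy document distance implementations; the function names and default value of *n* are assumptions, not the library's API:

.. code-block:: python

    def match_entry(name: str, entries: list, similarity, distance, n: int = 10):
        """Two-stage matching: down select with uSIF, then rank by fuzzy distance."""
        # Stage 1: keep the n entries most semantically similar to the name.
        candidates = sorted(entries, key=lambda e: similarity(name, e), reverse=True)[:n]
        # Stage 2: of those candidates, return the entry at the smallest distance.
        return min(candidates, key=lambda e: distance(name, e))

    # Usage (hypothetical scoring functions):
    # match_entry("red onion", fdc_descriptions, usif_similarity, fuzzy_document_distance)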
@@ -120,7 +118,7 @@ The current implementation has some limitations.

 Work is ongoing to improve this, and suggestions and contributions are welcome.

 #. Enabling this functionality is much slower than when not enabled.
-   The foundation foods functionality is roughly 80x slower.
+   When enabled, parsing a sentence is roughly 75x slower than when it is disabled.