docs/source/explanation/foundation.rst (+48, -30)
@@ -3,7 +3,7 @@ Foundation foods

 .. versionchanged:: v2.1.0

-    This functionality now matches ingredients to the FDC database.
+    This functionality matches ingredients to the FDC database.

 Whilst the changes are API compatible with earlier versions, the contents of the fields in the :class:`FoundationFood <ingredient_parser.dataclasses.FoundationFood>` objects are different.
@@ -23,17 +23,17 @@ the :func:`parse_ingredient <ingredient_parser.parsers.parse_ingredient>` functi

     >>> parse_ingredient("1 large organic cucumber", foundation_foods=True)

...

 The matching of the ingredient names to entries in the :abbr:`FDC(Food Data Central)` database is a difficult problem.
-This is due to the descriptions of the entries in the :abbr:`FDC(Food Data Central)` database being quite different to the way the ingredients are commonly referred to in recipes.
+The descriptions of the entries in the :abbr:`FDC(Food Data Central)` database are quite different to the way the ingredients are commonly referred to in recipes, therefore matching one to the other is a challenge.
+For example, the matching entry for **spring onions** has a description of **Onions, spring or scallions (includes tops and bulb), raw**.
-For example, the matching entry for **red pepper** has a description of **peppers, bell, red, raw**.
+Typical fuzzy matching approaches that use character or token level changes to score matches will not work well because of the difference in string lengths.
+In addition, there may be more than one word for a particular ingredient, and the word used in the :abbr:`FDC(Food Data Central)` entry description might not be the same as the word used in the ingredient sentence.
+In these cases, we still want to select the correct entry.
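
A minimal sketch of why character-level scoring struggles here, using Python's standard-library ``difflib`` and the strings from the example above (the threshold remark in the comment is an assumption about typical usage, not a library value):

```python
from difflib import SequenceMatcher

name = "spring onions"
description = "Onions, spring or scallions (includes tops and bulb), raw"

# SequenceMatcher.ratio() returns 2*M/T, where M is the number of
# matching characters and T the combined length of both strings, so
# the long description drags the score down even though this entry
# is the correct match.
score = SequenceMatcher(None, name.lower(), description.lower()).ratio()
print(round(score, 2))  # well below any sensible acceptance threshold
```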
-The approach taken is based on the paper `A Word Embedding Model for Mapping Food Composition Databases Using Fuzzy Logic <https://dx.doi.org/10.1007/978-3-030-50143-3_50>`_.
-The same word embeddings model used to provide semantic features for the :abbr:`CRF(Conditional Random Fields)` parser model is used provide the word vectors for each token in the ingredient name and each token in the description of the :abbr:`FDC(Food Data Central)` entries.
-These vectors are used to compute a fuzzy distance score between the ingredient name tokens and each :abbr:`FDC(Food Data Central)` entry, which is used to find the best matching :abbr:`FDC(Food Data Central)` entry.
+The approach taken attempts to match ingredient names to :abbr:`FDC(Food Data Central)` entries based on semantic similarity, that is, selecting the entry that is closest in meaning to the ingredient name even where the words used are not identical.
+Two semantic matching techniques are used, based on [Ethayarajh]_ and [Morales-Garzón]_.
+Both techniques make use of the word embeddings model that is also used to provide semantic features for the parser model.

-The full process is as follows:
+Unsupervised Smooth Inverse Frequency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-#. Load the :abbr:`FDC(Food Data Central)` data. Tokenize the description for each entry and remove tokens that don't provide useful semantic information\*.
+The technique described in [Ethayarajh]_ is called Unsupervised Smooth Inverse Frequency (uSIF).
+This technique calculates an embedding vector for a sentence from the weighted vectors of the words, where the weight is related to the probability of encountering the word (related to the inverse frequency of the word).
+The technique also removes common components in the word vectors, although this is not implemented here (primarily to avoid a further runtime dependency on sklearn; this may change in the future if it proves to be helpful).

-#. Prepare the ingredient name tokens in the same way.
+This approach is applied to the description of each of the :abbr:`FDC(Food Data Central)` entries and to the ingredient name we are trying to match.
+The best match is selected using the cosine similarity metric.
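
A rough sketch of this weighting-plus-cosine step, assuming word vectors and unigram probabilities are already loaded; the function names and the smoothing constant ``a`` are illustrative, not the library's actual API, and the common-component removal is omitted to match the text above:

```python
import numpy as np

def usif_embedding(tokens, vectors, word_prob, a=1e-3):
    """Weighted average of word vectors; rarer words get more weight.

    vectors: dict mapping token -> np.ndarray embedding
    word_prob: dict mapping token -> unigram probability
    a: smoothing constant (illustrative; uSIF estimates this from data)
    Assumes at least one token has a vector.
    """
    weighted = [
        (a / (a + word_prob.get(tok, 1e-6))) * vectors[tok]
        for tok in tokens
        if tok in vectors
    ]
    return np.mean(weighted, axis=0)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_match(name_tokens, entries, vectors, word_prob):
    # Select the entry whose description embedding is closest in
    # meaning to the ingredient name embedding.
    name_vec = usif_embedding(name_tokens, vectors, word_prob)
    return max(
        entries,  # each entry assumed to carry a .tokens attribute
        key=lambda e: cosine_similarity(
            name_vec, usif_embedding(e.tokens, vectors, word_prob)
        ),
    )
```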
-#. Check the ingredient name tokens against a list of override matches and return the matching result if there is one.
-   This is done because this approach does not work very well if the ingredient name only contains a single token.
+In practice, this technique is generally pretty good at finding a reasonable matching :abbr:`FDC(Food Data Central)` entry.
+However, in some cases the best matching entry is not an appropriate match.
+There may be two causes for this:

-#. Iterate through each of the :abbr:`FDC(Food Data Central)` data types in turn, in order of preference.
+1. The quality of the embeddings is not good enough for this technique to be robust.

-#. Compute the fuzzy distance score between each :abbr:`FDC(Food Data Central)` entry and the ingredient name tokens.
+2. Not removing the common component between vectors is causing worse performance than if it was removed.

-#. Sort the :abbr:`FDC(Food Data Central)` entries by the fuzzy distance score.

-#. If the lowest (best) score is below the threshold, return the :class:`FoundationFood <ingredient_parser.dataclasses.FoundationFood>` object for the corresponding :abbr:`FDC(Food Data Central)` entry.
+Fuzzy Document Distance
+~~~~~~~~~~~~~~~~~~~~~~~

-#. If best score is not below the threshold, store the best entry and it's score for fallback matching.
+The fuzzy document distance metric is described in [Morales-Garzón]_.
+Each sentence is considered as a set of tokens, and the distance is calculated from the Euclidean distances between tokens in the two sentences being compared.
+By considering the embedding vector for each token individually, this metric yields different results to :abbr:`uSIF(Unsupervised Smooth Inverse Frequency)` but is quite effective nonetheless.

-#. If none of the :abbr:`FDC(Food Data Central)` datasets contained a good enough match, attempt fallback matching.
+The results using this approach are more explainable than the results from :abbr:`uSIF(Unsupervised Smooth Inverse Frequency)`; however, the implementation of this metric has the downside of being significantly slower.
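
A sketch of a nearest-token distance of this general family; the exact formulation in [Morales-Garzón]_ differs in its details, so treat this as illustrative:

```python
import numpy as np

def token_distance(vec, other_vecs):
    # Distance from one token to the other sentence: the Euclidean
    # distance to its nearest token in that sentence.
    return min(np.linalg.norm(vec - other) for other in other_vecs)

def fuzzy_document_distance(vecs_a, vecs_b):
    # Average nearest-token distance in both directions so the
    # measure is symmetric. Lower means more similar.
    a_to_b = np.mean([token_distance(v, vecs_b) for v in vecs_a])
    b_to_a = np.mean([token_distance(v, vecs_a) for v in vecs_b])
    return (a_to_b + b_to_a) / 2
```

Because every token in one sentence is compared against every token in the other, applying this across a whole database of entries is what makes the metric slow.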
-#. Sort the best matches from each :abbr:`FDC(Food Data Central)` data set.
+Combined
+~~~~~~~~

-#. If the score for the best of these matches is below a threshold, return this match.
+The two techniques are combined to perform the matching of an ingredient name to an :abbr:`FDC(Food Data Central)` entry.

-#. If no match is good enough, return ``None``.
+First, :abbr:`uSIF(Unsupervised Smooth Inverse Frequency)` is used to down-select a list of candidate matches from the full set of :abbr:`FDC(Food Data Central)` entries.

-.. note::
+Second, the fuzzy document distance is calculated for the down-selected candidate matches.

-   \*Tokens that do not provide useful semantic information are as follows: numbers, white space, punctuation, stop words, single character words.
+Finally, the best scoring match is selected, accounting for the preference in :abbr:`FDC(Food Data Central)` data type.
+In summary, if there are other :abbr:`FDC(Food Data Central)` entries with fuzzy document distances very similar to the best, then the entry is selected based on the preferred data type rather than just the best score.
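
One way the two stages could fit together, reusing the sketch functions above; the candidate count, tie margin, and data-type preference order are all assumptions for illustration, not the library's actual values:

```python
# Assumed preference order, best first; not the library's actual list.
DATA_TYPE_PREFERENCE = ["foundation_food", "sr_legacy_food", "survey_fndds_food"]

def match_foundation_food(name_tokens, entries, vectors, word_prob,
                          n_candidates=50, tie_margin=0.05):
    # Stage 1: cheap uSIF pass to down-select candidates.
    name_vec = usif_embedding(name_tokens, vectors, word_prob)
    candidates = sorted(
        entries,
        key=lambda e: -cosine_similarity(
            name_vec, usif_embedding(e.tokens, vectors, word_prob)
        ),
    )[:n_candidates]

    # Stage 2: expensive fuzzy document distance on the candidates only.
    name_vecs = [vectors[t] for t in name_tokens if t in vectors]
    scored = sorted(
        (fuzzy_document_distance(
            name_vecs, [vectors[t] for t in e.tokens if t in vectors]), e)
        for e in candidates
    )

    # Among near-tied entries, prefer the better FDC data type rather
    # than the raw score alone.
    best_dist = scored[0][0]
    near_ties = [e for d, e in scored if d - best_dist <= tie_margin]
    return min(near_ties, key=lambda e: DATA_TYPE_PREFERENCE.index(e.data_type))
```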
 Limitations
 ^^^^^^^^^^^
@@ -108,5 +119,12 @@ The current implementation has some limitations.
 #. The fuzzy distance scoring will sometimes result in returning an :abbr:`FDC(Food Data Central)` entry that has a good score but is not a good match.
    Work is ongoing to improve this, and suggestions and contributions are welcome.

-#. This functionality can be very slow.
-   The more datasets that need to be checked to find a good match, the slower it will be.
+#. Enabling this functionality makes parsing much slower.
+   The foundation foods functionality is roughly 80x slower.
+
+References
+^^^^^^^^^^
+
+.. [Ethayarajh] Kawin Ethayarajh. 2018. Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 91–100, Melbourne, Australia. Association for Computational Linguistics. https://aclanthology.org/W18-3012/
+
+.. [Morales-Garzón] Morales-Garzón, A., Gómez-Romero, J., Martin-Bautista, M.J. (2020). A Word Embedding Model for Mapping Food Composition Databases Using Fuzzy Logic. In: Lesot, M.J., et al. Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2020. Communications in Computer and Information Science, vol 1238. Springer, Cham. https://doi.org/10.1007/978-3-030-50143-3_50
docs/source/resources/index.rst (+4, -0)
@@ -58,3 +58,7 @@ Papers

   Food is one of the main health and environmental factors in today's society. With modernization the food supply is expanding and food-related data is increasing. This type of data comes in many different forms, and making it inter-operable is one of the main requirements for use in any kind of analysis. One step towards this goal is normalization of data coming from different sources. Food-related data is collected regarding various aspects: food composition, food consumption, recipe data, etc. The most commonly encountered form is food data related to food products, which, in order to serve its purpose of sales and profits, is often distorted and manipulated for the marketing plans of producers and retailers. This causes the data to be often misinterpreted. There exist some studies addressing the problem of heterogeneous data by data normalization based on lexical similarity of the food products' English names.

   We took this task a step further by considering data in a non-English, low-resourced language: Slovenian. Working with such languages is challenging, as they have very limited resources and tools for Natural Language Processing (NLP). In our previously published work we considered different heuristics for matching food products: one based on lexical similarity, and two semantic similarity heuristics, i.e. based on word vector representations (embeddings). These data normalization approaches are evaluated on a data set with 439 ground truth pairs of food products, obtained by matching their EAN barcodes. In this work, we extend this approach by introducing a new semantic similarity heuristic, based on sentence vector embeddings. Additionally, we extend the evaluation by taking real-world examples and tasking a subject-matter expert to rate the relevance of the top three matches for each example. The results show that using semantic similarity with the sentence embedding method yields the best results, achieving 88% accuracy for the ground truth data set and 91% accuracy from the human expert evaluation, while the lexical similarity heuristic provides comparable results with 75% and 85% accuracy.

+* `Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline <https://aclanthology.org/W18-3012>`_ (aclanthology.org, 2018)
+
+  Using a random walk model of text generation, Arora et al. (2017) proposed a strong baseline for computing sentence embeddings: take a weighted average of word embeddings and modify with SVD. This simple method even outperforms far more complex approaches such as LSTMs on textual similarity tasks. In this paper, we first show that word vector length has a confounding effect on the probability of a sentence being generated in Arora et al.'s model. We propose a random walk model that is robust to this confound, where the probability of word generation is inversely related to the angular distance between the word and sentence embeddings. Our approach beats Arora et al.'s by up to 44.4% on textual similarity tasks and is competitive with state-of-the-art methods. Unlike Arora et al.'s method, ours requires no hyperparameter tuning, which means it can be used when there is no labelled data.
ingredient_parser/en/ModelCard.en.md (+1, -1)
@@ -124,7 +124,7 @@ The model has the following performance metrics:

 | Word level accuracy | Sentence level accuracy |
 | ------------------- | ----------------------- |
-| 97.84 ± 0.18%       | 94.68 ± 0.42%           |
+| 97.78 ± 0.18%       | 94.50 ± 0.42%           |

 These metrics were determined by executing 20 training/evaluation cycles and calculating the mean and standard deviation for the two metrics across all cycles. The uncertainty values provided represent the 99.7% confidence bounds (i.e. 3x standard deviation). The uncertainty is due to the randomisation of the selection of training and evaluation data whenever the model is trained.
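
A quick sketch of how bounds in this form can be derived from per-cycle scores; the numbers below are placeholders, not the actual evaluation results:

```python
import statistics

# Placeholder sentence-level accuracies from 20 training/evaluation
# cycles; illustrative values only.
scores = [94.4, 94.6, 94.5, 94.3, 94.7] * 4

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)  # sample standard deviation

# 3x the standard deviation gives the ~99.7% bounds quoted in the table.
print(f"{mean:.2f} ± {3 * stdev:.2f}%")
```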