docs/source/explanation/features.rst (1 addition, 35 deletions)
@@ -107,11 +107,7 @@ Semantic features

 Semantic features are determined from the meaning of the token.
 In practice this means making use of word embeddings, which are a method to encode a word as a numeric vector in such a way that the vectors for words with similar meanings are clustered close together.

-An embeddings model has been trained using `floret <https://github.com/explosion/floret>`_ from the same data used to train the sequence tagging model.
-This model encodes words as 10-dimensional vectors (chosen to reduce the size of the model).
-For each token, the corresponding 10-dimensional vector can be calculated and used as a feature.
-
-Due to limitations of the `python-crfsuite <https://github.com/scrapinghub/python-crfsuite>`_ library, which cannot make use of features that are lists, each dimension of the vector is turned into a separate feature.
+Currently, semantic features are not used by the parser model, but this is being investigated.
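For illustration, the removed paragraph above describes flattening an embedding vector into per-dimension scalar features. A minimal sketch of that idea, assuming a trained floret model loaded through the floret Python bindings (the model path, token values, and helper name are illustrative assumptions, not the project's actual code):

.. code-block:: python

    import floret  # Python bindings for the floret embeddings library

    # Hypothetical path; assumes a 10-dimensional model as described above.
    model = floret.load_model("embeddings.floret.bin")

    def vector_features(token: str, prefix: str = "") -> dict:
        """Flatten a token's embedding into one scalar feature per dimension,
        since python-crfsuite cannot accept list-valued features."""
        vector = model.get_word_vector(token)
        return {f"{prefix}v{i}": float(v) for i, v in enumerate(vector)}

    features = {"word_shape": "xxx", **vector_features("zest")}  # v0 ... v9
    features.update(vector_features("fresh", prefix="prev_"))    # prev_v0 ... prev_v9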

 Example
 ^^^^^^^
@@ -136,16 +132,6 @@ Below is an example of the features generated for one of the tokens in an ingredient sentence.

     'is_after_comma': False,
     'is_after_plus': False,
     'word_shape': 'xxx',
-    'v0': -0.139836490154,
-    'v1': 0.335813522339,
-    'v2': 0.772642672062,
-    'v3': -0.165960505605,
-    'v4': 0.16534408927,
-    'v5': -0.356404691935,
-    'v6': 0.335878640413,
-    'v7': -0.614531040192,
-    'v8': 0.474092006683,
-    'v9': -0.137665584683,
     'prev_stem': '!num',
     'prev_pos': 'CD+NN',
     'prev_is_capitalised': False,
@@ -156,16 +142,6 @@ Below is an example of the features generated for one of the tokens in an ingredient sentence.

     'prev_is_after_comma': False,
     'prev_is_after_plus': False,
     'prev_word_shape': '!xxx',
-    'prev_v0': -0.228524670005,
-    'prev_v1': 0.118124544621,
-    'prev_v2': 0.474654018879,
-    'prev_v3': 0.006919545121,
-    'prev_v4': 0.293126374483,
-    'prev_v5': -0.280303806067,
-    'prev_v6': 0.479749411345,
-    'prev_v7': -0.370705068111,
-    'prev_v8': -0.055196929723,
-    'prev_v9': -0.28187289834,
     'next_stem': 'orang',
     'next_pos': 'NN+NN',
     'next_is_capitalised': False,

@@ -176,16 +152,6 @@ Below is an example of the features generated for one of the tokens in an ingredient sentence.
@@ -69,7 +71,8 @@ In these cases, we still want to select the correct entry.

 The approach taken attempts to match ingredient names to :abbr:`FDC(Food Data Central)` entries based on semantic similarity, that is, selecting the entry that is closest in meaning to the ingredient name even where the words used are not identical.
 Two semantic matching techniques are used, based on [Ethayarajh]_ and [Morales-Garzón]_.
-Both techniques make use of the word embeddings model that is also used to provide semantic features for the parser model.
+Both techniques make use of a word embeddings model.
+A `GloVe <https://nlp.stanford.edu/projects/glove/>`_ embeddings model, trained on text from a large corpus of recipes, is used to provide the information for the semantic similarity techniques.
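As a rough sketch of what using such a model involves (not the package's actual loading code), GloVe's plain-text distribution format, one word followed by its vector per line, can be read into a word-to-vector mapping; the file name here is an assumption:

.. code-block:: python

    import numpy as np

    def load_glove(path: str) -> dict:
        """Read a GloVe plain-text file: a word then its vector on each line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                vectors[word] = np.asarray(values, dtype=np.float32)
        return vectors

    embeddings = load_glove("recipe_corpus.glove.txt")  # hypothetical file name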

 Unsupervised Smooth Inverse Frequency
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -82,13 +85,8 @@ This approach is applied to the descriptions for each of the :abbr:`FDC(Food Data Central)` entries.

 The best match is selected using the cosine similarity metric.

 In practice, this technique is generally pretty good at finding a reasonable matching :abbr:`FDC(Food Data Central)` entry.
-However, in some cases the best matching entry is not an appropriate match.
-There may be two causes for this:
-
-1. The quality of the embeddings is not good enough for this technique to be more robust.
-
-2. Not removing the common component between vectors is causing worse performance than if it was removed.
+However, in some cases the match with the best score is not an appropriate match.
+This is likely due to limitations in the quality of the embeddings used.
90
93
91
Fuzzy Document Distance
94
92
~~~~~~~~~~~~~~~~~~~~~~~
@@ -104,7 +102,7 @@ Combined

 The two techniques are combined to perform the matching of an ingredient name to an :abbr:`FDC(Food Data Central)` entry.

-First, :abbr:`uSIF(Unsupervised Smooth Inverse Frequency)` is used to down select a list of candidate matches from the full set of :abbr:`FDC(Food Data Central)` entries.
+First, :abbr:`uSIF(Unsupervised Smooth Inverse Frequency)` is used to down select a list of *n* candidate matches from the full set of :abbr:`FDC(Food Data Central)` entries.

 Second, the fuzzy document distance is calculated for the down selected candidate matches.
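A sketch of the two-stage combination described above, with ``similarity`` and ``distance`` standing in for the uSIF and fuzzy document distance implementations; the function names and default value of *n* are assumptions, not the library's API:

.. code-block:: python

    def match_entry(name: str, entries: list, similarity, distance, n: int = 10):
        """Two-stage matching: down select with uSIF, then rank by fuzzy distance."""
        # Stage 1: keep the n entries most semantically similar to the name.
        candidates = sorted(entries, key=lambda e: similarity(name, e), reverse=True)[:n]
        # Stage 2: of those candidates, return the entry at the smallest distance.
        return min(candidates, key=lambda e: distance(name, e))

    # Usage (hypothetical scoring functions):
    # match_entry("red onion", fdc_descriptions, usif_similarity, fuzzy_document_distance)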
@@ -120,7 +118,7 @@ The current implementation has some limitations.

 Work is ongoing to improve this, and suggestions and contributions are welcome.

 #. Enabling this functionality is much slower than when not enabled.
-   The foundation foods functionality is roughly 80x slower.
+   When enabled, parsing a sentence is roughly 75x slower than when it is disabled.