Fairhaven2022 Linking ERG and Wordnet
Michael Wayne Goodman edited this page Jul 18, 2022
Moderator: Dan
Slides: ergwordnet.pdf
- Dan: We are aware that the ERG has a lexical coverage gap. Petter, how many does your grammar have?
- Petter: Around 75k
- Dan: Yeah, some of these grammars have made efforts at lexicon expansion. We also use techniques like POS tagging to get around the gaps. Compared to some dictionaries approaching 1 million entries, how to define what a word is gets messy. A real issue with lexical gaps is generation; we don't have the same machinery as with parsing. We've done some work here with proper names, but it doesn't work in the general case, for reasons that Woodley or Stephan will remind me of. Our POS technique generates predicates like `devined/VBD_u_unknown`, with a convention on the predicate format, which we know how to deal with. But Manning (2011) notes that 97% tagging accuracy isn't great: it means about 1 in 2 sentences has a tagging error. So I don't like this solution for parsing, and I really don't like the situation for generation. This is not a good base for downstream work. Furthermore, a word like *devine* has interesting syntactic behaviors, which means I need to record something in the grammar to handle them properly.
- Dan: Enter the WordNet: 155K words in 176K synsets, 207K word-sense pairs. There are other lexical resources, but I like WordNet: Francis is working on it, etc. It's an ambitious goal to map the WordNet to the ERG; I estimate about a year. If anyone has ideas about how to speed this up, I'm happy to hear them.
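Dan's "1 in 2 sentences" point can be sanity-checked with a quick calculation. A minimal sketch, assuming independent per-token errors and a 20-token average sentence length (both simplifying assumptions, not figures from the meeting):

```python
# With per-token tagging accuracy p and sentence length n,
# the chance a sentence is tagged entirely correctly is p**n
# (assuming errors are independent, which is a simplification).
per_token_accuracy = 0.97
sentence_length = 20  # assumed average length

clean = per_token_accuracy ** sentence_length
print(f"P(sentence fully correct)  = {clean:.2f}")
print(f"P(at least one tag error) = {1 - clean:.2f}")
```

At 20 tokens this gives roughly a 46% chance of at least one error per sentence, i.e. close to "1 in 2".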
- Dan: There are mismatches between WordNet and the ERG's lexical entries. I'm worried about proper names. We decided a while ago to reduce the number of proper names in the ERG. They are not interesting linguistically and it's an open class. There are some things, like San Francisco which turns into a noun-noun compound of San and Francisco. So how do we decide what to include?
- Dan: Naming semantic predicates is another issue. We have the word *lie* as in *He lies all the time* and *The book lies on the table*, which look the same but whose past tenses differ (*lied* vs. *lay*). So we'll need two different predicates for these. Unlike, for instance, the financial *bank* and river *bank*, which always behave the same.
- Dan: There are a couple of risks. First: some processing engines may not work well with 150K+ entry lexicons. John (Carroll), do you think the LKB would work with a lexicon of this size?
- John: Should work
- Dan: And Woodley? Glenn? Mike? Francis?
- All: no issue
- Glenn: One thing is the docstrings, but you can discard those when loading. The files may be large, though.
- Dan: Ok good, so let's pretend this problem doesn't exist. The next issue: the parse selection may do worse on rarer unknown words. Woodley, any thoughts on this?
- Woodley: not sure
- Emily: So you think this will cause issues with backoff? You might find a frequency threshold below which you want to lump things together for parse-selection purposes.
- Dan: Maybe, I don't understand that though.
- Francis: But it only requires the syntactic type.
- Woodley: Even if a treebanker doesn't need to distinguish anything, the parse selection might.
- Francis: So Emily is saying that we might just replace anything that appears only once with an `UNK` token.
- Dan: Ok, so there may be some techniques to deal with this.
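The backoff Francis describes can be sketched in a few lines. This is an illustrative sketch only; the function name, threshold, and toy corpus are assumptions, not part of any DELPH-IN tool:

```python
from collections import Counter

def replace_rare_tokens(sentences, min_count=2, unk="UNK"):
    """Replace tokens seen fewer than min_count times with an UNK
    symbol, a common backoff for statistical parse-selection models.
    (Name and threshold are illustrative, not from the ERG.)"""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else unk for tok in sent]
            for sent in sentences]

corpus = [["the", "cat", "slept"], ["the", "dog", "slept"]]
print(replace_rare_tokens(corpus))
# [['the', 'UNK', 'slept'], ['the', 'UNK', 'slept']]
```

Hapax legomena ("cat", "dog") collapse into one `UNK` class, so the model pools their statistics instead of estimating from a single observation.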
- Glenn: What about homonyms?
- Dan: The tagger should be able to handle this using context.