Across_Framework_Evaluation_Metrics
Minutes from the Across Framework Evaluation Metrics discussion on 23/08/07.
Intro: [http://www.coli.uni-saarland.de/~rdrid/delphinsummit/DELPH-INevaluation.ppt Yusuke Miyao's slides]
In the discussion about the difficulty of format conversion, it was pointed out that creating a new gold standard is also difficult, and much more time consuming. However, the conversion process might never get past 70 or 80%, no matter how much time is spent on it.
Dan's suggestion: assemble a test set and get all (?) deep NLP communities to create a gold standard in their own formalism. Then get together and discuss where we agree on the analysis, and at what level. Sentences/phenomena that we can't get agreement on will be removed from the released standard. Create annotations (on different levels?) that represent the analysis we agree on. This avoids the long tail: "linguistically uninteresting" phenomena can be left until after we have a base...
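A minimal sketch of the filtering step in Dan's suggestion, assuming each community's analysis has already been mapped to a shared set of dependency-like triples. That mapping was not specified in the discussion, and the names `agreed_standard` and `min_overlap` are hypothetical. The sketch keeps only the part of each analysis that every framework agrees on, and drops sentences with too little agreement:

```python
from typing import Dict, FrozenSet, Tuple

# Illustrative stand-in for "the analysis": (relation, head, dependent) triples.
# The real level of annotation was left open in the discussion.
Triple = Tuple[str, str, str]
Analysis = FrozenSet[Triple]


def agreed_standard(
    analyses: Dict[str, Dict[str, Analysis]],
    min_overlap: float = 0.5,  # hypothetical threshold, not from the minutes
) -> Dict[str, Analysis]:
    """Return, per sentence id, the triples that all frameworks agree on.

    `analyses` maps framework name -> sentence id -> analysis.
    Sentences without enough agreement are left out of the result.
    """
    if not analyses:
        return {}
    # Only consider sentences that every framework has annotated.
    shared_ids = set.intersection(*(set(a) for a in analyses.values()))
    released: Dict[str, Analysis] = {}
    for sid in sorted(shared_ids):
        per_framework = [a[sid] for a in analyses.values()]
        common = frozenset.intersection(*per_framework)
        union = frozenset.union(*per_framework)
        overlap = len(common) / len(union) if union else 0.0
        # Drop the sentence if nothing, or too little, is agreed on.
        if common and overlap >= min_overlap:
            released[sid] = common
    return released
```

In practice the agreement would come out of discussion rather than an exact set intersection, so this only illustrates the bookkeeping of what ends up in the released standard.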
There were comments about the motivation that competition provides. GR has motivated a lot of comparison and work. We want a similar effect, but with a measure capable of showing more complex information.
If we can measure "something", then we can see (and show) how abstract/grammar/parse improvements affect applications.
Lots of conversation about what data set we should use.
It is important (for various reasons) that everyone can do well, by some measure, on the test set; otherwise no one will use it.
Advantage of using sec 23: comparison with other results, particularly DepBank (from sec 23).
Disadvantage: we can't parse it at the moment. A suggestion was made to modify the sentences into a parsable form to get the 'correct' analyses, but there were concerns that we would then be contaminating it for future use.
Suggestion: Use section 0, which is always used for development. We should still be able to get comparisons with other systems, without making future use of sec 23 impossible.
It was suggested that it would be good to have QA (and other application-specific?) data in any new gold standard we develop, since the Penn Treebank contains very few questions.
Stephan: "parsing Wikipedia has a certain sex appeal"
Proposal to parse a selection of Wikipedia. We can select domains