
Across_Framework_Evaluation_Metrics


Minutes from the Across Framework Evaluation Metrics discussion on 23/08/07.

Intro: [http://www.coli.uni-saarland.de/~rdrid/delphinsummit/DELPH-INevaluation.ppt Yusuke Miyao's slides]

Parse Evaluation

In the discussion about the difficulty of format conversion, it was pointed out that creating a new gold standard is also difficult and time consuming (in fact, much more so). However, the conversion process might never get past 70 or 80%, no matter how much time is expended.

Dan's suggestion: assemble a test set and get all (?) deep NLP communities to create a gold standard in their own formalism. Then get together and discuss where we agree on the analysis, and at what level. Sentences or phenomena that we can't reach agreement on will be removed from the released standard. Create annotations (on different levels?) that represent the analyses we agree on. This avoids the long tail; "linguistically uninteresting" phenomena can be left until after we have a base...

There were comments about the motivation supplied by having a competition. GR evaluation has motivated a lot of comparison and work; we want a similar effect, but one capable of showing more complex information.

If we can measure "something", then we can see (and show) how abstract/grammar/parse improvements affect applications.
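As a rough illustration of the kind of measurement discussed here, GR-style evaluation essentially reduces to precision, recall and F-score over sets of relation triples. The sketch below is not from the minutes: the `prf` function and the example triples are hypothetical, and real GR scoring schemes add further detail (e.g. a relation hierarchy).

```python
# Minimal sketch (illustrative only): score a system's GR-style output
# against a gold standard, where each analysis is a set of
# (relation, head, dependent) triples.

def prf(gold, system):
    """Return precision, recall and F1 of system triples against gold triples."""
    gold, system = set(gold), set(system)
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example triples for one sentence.
gold = {("subj", "bark", "dog"), ("det", "dog", "the")}
system = {("subj", "bark", "dog"), ("obj", "bark", "cat")}
print(prf(gold, system))  # (0.5, 0.5, 0.5)
```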

Lots of conversation about what data set we should use.

It is important (for various reasons) that everyone can do well, by some measurement, on the test set; otherwise no one will use it.

WSJ