
Parallel Corpora for Grammar Evaluation

During our recent DELPH-IN meeting in Berlin, many participants agreed on a joint exercise of creating parallel corpora/treebanks for multiple languages, in order to facilitate cross-lingual grammar evaluation. As a first step, participants may edit this page to link to their collected texts and provide a basic description of the data.

| Language | Participant Group | Description | Data Link |
|----------|-------------------|-------------|-----------|
| Catalan (ca) | Barcelona | TBA | |
| Chinese (zh) | Saarbrücken | TBA | |
| English (en) | Stanford/Oslo | TBA | |
| French (fr) | Toulouse | TBA | |
| German (de) | Saarbrücken | TBA | |
| Greek, Modern (el) | Saarbrücken/Athens | TBA | |
| Japanese (ja) | Kyoto | TBA | |
| Korean (ko) | Seoul | TBA | |
| Norwegian (no) | Trondheim | TBA | |
| Portuguese (pt) | Lisbon | TBA | |
| Spanish (es) | Barcelona | TBA | |
| Swedish (sv) | Linköping | TBA | |

Individual Reflections

Stephan Oepen

Obviously, establishing a parallel corpus within DELPH-IN would have many advantages, and it would likely pull participants and grammar development efforts more closely together. Each group has its own history, interests, and constraints imposed by how its work is funded; hence, we cannot expect everyone to focus their efforts on the same parallel corpus. But: (a) in some cases grammarians are relatively free to decide on their target domain, genre, etc., and it would be great if DELPH-IN could take advantage of such freedom and have multiple efforts work on comparable data; (b) since they are designed as resource grammars, typical DELPH-IN efforts tend to avoid specialization to a single domain or genre, so even for a project working on its own domain it may be beneficial to devote some additional effort to a different target text or texts, e.g. ones taken from the DELPH-IN parallel corpus; and, finally, (c) with growing interest in machine translation among participants, it will be much easier to build prototype systems and compare MRSs across grammars where there has been at least some development effort on a parallel corpus.

I would recommend the following criteria in looking for candidate texts:

  • Following the DELPH-IN philosophy, we should be able to share parallel texts freely among participants and redistribute results; it is possible to use copyrighted material, but then we should try to obtain permission from the copyright holder to redistribute freely.
  • To establish an initial parallel DELPH-IN corpus, we do not require large amounts of text; I would estimate that an initial collection of, say, 1000 sentences (translated across many or all of the languages in DELPH-IN) could be incredibly useful (see the sketch after this list). Ideally, we would draw on text where additional volume could be obtained if need be.
  • It would be nice for the parallel corpus to exemplify a certain degree of linguistic variation, e.g. to include at least some interrogatives and imperatives. At the same time, the style of the corpus should ideally reflect current language use and preferably not correspond to a very specialized variant (e.g. Wall Street Journal English of the 1980s or, arguably, transcripts of parliamentary debates).
  • If at all possible, we should look for texts that are already available in multiple languages, ideally as high-quality translations from a single source. For small-ish amounts of text, it may well be feasible to contract professional translations for additional languages. As we envision a growing parallel DELPH-IN corpus, it is desirable to have segments with different source languages, i.e. not everything should be translated from English originals.
  • Even though current grammars focus on sentential units, I believe that the successful application of DELPH-IN technology typically assumes a certain degree of fine-tuning to one target domain and genre. This is especially true for pre-processing and disambiguation: in stochastic parse selection, we have reason to expect that domain-specific models perform better, assuming some reasonably coherent notion of domain (and genre). Hence, I would recommend looking for running text.
  • To the extent that participants are interested in applications of DELPH-IN technology, I find it attractive to search for a parallel corpus whose domain and genre suggest a well-defined application. Granted, in the EU at least, parliamentary debates tend to be translated into many languages, whereas newspapers typically do not get translated; but I find parsing the Wall Street Journal (or the like) an artificial task, as it is hard to point to an NLP application in this spirit.
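
To make the idea of a small line-aligned collection concrete, here is a minimal sketch of how such a corpus could be stored and sanity-checked. The one-sentence-per-line layout and the file names (corpus.en.txt etc.) are hypothetical illustrations, not an agreed DELPH-IN convention:

```python
from pathlib import Path

# Hypothetical layout: one plain-text file per language, one sentence per
# line, where line i of every file is a translation of the same source
# sentence (e.g. corpus.en.txt, corpus.de.txt, corpus.ja.txt).
CORPUS_DIR = Path("parallel-corpus")
LANGUAGES = ["en", "de", "ja"]

def load_parallel_corpus(corpus_dir, languages):
    """Return a list of {language: sentence} dicts, one per corpus line."""
    texts = {}
    for lang in languages:
        path = corpus_dir / f"corpus.{lang}.txt"
        texts[lang] = path.read_text(encoding="utf-8").splitlines()

    # Line alignment only works if every translation has the same number
    # of segments, so flag any mismatch before pairing sentences up.
    lengths = {lang: len(lines) for lang, lines in texts.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"corpus files are not aligned: {lengths}")

    return [
        {lang: texts[lang][i] for lang in languages}
        for i in range(lengths[languages[0]])
    ]

if __name__ == "__main__":
    for item in load_parallel_corpus(CORPUS_DIR, LANGUAGES)[:3]:
        print(item)
```

A flat layout like this keeps translations trivially aligned by line number, which is arguably all the structure an initial collection of around 1000 sentences needs.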

At the Fefor and Berlin meetings, various text sources were proposed. There is at least one open-source advocacy text, [The Cathedral and the Bazaar](http://www.catb.org/~esr/writings/cathedral-bazaar/), freely available in many languages, even though it is not quite clear how to envision an application around this kind of text. Open-source software documentation is another candidate source of parallel text, and being able to process it would have obvious applied value. However, (computer) manuals are often not produced as direct translations, hence there may be limits to parallelism; furthermore, their linguistic variation may be somewhat restricted.

My personal favorites are tourism-related texts, e.g. the materials produced for international events (say, the Olympics or the World Cup) or large cities (Athens, Barcelona, Berlin, Lisbon, Oslo, Paris, San Francisco, Seoul, Kyoto, you name it). These are instructional texts ("how to get there", "where to stay", "how to get around") that are often prepared and translated with great care (and at great expense). The producers want such texts to be widely distributed, so obtaining permission to redistribute should be possible. And, over time and around the world, such texts are produced in many source languages.
