
FeforParCorp

AnetteFrank edited this page Jun 19, 2006 · 31 revisions

Parallel Corpora for DELPH-IN

Collections/Samples of available parallel corpora

* Europarl Corpus

- URL: http://people.csail.mit.edu/koehn/publications/europarl/

- Samples of Europarl Corpus: http://www.dfki.de/~frank/Europarl_sample

- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv

- Size per language: 600-700k sentences

- Format: currently distributed over approx. 400 files

- Alignment: implicit, given by the file basename and the relative position of each sentence in the raw, sentence-separated ASCII files

- Todo: reformat the corpus, preferably as XML:

  • A <sent> element with one embedded element per language (da, ..., sv), each carrying attributes for sentence length (in tokens) and a reference to the original filename. This would make it easy to extract different test suites (by sentence length, by language, etc.).

- Please send me suggestions for formatting (frank@dfki.de)
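
One possible shape for the proposed XML format, offered only as a minimal sketch in Python. The root element name `corpus`, the attribute names `id`, `tokens` and `file`, the `lang/basename.txt` path layout, and the whitespace token count are all assumptions made for illustration; none of them is fixed by the proposal above. Sentences are paired purely by their position within files that share a basename, as described under Alignment.

```python
import xml.etree.ElementTree as ET


def build_sent_elements(basename, sentences_by_lang):
    """Pair sentences by line position and wrap each aligned group in <sent>.

    sentences_by_lang maps a language code (e.g. "de") to the list of
    sentences read from that language's file with the given basename.
    """
    langs = sorted(sentences_by_lang)
    counts = {len(sentences_by_lang[lang]) for lang in langs}
    if len(counts) != 1:
        # Positional alignment only works if every file has the same length.
        raise ValueError("files with the same basename differ in sentence count")

    root = ET.Element("corpus")  # hypothetical root element name
    for position in range(counts.pop()):
        sent = ET.SubElement(root, "sent", id=f"{basename}.{position + 1}")
        for lang in langs:
            text = sentences_by_lang[lang][position]
            child = ET.SubElement(
                sent,
                lang,
                tokens=str(len(text.split())),  # assumption: whitespace tokens
                file=f"{lang}/{basename}.txt",  # hypothetical path layout
            )
            child.text = text
    return root


def select(root, lang, max_tokens):
    """Extract a test suite: sentences of one language up to a given length."""
    return [el.text for el in root.iter(lang) if int(el.get("tokens")) <= max_tokens]


# Made-up sample data standing in for two aligned files.
sample = {
    "de": ["Das ist ein Test .", "Danke ."],
    "en": ["This is a test .", "Thank you ."],
}
root = build_sent_elements("sample-01", sample)
print(ET.tostring(root, encoding="unicode"))
print(select(root, "en", 3))
```

With length and filename stored as attributes, a test suite for a given language and length range reduces to a simple filter over the elements, as `select` shows.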
