FeforParCorp
AnetteFrank edited this page Jun 19, 2006 · 31 revisions
* Europarl Corpus
- URL: http://people.csail.mit.edu/koehn/publications/europarl/
- [http://www.dfki.de/~frank/Europarl_sample Samples of Europarl Corpus]
- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv
- Size per language: 600-700k sentences
- Format: currently distributed over approx. 400 files
- Alignment: implicit, given by the basename of each file and the relative position of each sentence in the raw, sentence-separated ASCII files
- Todo: reformatting, preferably in XML:
- A <sent> element with one embedded element per language (da, ..., sv), each specifying attributes for the sentence length (in tokens) and a reference to the original filename. This way, one could easily extract different test suites (by sentence length, by language, etc.).
- Please send me suggestions for formatting (frank@dfki.de)
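
As a starting point for discussion, the proposed <sent> format could be sketched roughly as follows. This is only a minimal sketch and not an agreed format: the root element `corpus`, the attribute names `id`, `length` and `file`, the `<lang>/<basename>` path layout, the function names, and the whitespace-based token count are all assumptions, and the sample sentences and filename are made-up placeholders.

```python
import xml.etree.ElementTree as ET


def build_aligned_xml(basename, sentences_by_lang):
    """Build a <corpus> tree from position-aligned sentence lists.

    sentences_by_lang maps a language code (e.g. "de") to the list of
    sentences read from that language's file. Index i in every list is
    assumed to hold the same sentence (alignment by relative position).
    """
    counts = {len(sents) for sents in sentences_by_lang.values()}
    if len(counts) > 1:
        raise ValueError("files are not aligned: differing sentence counts")
    n = counts.pop() if counts else 0

    root = ET.Element("corpus")  # hypothetical root element name
    for i in range(n):
        sent = ET.SubElement(root, "sent", id=str(i + 1))
        for lang in sorted(sentences_by_lang):
            text = sentences_by_lang[lang][i]
            el = ET.SubElement(
                sent,
                lang,
                # Assumption: tokens are whitespace-separated.
                length=str(len(text.split())),
                # Assumption: one directory per language, shared basename.
                file=f"{lang}/{basename}",
            )
            el.text = text
    return root


def select(root, lang, min_len, max_len):
    """Return the sentences of one language within a token-length range."""
    return [
        el.text
        for el in root.iter(lang)
        if min_len <= int(el.get("length")) <= max_len
    ]


# Made-up placeholder data, only to show the shape of the output.
sample = {
    "en": ["This is a short example .", "Thank you ."],
    "de": ["Dies ist ein kurzes Beispiel .", "Danke ."],
}
tree = build_aligned_xml("session-001.txt", sample)
print(ET.tostring(tree, encoding="unicode"))
print(select(tree, "en", 4, 10))
```

Storing the token count as an attribute means a test suite for a given language and length range can be pulled out with a simple filter such as `select`, without re-tokenizing the text.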