Subjective Text Complexity Corpus for German [Paper]
A corpus consisting of German sentences, annotated with subjective complexity ratings by two target groups.
322 sentences annotated with complexity ratings from (1) experts and (2) non-experts on a 5-point Likert scale (1 = very easy to 5 = very complex).
The data comes from DATEV, a German IT service provider for tax consultants, auditors, and lawyers. The sentences were extracted from 232 documents containing instructions, commentaries, and descriptions addressed to employees of the service provider as well as external users of the system. They often describe technical solutions for the company's products or give more detailed descriptions of legal regulations affecting the company's clients.
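As a rough illustration of how the two rating groups could be compared, the sketch below averages ratings per group for each sentence and computes the difference between them. The record layout, the field names (`sentence`, `expert`, `non_expert`), and the example values are assumptions made up for this sketch. They are not taken from the corpus, and the actual file format may differ.

```python
from statistics import mean

# Hypothetical records: field names and values are illustrative only
# and do not come from the corpus.
records = [
    {"sentence": "Beispielsatz eins.", "expert": [2, 3, 2], "non_expert": [4, 4, 5]},
    {"sentence": "Beispielsatz zwei.", "expert": [1, 1, 2], "non_expert": [2, 3, 2]},
]


def validate(ratings):
    """Check that every rating lies on the 5-point scale (1 to 5)."""
    for rating in ratings:
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError(f"rating outside the 1-5 scale: {rating!r}")


def mean_ratings(record):
    """Return the mean rating per group for one sentence."""
    validate(record["expert"])
    validate(record["non_expert"])
    return {
        "expert": mean(record["expert"]),
        "non_expert": mean(record["non_expert"]),
    }


def rating_gap(record):
    """Non-expert mean minus expert mean; positive means non-experts rated it harder."""
    means = mean_ratings(record)
    return means["non_expert"] - means["expert"]


if __name__ == "__main__":
    for record in records:
        print(record["sentence"], mean_ratings(record), round(rating_gap(record), 2))
```

A positive gap in this sketch would indicate a sentence that non-experts found harder than experts did. Whether a mean is the right aggregate for ordinal Likert ratings is a design choice; a median would be a reasonable alternative.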
If you find the code or dataset helpful, please cite the following paper:
@inproceedings{seiffe-etal-2022-subjective,
title = "Subjective Text Complexity Assessment for {G}erman",
author = {Seiffe, Laura and
Kallel, Fares and
M{\"o}ller, Sebastian and
Naderi, Babak and
Roller, Roland},
editor = "Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.74/",
pages = "707--714"
}
The code is released under the terms of the CC-BY-4.0 license.