Dataset containing a sample of sentences in German Easy Language (Leichte Sprache) with segmented compounds paired with their unsegmented versions.
Columns:
- id: Id of the example.
- source: Source of the text.
- text_compounds_merged: Text of the example with the split compounds merged back into single words by our merging heuristic.
- text_compounds_split: Text of the example with split compound nouns left unmerged. Split verbs, e.g. "heraus-finden", were merged.
- split_compounds: Compounds that appear split in text_compounds_split, as identified by our heuristic.
- unsplit_compounds: Compounds in text_compounds_split that were not split. (The original text contains some unsplit compounds, mostly simple or common words like "Mitglied" and "Gegenteil".)
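To illustrate how the columns relate to each other, here is a minimal Python sketch. It rests on several assumptions that this description does not confirm: that the data can be read as tab-separated text, that compound parts are joined with a hyphen, and that several compounds in one cell would be comma-separated. The sample row is invented, and `naive_merge` is a hypothetical simplification, not the heuristic we used to build the dataset.

```python
import csv
import io

# Hypothetical sample: the column names come from the list above,
# but the row values and the tab-separated layout are invented.
SAMPLE_TSV = (
    "id\tsource\ttext_compounds_merged\ttext_compounds_split\t"
    "split_compounds\tunsplit_compounds\n"
    "0\texample\tDas Bundesland hat ein Mitglied.\t"
    "Das Bundes-Land hat ein Mitglied.\tBundes-Land\tMitglied\n"
)


def read_rows(handle):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def naive_merge(compound):
    """Join hyphen-separated parts and lowercase every part after the first.

    A deliberate simplification, not the merging heuristic used for the dataset.
    """
    head, *rest = compound.split("-")
    return head + "".join(part.lower() for part in rest)


def merge_text(text, split_compounds):
    """Replace each listed split compound in text with its merged form."""
    for compound in split_compounds:
        text = text.replace(compound, naive_merge(compound))
    return text


row = read_rows(io.StringIO(SAMPLE_TSV))[0]
split_compounds = row["split_compounds"].split(",")  # assumed separator
rebuilt = merge_text(row["text_compounds_split"], split_compounds)

# Check whether the naive merge reproduces the merged column for this row.
print(rebuilt == row["text_compounds_merged"])
```

On real rows the naive merge may disagree with text_compounds_merged, so treat a mismatch as a limit of this sketch rather than as an error in the data.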
If you use this dataset, please cite our publication:
Automatic Compound Segmentation for Leichte Sprache
Jesús Calvillo, Umesh Patil, Johann Seltmann, Anne-Kathrin Schumann
Proceedings of the KlarText Workshop on German Text Simplification & Readability Assessment, KONVENS 2025 (workshop proceedings, TBD)
Abstract:
In German "Easy Language" (Leichte Sprache), complex compound words are often orthographically segmented to facilitate perception and processing by marking their internal structure. This practice has been shown to facilitate reading comprehension, especially for readers with cognitive or reading impairments. We present a lightweight model that combines Compound Segmentation with Complex Word Identification (CWI) to automatically detect and split difficult compounds in text. We evaluate our system both on general compound segmentation and in the specific context of Leichte Sprache. Our results show that our model achieves high segmentation accuracy, outperforming both rule-based and much larger neural systems, identifying which compounds should be segmented. We also release a new evaluation dataset of Leichte Sprache sentences with segmented compounds.
For now, please use the following placeholder BibTeX entry:
@inproceedings{calvillo2025_compseg,
title = {Automatic Compound Segmentation for Leichte Sprache},
author = {Jes{\'u}s Calvillo and Umesh Patil and Johann Seltmann and Anne-Kathrin Schumann},
booktitle = {Proceedings of the KlarText Workshop on German Text Simplification \& Readability Assessment, KONVENS 2025 (to appear)},
year = {2025},
abstract = {In German ``Easy Language'' (Leichte Sprache), complex compound words are often orthographically segmented to facilitate perception and processing by marking their internal structure. This practice has been shown to facilitate reading comprehension, especially for readers with cognitive or reading impairments. We present a lightweight model that combines Compound Segmentation with Complex Word Identification (CWI) to automatically detect and split difficult compounds in text. We evaluate our system both on general compound segmentation and in the specific context of Leichte Sprache. Our results show that our model achieves high segmentation accuracy, outperforming both rule-based and much larger neural systems, identifying which compounds should be segmented. We also release a new evaluation dataset of Leichte Sprache sentences with segmented compounds.}
}
This repository contains a selected sample of sentences from the Hurraki - Wörterbuch für Leichte Sprache.
It is licensed under the Creative Commons Attribution-ShareAlike 3.0 Germany (CC BY-SA 3.0 DE).
If you use this data, you must provide proper attribution and share derivative works under the same license.