Leichte Sprache Compound Segmentation Dataset

This dataset contains a sample of sentences in German Easy Language (Leichte Sprache) in which compounds are segmented, each paired with a version in which the compounds are unsegmented.

Format

Columns:

id: ID of the example.
source: Source of the text.
text_compounds_merged: Text of the example in which the split compounds have been merged by our merging heuristic.
text_compounds_split: Text of the example in which compound nouns are left split (not merged). Split verbs, e.g. "heraus-finden", were merged, however.
split_compounds: Compounds that appear split in text_compounds_split, as identified by our heuristic.
unsplit_compounds: Compounds in text_compounds_split that were not split. (The original text contains some unsplit compounds, mostly simple or common words such as "Mitglied" and "Gegenteil".)
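The columns above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: the function name read_examples is hypothetical, and it assumes the file is comma-separated, UTF-8 encoded, and has a header row with exactly the column names listed above. How multiple compounds are encoded inside split_compounds and unsplit_compounds is not documented here, so the sketch leaves those cells as raw strings.

```python
import csv

# Column names as listed in the Format section.
EXPECTED_COLUMNS = [
    "id",
    "source",
    "text_compounds_merged",
    "text_compounds_split",
    "split_compounds",
    "unsplit_compounds",
]


def read_examples(handle):
    """Read all rows from an open text file and check the header.

    Assumes a comma-separated file with a header row (an assumption,
    not confirmed by the dataset description). Every cell is returned
    as an unparsed string.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)
```

To use it, open the CSV file from this repository (simple_language_compound_split_hurraki.csv) with encoding="utf-8" and newline="", pass the file object to read_examples, and then compare row["text_compounds_merged"] with row["text_compounds_split"] for each returned row.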

Citation

If you use this dataset, please cite our publication:

Automatic Compound Segmentation for Leichte Sprache
Jesús Calvillo, Umesh Patil, Johann Seltmann, Anne-Kathrin Schumann
Proceedings of the KlarText Workshop on German Text Simplification & Readability Assessment, KONVENS 2025 (workshop proceedings, TBD)

Abstract:
In German "Easy Language" (Leichte Sprache), complex compound words are often orthographically segmented to facilitate perception and processing by marking their internal structure. This practice has been shown to facilitate reading comprehension, especially for readers with cognitive or reading impairments. We present a lightweight model that combines Compound Segmentation with Complex Word Identification (CWI) to automatically detect and split difficult compounds in text. We evaluate our system both on general compound segmentation and in the specific context of Leichte Sprache. Our results show that our model achieves high segmentation accuracy, outperforming both rule-based and much larger neural systems, identifying which compounds should be segmented. We also release a new evaluation dataset of Leichte Sprache sentences with segmented compounds.

For now, please use the following placeholder BibTeX entry:

@inproceedings{calvillo2025_compseg,
  title     = {Automatic Compound Segmentation for Leichte Sprache},
  author    = {Jes{\'u}s Calvillo and Umesh Patil and Johann Seltmann and Anne-Kathrin Schumann},
  booktitle = {Proceedings of the KlarText Workshop on German Text Simplification \& Readability Assessment, KONVENS 2025 (to appear)},
  year      = {2025},
  abstract  = {In German ``Easy Language'' (Leichte Sprache), complex compound words are often orthographically segmented to facilitate perception and processing by marking their internal structure. This practice has been shown to facilitate reading comprehension, especially for readers with cognitive or reading impairments. We present a lightweight model that combines Compound Segmentation with Complex Word Identification (CWI) to automatically detect and split difficult compounds in text. We evaluate our system both on general compound segmentation and in the specific context of Leichte Sprache. Our results show that our model achieves high segmentation accuracy, outperforming both rule-based and much larger neural systems, identifying which compounds should be segmented. We also release a new evaluation dataset of Leichte Sprache sentences with segmented compounds.}
}

License

This repository contains a selected sample of sentences from the Hurraki - Wörterbuch für Leichte Sprache.
It is licensed under the Creative Commons Attribution-ShareAlike 3.0 Germany (CC BY-SA 3.0 DE).
If you use this data, you must provide proper attribution and share derivative works under the same license.

