This is a small treebank of grammatical examples for Karakalpak. It is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages, designed to facilitate cross-linguistic research on these related languages.
The Karakalpak-TueCL treebank is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages (Turkish - UD_Turkish-TueCL, Azerbaijani - UD_Azerbaijani-TueCL, Kyrgyz - UD_Kyrgyz-TueCL, and Uzbek - UD_Uzbek-TueCL), designed to facilitate cross-linguistic research on these related languages.
- Total sentences: 173
- Total tokens:
- Unique word forms (types):
- Unique lemmas:
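The counts listed above can be recomputed from the CoNLL-U data. The following is a minimal sketch, not the script used by the treebank authors. It assumes standard 10-column CoNLL-U input, counts syntactic words (skipping multiword-token ranges and empty nodes), and treats word forms as case-sensitive. The function name `conllu_stats` and the toy English rows are illustrative only and are not taken from the treebank.

```python
def conllu_stats(text):
    """Count sentences, tokens, unique forms and unique lemmas in CoNLL-U text."""
    sentences = tokens = 0
    forms, lemmas = set(), set()
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):      # sentence-level comment / metadata line
            continue
        cols = line.split("\t")
        if len(cols) != 10:           # not a well-formed token line; ignore it
            continue
        if "-" in cols[0] or "." in cols[0]:
            continue                  # skip multiword-token ranges and empty nodes
        if not in_sentence:           # first word line of a new sentence
            sentences += 1
            in_sentence = True
        tokens += 1
        forms.add(cols[1])            # FORM column (case-sensitive, an assumption)
        lemmas.add(cols[2])           # LEMMA column
    return {
        "sentences": sentences,
        "tokens": tokens,
        "types": len(forms),
        "lemmas": len(lemmas),
    }


# Toy input (hypothetical rows, not treebank data).
toy = "\n".join([
    "# sent_id = toy-1",
    "\t".join(["1", "A", "a", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "dog", "dog", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
])
print(conllu_stats(toy))
```

To run it on the actual data, read the treebank's `.conllu` file with `open(path, encoding="utf-8").read()` and pass the resulting string to `conllu_stats`.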
The Karakalpak-TueCL treebank consists of 173 carefully selected sentences compiled from multiple sources, including the Cairo corpus (20 sentences), the UDTW23 corpus (20 sentences), and 97 additional examples illustrating specific grammatical constructions of interest. It serves as a source treebank for a parallel corpus spanning four Turkic languages from distinct branches of the family: Turkish and Azerbaijani (Oghuz), Kyrgyz (Kipchak), and Uzbek (Karluk).
The treebank includes various syntactic phenomena relevant to Turkic languages, such as pro-drop constructions, auxiliary chains, postverbal structures, and non-canonical word orders. Each sentence has been manually annotated following UD guidelines, with particular attention to morphosyntactic features that highlight both shared typological characteristics and language-specific traits. Glossing, transliteration, and translations of all sentences are provided in Azerbaijani, Turkish, Uzbek, and English as metadata to support comparative research.
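Sentence-level metadata in CoNLL-U files is stored as `# key = value` comment lines preceding each sentence. The sketch below shows one way to collect those comments per sentence so that glosses, transliterations, and translations can be compared side by side. The function name `sentence_metadata` and the keys in the usage note (`text_en` and so on) are assumptions for illustration; the exact keys used in this treebank should be checked against the data files.

```python
def sentence_metadata(text):
    """Yield one dict of '# key = value' comments per sentence block."""
    meta = {}
    seen_any = False
    for line in text.splitlines():
        if not line.strip():              # a blank line closes a sentence block
            if seen_any:
                yield meta
            meta, seen_any = {}, False
            continue
        seen_any = True
        if line.startswith("#") and "=" in line:
            # Split on the first '=' only, so values may themselves contain '='.
            key, value = line[1:].split("=", 1)
            meta[key.strip()] = value.strip()
    if seen_any:                          # file did not end with a blank line
        yield meta
```

With this in place, a call such as `[m.get("text_en") for m in sentence_metadata(data)]` would list the English translations, assuming the hypothetical key `text_en` is the one the files actually use.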
Dependency relations, glossing, lemmatization, morphological features, POS tagging, tokenization, and transliteration were manually annotated.
This resource is significant because it belongs to the first fully aligned parallel UD treebanks for these Turkic languages, enabling systematic cross-linguistic comparisons previously hindered by the lack of parallel resources. The treebank supports research in comparative Turkic syntax, cross-lingual parsing, and language education.
- (citation)
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since:
License: CC BY-SA 4.0
Includes text: yes
Genre: grammar-examples
Lemmas: manual native
UPOS: manual native
XPOS: not available
Features: manual native
Relations: manual native
Contributors:
Contributing: here
Contact:
===============================================================================