Resources for Croatian language modeling, classification and generation.
- The hr500k training corpus contains 506,457 Croatian tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities and dependency syntax. github paper
- reldi_hr - This dataset is based on 3,871 Croatian tweets that were segmented into sentences and tokens, and annotated with normalized forms, lemmas, MULTEXT-East tags (XPOS), UPOS tags, morphological features and named entities. huggingface paper
- SETimes_sr github
- The Croatian Parliamentary Spoken Dataset ParlaSpeech-HR 2.0 huggingface
- Offensive language dataset of Croatian comments FRENK 1.0 huggingface paper
- (Multi-lingual exYU) LLM evaluation QA github
- News topic modeling github
- ParlaSent - Sentiment identification in parliamentary proceedings of the Croatian, Bosnian and Serbian parliaments, with a 6-level annotation schema. paper
- The COPA-HR dataset
- BERTić, CroSloEngBERT, XLM-RoBERTa - huggingface, part of CLASSLA project
- YugoGPT - a 7B LLM for Croatian, Bosnian, Serbian and Montenegrin, reported as SOTA for these languages. github
- HR-LLM - Trained on BERTić + mC4. paper
- macocu_hbs, hr_news, mC4 (Multi-lingual), hrwac, classla_hr, cc100_hr, riznica, srwac, classla_sr, cc100_sr, bswac, classla_bs, cnrwac huggingface
- medical corpus A – MedCorA paper
- EUR-Lex 2/2016 parallel, EUR-Lex judgments parallel, MaCoCu Croatian Web v2, Open Parallel Corpus (OPUS), OpenSubtitles 2018, Riznica v0.1, CHILDES Croatian Corpus, Croatian parliamentary debates (ParlaMint 2.1), Croatian parliamentary debates (ParlaMint 2.1, CoNLL format), Croatian Web (hrWaC 2.2, ReLDI), Croatian Web (hrWaC 2.2, RFTagger), DGT-Translation Memory parallel – Croatian, ELEXIS Croatian Web 2020, ELEXIS Croatian Web 2020 (hrTenTen20) WSD sample, The CURLICAT Croatian corpus, SketchEngine Corpora www
- European Language Grid www
- Agić & Ljubešić 2010 Lemmatization and Morphosyntactic Tagging of Croatian and Serbian paper
- Samardžić 2017 Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages paper
- Ljubešić & Dobrovoljc 2019 What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian paper
- ...