kmihak/Croatian-Language-Modeling

Croatian-Language-Modeling

Resources for Croatian language modeling, classification and generation.

Datasets

NER

  • The hr500k training corpus contains 506,457 Croatian tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities and dependency syntax. github paper
  • reldi_hr - This dataset is based on 3,871 Croatian tweets that were segmented into sentences and tokens, and annotated with normalized forms, lemmas, MULTEXT-East tags (XPOS), UPOS tags, morphological features, and named entities. huggingface paper
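
Both datasets above carry token-level annotation layers (lemmas, UPOS, XPOS, morphological features, named entities). As a rough sketch of how such layers can be read, the stdlib-only reader below assumes a file in the standard ten-column CoNLL-U layout; the actual file format of each release should be checked against its own documentation. The `read_conllu` helper and the `SAMPLE` sentence are hypothetical illustrations, not data taken from hr500k or reldi_hr.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Hypothetical sample sentence; XPOS and FEATS are left unspecified ("_").
SAMPLE = "\n".join([
    "# text = Zagreb spava.",
    "1\tZagreb\tZagreb\tPROPN\t_\t_\t2\tnsubj\t_\t_",
    "2\tspava\tspavati\tVERB\t_\t_\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

sentences = list(read_conllu(SAMPLE))
lemmas = [token["lemma"] for token in sentences[0]]
```

If a release stores named-entity labels in the `misc` column or in an extra column, `FIELDS` and the column-count check would need adjusting accordingly.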

POS

Speech

  • The Croatian Parliamentary Spoken Dataset ParlaSpeech-HR 2.0 huggingface

Classification

QA, Winograd

  • (Multi-lingual exYU) LLM evaluation QA github

Topic Modeling

Sentiment

  • Parla sent - Sentiment identification in parliamentary proceedings of the Croatian, Bosnian, and Serbian parliaments paper. Uses a 6-level annotation schema.

COPA - Choice of Plausible Alternatives

Models

  • BERTić, CroSloEngBERT, XLM-RoBERTa - huggingface, part of CLASSLA project
  • YugoGPT - SOTA 7B LLM trained for the Croatian, Bosnian, Serbian, and Montenegrin languages github
  • HR-LLM - Trained on BERTić + mC4 paper

Corpora

  • macocu_hbs, hr_news, mC4 (Multi-lingual), hrwac, classla_hr, cc100_hr, riznica, srwac, classla_sr, cc100_sr, bswac, classla_bs, cnrwac huggingface
  • medical corpus A – MedCorA paper
  • EUR-Lex 2/2016 parallel, EUR-Lex judgments parallel, MaCoCu Croatian Web v2, Open Parallel Corpus (OPUS), OpenSubtitles 2018, Riznica v0.1, CHILDES Croatian Corpus, Croatian parliamentary debates (ParlaMint 2.1), Croatian parliamentary debates (ParlaMint 2.1, CoNLL format), Croatian Web (hrWaC 2.2, ReLDI), Croatian Web (hrWaC 2.2, RFTagger), DGT-Translation Memory parallel – Croatian, ELEXIS Croatian Web 2020, ELEXIS Croatian Web 2020 (hrTenTen20) WSD sample, The CURLICAT Croatian corpus, SketchEngine Corpora www
  • European Language Grid www

Papers

  • Agić & Ljubešić 2010 Lemmatization and Morphosyntactic Tagging of Croatian and Serbian paper
  • Samardžić 2017 Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages paper
  • Ljubešić & Dobrovoljc 2019 What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian paper
  • ...

Libraries

  • Terčon 2023 CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages: XPOS, UPOS, FEATS, ... python paper
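
A minimal usage sketch for the library above, assuming the `classla` Python package is installed (`pip install classla`) and that its `download`, `Pipeline`, and `to_conll` calls behave as described in the package README; check the current documentation before relying on it. The helper names `load_croatian_pipeline` and `annotate_to_conll` are hypothetical and exist only for this illustration.

```python
def load_croatian_pipeline():
    """Download the standard Croatian models and build a pipeline.

    Assumption: classla is installed and the models can be fetched
    over the network. The import is kept local so that this module
    can be imported without classla present.
    """
    import classla  # third-party package

    classla.download("hr")
    return classla.Pipeline("hr")


def annotate_to_conll(text, pipeline):
    """Run a pipeline over raw text and return CoNLL-U formatted output.

    `pipeline` is any callable that returns a document object exposing
    a `to_conll()` method, such as the object built above.
    """
    doc = pipeline(text)
    return doc.to_conll()
```

Typical use would be `nlp = load_croatian_pipeline()` followed by `print(annotate_to_conll("Zagreb je glavni grad Hrvatske.", nlp))`. The first call downloads model files, so it needs network access and some disk space.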

Projects

  • CLASSLA - CLARIN Knowledge Centre for South Slavic Languages www
  • TakeLab Retriever www
