We recommend using Python 3.10 or higher for best compatibility. To install all required dependencies, run:
pip install -r requirements.txt
- Ukrainian Lexical Stress Prediction Model
- Ukrainian Phonemizer
- Ukrainian Lexical Stress Benchmark
- Wav2Vec2 with Lexical Stress
We provide a ByT5-based grapheme-to-phoneme model specialized for predicting lexical stress in Ukrainian words.
from src.accentor import UkrainianStressifier
stressifier = UkrainianStressifier()
print(stressifier.apply_stress_marks("Привіт, як у тебе справи?"))
- Architecture: ByT5 Grapheme-to-Phoneme model
- Training Data: Voice of America corpus, annotated with stress marks by an ASR Wav2Vec2 model
The Ukrainian Phonemizer converts Ukrainian text into phonemes.
from src.phonemizer import UkrainianPhonemizer
phonemizer = UkrainianPhonemizer()
print(phonemizer.phonemize("привіт світе"))
The Ukrainian Lexical Stress Benchmark is a manually annotated dataset created to evaluate lexical stress prediction systems in context.
Dataset location:
lexical_stress_benchmark/data/lexical_stress_dataset.csv
Each sentence marks stress with a + immediately after the stressed vowel. The dataset contains two columns:
- StressedSentence: Sentence with stress annotations
- Source: Origin (wiki, plug, or custom)
У+ ва+зі стоя+ли кві+ти.,custom
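The plus-sign notation in the example row above can be handled with a few small helpers. The following is a minimal sketch, not part of the repository: the function names (`strip_stress`, `plus_to_acute`, `stressed_vowel_positions`) are hypothetical, and the vowel inventory is assumed to be the ten standard Ukrainian vowel letters.

```python
import re

# Assumption: the ten Ukrainian vowel letters, lower and upper case.
UKRAINIAN_VOWELS = "аеєиіїоуюяАЕЄИІЇОУЮЯ"
COMBINING_ACUTE = "\u0301"


def strip_stress(text: str) -> str:
    """Remove every '+' stress mark, returning the plain sentence."""
    return text.replace("+", "")


def plus_to_acute(text: str) -> str:
    """Replace 'vowel+' with the vowel followed by a combining acute accent."""
    pattern = f"([{UKRAINIAN_VOWELS}])\\+"
    return re.sub(pattern, lambda m: m.group(1) + COMBINING_ACUTE, text)


def stressed_vowel_positions(word: str) -> list[int]:
    """Return 0-based indices (counting vowels only) of the stressed vowels."""
    positions = []
    vowel_index = -1
    previous = ""
    for ch in word:
        if ch in UKRAINIAN_VOWELS:
            vowel_index += 1
        elif ch == "+" and previous != "" and previous in UKRAINIAN_VOWELS:
            # A '+' counts only when it directly follows a vowel.
            positions.append(vowel_index)
        previous = ch
    return positions


sentence = "У+ ва+зі стоя+ли кві+ти."
print(strip_stress(sentence))
print(plus_to_acute(sentence))
print(stressed_vowel_positions("стоя+ли"))
```

Counting vowels rather than characters keeps the stress position independent of consonant clusters, which makes it easier to compare two annotations of the same word.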
| Statistic | Count |
|---|---|
| Total sentences | 1,026 |
| Unique word forms (incl. inflections, derivations) | 6,439 |
| Unique words with stress ambiguity (meaning or inflections) | 640 |
| Unique words with ≥2 stress forms in dataset | 296 |
- Wikipedia (300 sentences) — formal encyclopedic style
- Pluperfect GRAC (438 sentences) — fiction, journalism, poetry
- Custom (288 sentences) — manually balanced for ambiguous stress patterns
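The per-source breakdown above can be checked against the CSV with the standard library. This is a rough sketch that assumes the file has a header row naming the `StressedSentence` and `Source` columns. The helper `count_by_source` is hypothetical, and an in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def count_by_source(csv_file) -> Counter:
    """Count rows per value of the 'Source' column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Source"] for row in reader)


# In-memory sample; assumes a header row matching the documented column names.
sample = io.StringIO(
    "StressedSentence,Source\n"
    "У+ ва+зі стоя+ли кві+ти.,custom\n"
)
print(count_by_source(sample))

# With the real dataset (path taken from this README):
# with open("lexical_stress_benchmark/data/lexical_stress_dataset.csv",
#           encoding="utf-8", newline="") as f:
#     print(count_by_source(f))
```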
- Word-Level Accuracy
- Sentence-Level Accuracy
- Unambiguous Word Accuracy
- Ambiguous Word Accuracy
- Macro-Average F1 (Ambiguous Word Pairs)
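To make the first two metrics concrete, here is one plausible way to compute word-level and sentence-level accuracy. It is an illustrative sketch only, not the benchmark's actual implementation: the function name is hypothetical, words are assumed to be separated by whitespace, and a prediction whose word count differs from the reference is scored as entirely wrong.

```python
def word_and_sentence_accuracy(references, predictions):
    """Compare stressed sentences word by word.

    A word is correct when the predicted token (including '+' marks) equals
    the reference token; a sentence is correct when all of its words are.
    """
    total_words = 0
    correct_words = 0
    correct_sentences = 0
    for ref, pred in zip(references, predictions):
        ref_words = ref.split()
        pred_words = pred.split()
        total_words += len(ref_words)
        if len(ref_words) != len(pred_words):
            # Assumption: misaligned output counts as all words wrong.
            continue
        matches = sum(r == p for r, p in zip(ref_words, pred_words))
        correct_words += matches
        if matches == len(ref_words):
            correct_sentences += 1
    return {
        "word_accuracy": correct_words / total_words if total_words else 0.0,
        "sentence_accuracy": (
            correct_sentences / len(references) if references else 0.0
        ),
    }


refs = ["ва+зі стоя+ли", "кві+ти"]
preds = ["ва+зі стояли+", "кві+ти"]
print(word_and_sentence_accuracy(refs, preds))
```

Sentence-level accuracy is the stricter of the two: a single misplaced stress makes the whole sentence count as wrong, so it is always at most the word-level figure on sentences of equal weight.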
from lexical_stress_benchmark.benchmark import evaluate_stressification

def custom_stressify(text):
    """
    Add '+' after the stressed vowel in each stressed word.
    """
    # your implementation here
    return text

accuracies = evaluate_stressification(custom_stressify)
for metric, value in accuracies.items():
    print(f"{metric:40} {value * 100:.2f}%")
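As a sanity check before plugging in a real model, a trivial baseline can be passed to the evaluator. The sketch below always stresses the first vowel of every word. It is a deliberately naive, hypothetical baseline (the name `stress_first_vowel` is not part of the repository), useful mainly for confirming that the output format is accepted.

```python
import re

# Assumption: the ten Ukrainian vowel letters, lower and upper case.
VOWELS = "аеєиіїоуюяАЕЄИІЇОУЮЯ"

# A word is a run of letters, optionally joined by apostrophes (e.g. здоров'я).
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['ʼ’][^\W\d_]+)*")


def stress_first_vowel(text: str) -> str:
    """Insert '+' after the first vowel of each word; leave the rest untouched."""

    def mark(match: re.Match) -> str:
        word = match.group(0)
        for i, ch in enumerate(word):
            if ch in VOWELS:
                return word[: i + 1] + "+" + word[i + 1:]
        return word  # no vowel found, nothing to mark

    return WORD_PATTERN.sub(mark, text)


print(stress_first_vowel("Привіт, як у тебе справи?"))

# Passing the baseline to the benchmark (requires this repository):
# accuracies = evaluate_stressification(stress_first_vowel)
```

Any callable that takes a plain sentence and returns the same sentence with `+` marks can be swapped in the same way.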
This model transcribes Ukrainian speech and writes lexical stress marks directly into the transcription.
- Fine-tuned model on Hugging Face: mouseyy/uk_wav2vec2_with_stress_mark
- Training data: Common Voice corpus annotated with lexical stress from Ukrainian Word Stress and Ukrainian Accentor
- Common Voice: Rosana Ardila et al., LREC 2020 https://aclanthology.org/2020.lrec-1.520/
- Voice of America ASR: Yehor Smoliakov, 2022. Zenodo DOI
- PluG Corpus: https://github.com/Dandelliony/pluperfect_grac
- Wikimedia Dumps: https://dumps.wikimedia.org
- Dictionaries of Ukraine Online: https://lcorp.ulif.org.ua/dictua/