Ukrainian TTS Preprocessing

We recommend using Python 3.10 or higher for best compatibility. To install all required dependencies, run:

pip install -r requirements.txt

Ukrainian Lexical Stress Prediction Model

We provide a ByT5-based grapheme-to-phoneme model specialized for predicting lexical stress in Ukrainian words.

Quickstart: Predict Lexical Stress

from src.accentor import UkrainianStressifier

stressifier = UkrainianStressifier()

print(stressifier.apply_stress_marks("Привіт, як у тебе справи?"))

Model Highlights

Architecture: ByT5 Grapheme-to-Phoneme model
Training Data: Voice of America corpus, with stress marks annotated automatically by a Wav2Vec2 ASR model

Ukrainian Phonemizer

The Ukrainian Phonemizer converts Ukrainian text into phonemes.

Usage Example

from src.phonemizer import UkrainianPhonemizer

phonemizer = UkrainianPhonemizer()

print(phonemizer.phonemize("привіт світе"))

Ukrainian Lexical Stress Benchmark

The Ukrainian Lexical Stress Benchmark is a manually annotated dataset created to evaluate lexical stress prediction systems in context.

Dataset location:

lexical_stress_benchmark/data/lexical_stress_dataset.csv

Dataset Format

The dataset is a CSV file with one sentence per row. Stress is marked with a + immediately after each stressed vowel. The file contains two columns:

StressedSentence: Sentence with stress annotations
Source: Origin (wiki, plug, or custom)
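The two-column layout above can be read with Python's standard csv module. This is a minimal sketch, not part of the repository: the helper name read_benchmark_rows is hypothetical, and the code assumes the file starts with a header row naming the StressedSentence and Source columns. An in-memory sample stands in for the real file so the snippet runs on its own.

```python
import csv
import io


def read_benchmark_rows(handle):
    """Yield (stressed_sentence, source) pairs from an open CSV handle.

    Assumption: the first row is a header with the column names
    StressedSentence and Source.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        yield row["StressedSentence"], row["Source"]


# In-memory stand-in for lexical_stress_benchmark/data/lexical_stress_dataset.csv
sample_csv = io.StringIO(
    "StressedSentence,Source\n"
    "У+ ва+зі стоя+ли кві+ти.,custom\n"
)

rows = list(read_benchmark_rows(sample_csv))
for sentence, source in rows:
    print(f"{source}: {sentence}")
```

To read the real dataset, open the file with encoding="utf-8" and newline="" and pass the handle to the same function.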

Sample Entry

У+ ва+зі стоя+ли кві+ти.,custom
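To illustrate the + notation, the sketch below strips the marks to recover the plain sentence and, alternatively, converts them to a combining acute accent (U+0301) for display. The helper names strip_stress and to_acute are hypothetical and not part of the repository. A + counts as a stress mark only when it directly follows a Ukrainian vowel, so an ordinary plus sign such as the one in "2+2" is left alone.

```python
import re

UKRAINIAN_VOWELS = "аеєиіїоуюяАЕЄИІЇОУЮЯ"

# A '+' is treated as a stress mark only when it directly follows a vowel.
STRESS_MARK = re.compile(rf"(?<=[{UKRAINIAN_VOWELS}])\+")


def strip_stress(text):
    """Remove stress marks, returning the plain sentence."""
    return STRESS_MARK.sub("", text)


def to_acute(text):
    """Replace each stress mark with a combining acute accent (U+0301)."""
    return STRESS_MARK.sub("\u0301", text)


entry = "У+ ва+зі стоя+ли кві+ти."
print(strip_stress(entry))
print(to_acute(entry))
```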

Dataset Statistics

Statistic	Count
Total sentences	1,026
Unique word forms (incl. inflections, derivations)	6,439
Unique words with stress ambiguity (meaning or inflections)	640
Unique words with ≥2 stress forms in dataset	296

Sources

Wikipedia (300 sentences): formal encyclopedic style
PluG, the Pluperfect GRAC corpus (438 sentences): fiction, journalism, poetry
Custom (288 sentences): manually balanced for ambiguous stress patterns

Evaluation Metrics

Word-Level Accuracy
Sentence-Level Accuracy
Unambiguous Word Accuracy
Ambiguous Word Accuracy
Macro-Average F1 (Ambiguous Word Pairs)
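As a rough illustration of the first two metrics, the sketch below computes word-level and sentence-level accuracy from gold and predicted sentences in the + notation. It is one plausible reading of the metric names, not the benchmark's actual implementation: the function names, whitespace tokenization, and position-by-position word alignment are all assumptions. The example data reuses the sample entry, with one deliberately wrong prediction.

```python
def word_accuracy(gold_sentences, predicted_sentences):
    """Fraction of gold words whose predicted form matches exactly."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        gold_words = gold.split()
        predicted_words = predicted.split()
        # Compare words position by position.
        for gold_word, predicted_word in zip(gold_words, predicted_words):
            total += 1
            correct += gold_word == predicted_word
        # Gold words missing from a shorter prediction count as errors.
        total += max(0, len(gold_words) - len(predicted_words))
    return correct / total if total else 0.0


def sentence_accuracy(gold_sentences, predicted_sentences):
    """Fraction of sentences predicted without any error."""
    pairs = list(zip(gold_sentences, predicted_sentences))
    if not pairs:
        return 0.0
    return sum(gold == predicted for gold, predicted in pairs) / len(pairs)


gold = ["У+ ва+зі стоя+ли кві+ти.", "У+ ва+зі стоя+ли кві+ти."]
# The second prediction deliberately misplaces the stress in the last word.
predicted = ["У+ ва+зі стоя+ли кві+ти.", "У+ ва+зі стоя+ли квіти+."]

print(f"word accuracy:     {word_accuracy(gold, predicted):.3f}")
print(f"sentence accuracy: {sentence_accuracy(gold, predicted):.3f}")
```

One wrong word out of eight lowers word-level accuracy only slightly, while sentence-level accuracy drops by half, which is why the two metrics are reported separately.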

Quickstart: Run the Benchmark

from lexical_stress_benchmark.benchmark import evaluate_stressification

def custom_stressify(text):
    """
    Add '+' after the stressed vowel in each stressed word.
    """
    # your implementation here
    return text

accuracies = evaluate_stressification(custom_stressify)
for metric, value in accuracies.items():
    print(f"{metric:40} {value * 100:.2f}%")
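For a sense of what a stressifier passed to the benchmark might look like, here is a toy dictionary-based baseline. It is a sketch under stated assumptions: TOY_LEXICON and lexicon_stressify are hypothetical names, and the lexicon holds only the words of the sample entry. A real system would need far broader coverage, and a plain lookup cannot resolve words whose stress depends on context.

```python
import re

# Hypothetical toy lexicon: lowercase unstressed form -> stressed form.
TOY_LEXICON = {
    "у": "у+",
    "вазі": "ва+зі",
    "стояли": "стоя+ли",
    "квіти": "кві+ти",
}

# A word is a run of letters, optionally joined by apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:['ʼ’][^\W\d_]+)*")


def lexicon_stressify(text):
    """Add '+' after the stressed vowel of every word found in the lexicon."""

    def replace(match):
        word = match.group(0)
        stressed = TOY_LEXICON.get(word.lower())
        if stressed is None:
            return word  # unknown word: leave it unmarked
        # Insert '+' at the same offset so the original casing is kept.
        plus_at = stressed.index("+")
        return word[:plus_at] + "+" + word[plus_at:]

    return WORD.sub(replace, text)


print(lexicon_stressify("У вазі стояли квіти."))
```

A function like this has the same shape as custom_stressify above (text in, text with + marks out), so it could be passed to evaluate_stressification in its place.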

Wav2Vec2 with Lexical Stress

This model transcribes Ukrainian speech and writes lexical stress marks directly into the transcription.

Fine-tuned model on Hugging Face: mouseyy/uk_wav2vec2_with_stress_mark
Training data: Common Voice corpus, annotated with lexical stress using Ukrainian Word Stress and Ukrainian Accentor

References

Dataset Sources

Common Voice: Rosana Ardila et al., LREC 2020 https://aclanthology.org/2020.lrec-1.520/
Voice of America ASR: Yehor Smoliakov, 2022. Zenodo DOI
PluG Corpus: https://github.com/Dandelliony/pluperfect_grac
Wikimedia Dumps: https://dumps.wikimedia.org
Dictionaries of Ukraine Online: https://lcorp.ulif.org.ua/dictua/

Models

ByT5 G2P: Jian Zhu et al., Interspeech 2022 arXiv
Wav2Vec 2.0: Alexei Baevski et al., NeurIPS 2020 arXiv

Repository: lang-uk/ukrainian-tts-preprocessing
