
xlm-roberta-ua-distilled 🇺🇦🇬🇧

Check out the model card on HF 📄

Also, try the model in action directly via the interactive demo on HF Spaces 🧪 No setup required — test its capabilities right in your browser! 💻


MTEB

As of April 17, 2025, the model achieves a rank of 43 on the MTEB leaderboard for the Ukrainian language and ranks higher than text-embedding-3-small by OpenAI, which is ranked 45th.


Benchmarks

Below is the performance of the models measured on sts17-crosslingual-sts, using Spearman correlation between the predicted similarity scores and the gold scores.

model                        en-en   en-ua   ua-ua
multi-qa-mpnet-base-dot-v1   75.8    12.9    62.3
XLM-RoBERTa                  52.2    13.5    41.5
xlm-roberta-ua-distilled*    73.1    62.0    64.5

For evaluation and benchmarking, the sts17-crosslingual-sts (semantic textual similarity) dataset was used. It consists of multilingual sentence pairs and a similarity score from 0 to 5 annotated by humans. However, the sts17-crosslingual-sts dataset does not provide sentence pairs for the Ukrainian language, so they were machine-translated using gpt-4o, resulting in en-en, en-ua, and ua-ua evaluation subsets. You can check out the translation process in more detail in the following notebook.
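Purely as an illustration (not the author's actual pipeline), translating a single STS sentence with gpt-4o could look like the sketch below; the client setup, prompt, and helper name are assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_to_ukrainian(sentence: str) -> str:
    """Translate one STS sentence into Ukrainian with gpt-4o (illustrative prompt)."""
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': 'Translate the user sentence into Ukrainian. Return only the translation.'},
            {'role': 'user', 'content': sentence},
        ],
    )
    return response.choices[0].message.content.strip()

print(translate_to_ukrainian('A man is playing a guitar.'))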

To see the benchmarking process in more detail, check out the following notebook.
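As a rough sketch of the metric itself (not the repository's benchmarking code), the Spearman correlation is computed between the model's cosine similarities and the gold scores; the en-ua pairs and scores below are made up for illustration.

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('panalexeu/xlm-roberta-ua-distilled')

# Hypothetical en-ua pairs with gold similarity scores on the 0-5 STS scale
pairs = [
    ('A man is playing a guitar.', 'Чоловік грає на гітарі.', 4.8),
    ('A woman is slicing onions.', 'Жінка ріже цибулю.', 4.6),
    ('A plane is taking off.', 'Діти грають у парку.', 0.4),
]

emb_en = model.encode([en for en, ua, score in pairs])
emb_ua = model.encode([ua for en, ua, score in pairs])

predicted = util.cos_sim(emb_en, emb_ua).diagonal()  # cosine similarity per pair
gold = [score for en, ua, score in pairs]

corr, _ = spearmanr(predicted, gold)
print(corr)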

Training Approach

The model was trained using the approach proposed by Nils Reimers and Iryna Gurevych in the research paper "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation".

The idea of the approach is to distill knowledge from the teacher model into the student model, using Mean Squared Error (MSE) as the loss function.

The MSE is calculated between the teacher model's embedding of a source sentence (e.g., in English) and the student model's embeddings of that same sentence in English, as well as of its translations into other languages (in our case, Ukrainian only).
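Schematically, for one English sentence and its Ukrainian translation, the objective looks like the toy sketch below (random vectors stand in for the real embeddings; both models produce 768-dimensional vectors).

import torch
import torch.nn.functional as F

# Stand-ins for real embeddings
teacher_en = torch.randn(768)   # teacher("I love coffee!")
student_en = torch.randn(768)   # student("I love coffee!")
student_ua = torch.randn(768)   # student("Я люблю каву!")

# Both student embeddings are pulled toward the same teacher embedding
# of the English source sentence.
loss = F.mse_loss(student_en, teacher_en) + F.mse_loss(student_ua, teacher_en)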

An illustration of this setup can be found in the SentenceTransformers multilingual training guide: https://www.sbert.net/examples/sentence_transformer/training/multilingual/README.html

In this way, the approach not only distills knowledge from the teacher model to the student, but also "squeezes" the embeddings of different training languages together, which makes sense, since semantically equivalent sentences should have similar vector representations across languages.

This results in improved model performance across several training languages and better cross-lingual transfer.

The SentenceTransformers library provides ready-to-use tools to implement the training process described above.
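A minimal sketch of this setup with those tools is shown below, using toy in-memory data and illustrative hyperparameters; the real run used the parallel datasets and settings described below, and the exact code is in the training notebook.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

teacher = SentenceTransformer('multi-qa-mpnet-base-dot-v1')  # monolingual English teacher
student = SentenceTransformer('xlm-roberta-base')            # plain HF checkpoint; mean pooling is added automatically

# Toy parallel corpus: (English sentence, Ukrainian translation)
parallel = [
    ('I love coffee!', 'Я люблю каву!'),
    ('The weather is nice today.', 'Сьогодні гарна погода.'),
]

# Each sentence (source and translation) is labeled with the teacher's
# embedding of the English source; the student learns to reproduce it.
examples = []
for en, ua in parallel:
    target = teacher.encode(en)
    examples.append(InputExample(texts=[en], label=target))
    examples.append(InputExample(texts=[ua], label=target))

loader = DataLoader(examples, shuffle=True, batch_size=2)
loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)

For large corpora, SentenceTransformers also ships ParallelSentencesDataset, which automates computing the teacher embeddings for source/translation pairs.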

The teacher model chosen was multi-qa-mpnet-base-dot-v1. This model is monolingual (English) and performs strongly on semantic search tasks (making it well suited for RAG), based on the benchmarks provided here.

The student model chosen was XLM-RoBERTa. This model is a multilingual version of RoBERTa, trained on CommonCrawl data covering 100 languages.

The training was performed on several parallel sentence datasets, specifically on their en-uk subsets.

The combined training dataset resulted in more than 500,000 sentence pairs.
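Pairs in this format can be pulled from any en-uk parallel corpus on the Hugging Face Hub; the dataset in the sketch below is only a stand-in for illustration, not necessarily one of the corpora actually used.

from datasets import load_dataset

# Illustrative parallel corpus with an en-uk configuration
ds = load_dataset('Helsinki-NLP/opus-100', 'en-uk', split='train')

# Each record holds a {'en': ..., 'uk': ...} translation pair
pairs = [(row['translation']['en'], row['translation']['uk']) for row in ds.select(range(1000))]
print(pairs[0])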

Training ran for 4 epochs with a batch size of 48 on a P100 GPU with 16 GB of memory provided by Kaggle, and took more than 8 hours.

You can check out the training process in more detail in the following notebook.

Usage Example

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('panalexeu/xlm-roberta-ua-distilled')

sentences = [
    'I love coffee!',
    'Я люблю каву!',
    'C is a compiled programming language known for its speed and low-level memory access.',  
    'Python — це інтерпретована мова програмування, що цінується за простоту та читабельність.'
]

embeds = model.encode(sentences)
embeds.shape
# (4, 768)

model.similarity(embeds, embeds)
# tensor([[1.0000, 0.9907, 0.3557, 0.3706],
#         [0.9907, 1.0000, 0.3653, 0.3757],
#         [0.3557, 0.3653, 1.0000, 0.7821],
#         [0.3706, 0.3757, 0.7821, 1.0000]])

A usage example is also provided as a notebook.

About

A distilled XLM-RoBERTa model fine-tuned for Ukrainian 🇺🇦 sentence embeddings.
