
xlm-roberta-ua-distilled 🇺🇦🇬🇧

Check out the model card on HF 📄

Also, try the model in action directly via the interactive demo on HF Spaces 🧪 No setup required — test its capabilities right in your browser! 💻


MTEB

As of April 17, 2025, the model achieves a rank of 43 on the MTEB leaderboard for the Ukrainian language and ranks higher than text-embedding-3-small by OpenAI, which is ranked 45th.


Benchmarks

Below is the performance of the models measured on sts17-crosslingual-sts, using Spearman correlation between the predicted similarity scores and the gold scores.

model                        en-en   en-ua   ua-ua
multi-qa-mpnet-base-dot-v1   75.8    12.9    62.3
XLM-RoBERTa                  52.2    13.5    41.5
xlm-roberta-ua-distilled*    73.1    62.0    64.5

For evaluation and benchmarking, the sts17-crosslingual-sts (semantic textual similarity) dataset was used. It consists of multilingual sentence pairs and a similarity score from 0 to 5 annotated by humans. However, the sts17-crosslingual-sts dataset does not provide sentence pairs for the Ukrainian language, so they were machine-translated using gpt-4o, resulting in en-en, en-ua, and ua-ua evaluation subsets. You can check out the translation process in more detail in the following notebook.
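Purely as an illustration (not the author's actual pipeline), translating a single STS sentence with gpt-4o could look like the sketch below; the client setup, prompt, and helper name are assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_to_ukrainian(sentence: str) -> str:
    """Translate one STS sentence into Ukrainian with gpt-4o (illustrative prompt)."""
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': 'Translate the user sentence into Ukrainian. Return only the translation.'},
            {'role': 'user', 'content': sentence},
        ],
    )
    return response.choices[0].message.content.strip()

print(translate_to_ukrainian('A man is playing a guitar.'))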

To see the benchmarking process in more detail, check out the following notebook.
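As a rough sketch of the metric itself (not the repository's benchmarking code), the Spearman correlation is computed between the model's cosine similarities and the gold scores; the en-ua pairs and scores below are made up for illustration.

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('panalexeu/xlm-roberta-ua-distilled')

# Hypothetical en-ua pairs with gold similarity scores on the 0-5 STS scale
pairs = [
    ('A man is playing a guitar.', 'Чоловік грає на гітарі.', 4.8),
    ('A woman is slicing onions.', 'Жінка ріже цибулю.', 4.6),
    ('A plane is taking off.', 'Діти грають у парку.', 0.4),
]

emb_en = model.encode([en for en, ua, score in pairs])
emb_ua = model.encode([ua for en, ua, score in pairs])

predicted = util.cos_sim(emb_en, emb_ua).diagonal()  # cosine similarity per pair
gold = [score for en, ua, score in pairs]

corr, _ = spearmanr(predicted, gold)
print(corr)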

Training Approach

The model was trained using the approach proposed by Nils Reimers and Iryna Gurevych in the research paper "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation".

The idea of the approach is to distill knowledge from the teacher model into the student model, using Mean Squared Error (MSE) as the loss function.

The MSE is calculated between the teacher model's embedding of a source sentence (e.g., in English) and the student model's embeddings of that same sentence in English, as well as of its translations into other languages (in our case, Ukrainian only).
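Schematically, for one English sentence and its Ukrainian translation, the objective looks like the toy sketch below (random vectors stand in for the real embeddings; both models produce 768-dimensional vectors).

import torch
import torch.nn.functional as F

# Stand-ins for real embeddings
teacher_en = torch.randn(768)   # teacher("I love coffee!")
student_en = torch.randn(768)   # student("I love coffee!")
student_ua = torch.randn(768)   # student("Я люблю каву!")

# Both student embeddings are pulled toward the same teacher embedding
# of the English source sentence.
loss = F.mse_loss(student_en, teacher_en) + F.mse_loss(student_ua, teacher_en)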

An illustration of this setup can be found in the SentenceTransformers multilingual training guide: https://www.sbert.net/examples/sentence_transformer/training/multilingual/README.html

In this way, the approach not only distills knowledge from the teacher model to the student, but also "squeezes" the embeddings of different training languages together, which makes sense, since semantically equivalent sentences should have similar vector representations across languages.

This results in improved model performance across several training languages and better cross-lingual transfer.

The SentenceTransformers library provides ready-to-use tools to implement the training process described above.
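A minimal sketch of this setup with those tools is shown below, using toy in-memory data and illustrative hyperparameters; the real run used the parallel datasets and settings described below, and the exact code is in the training notebook.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

teacher = SentenceTransformer('multi-qa-mpnet-base-dot-v1')  # monolingual English teacher
student = SentenceTransformer('xlm-roberta-base')            # plain HF checkpoint; mean pooling is added automatically

# Toy parallel corpus: (English sentence, Ukrainian translation)
parallel = [
    ('I love coffee!', 'Я люблю каву!'),
    ('The weather is nice today.', 'Сьогодні гарна погода.'),
]

# Each sentence (source and translation) is labeled with the teacher's
# embedding of the English source; the student learns to reproduce it.
examples = []
for en, ua in parallel:
    target = teacher.encode(en)
    examples.append(InputExample(texts=[en], label=target))
    examples.append(InputExample(texts=[ua], label=target))

loader = DataLoader(examples, shuffle=True, batch_size=2)
loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)

For large corpora, SentenceTransformers also ships ParallelSentencesDataset, which automates computing the teacher embeddings for source/translation pairs.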

The teacher model chosen was multi-qa-mpnet-base-dot-v1. This model is monolingual (English) and performs strongly on semantic search tasks (making it well suited for RAG), based on the benchmarks provided here.

The student model chosen was XLM-RoBERTa. This model is a multilingual version of RoBERTa, trained on CommonCrawl data covering 100 languages.

The training was performed on several parallel sentence datasets, specifically on their en-uk subsets.

The combined training dataset resulted in more than 500,000 sentence pairs.
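Pairs in this format can be pulled from any en-uk parallel corpus on the Hugging Face Hub; the dataset in the sketch below is only a stand-in for illustration, not necessarily one of the corpora actually used.

from datasets import load_dataset

# Illustrative parallel corpus with an en-uk configuration
ds = load_dataset('Helsinki-NLP/opus-100', 'en-uk', split='train')

# Each record holds a {'en': ..., 'uk': ...} translation pair
pairs = [(row['translation']['en'], row['translation']['uk']) for row in ds.select(range(1000))]
print(pairs[0])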

Training ran for 4 epochs with a batch size of 48 on a P100 GPU with 16 GB of memory provided by Kaggle, and took more than 8 hours.

You can check out the training process in more detail in the following notebook.

Usage Example

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('panalexeu/xlm-roberta-ua-distilled')

sentences = [
    'I love coffee!',
    'Я люблю каву!',
    'C is a compiled programming language known for its speed and low-level memory access.',  
    'Python — це інтерпретована мова програмування, що цінується за простоту та читабельність.'
]

embeds = model.encode(sentences)
embeds.shape
# (4, 768)

model.similarity(embeds, embeds)
# tensor([[1.0000, 0.9907, 0.3557, 0.3706],
#         [0.9907, 1.0000, 0.3653, 0.3757],
#         [0.3557, 0.3653, 1.0000, 0.7821],
#         [0.3706, 0.3757, 0.7821, 1.0000]])

A usage example is also provided as a notebook.

About

A distilled XLM-RoBERTa model fine-tuned for Ukrainian 🇺🇦 sentence embeddings.
