This project explores monolingual and multilingual paraphrasing across English, German, Czech, and Slovene, evaluating whether multilingual training improves paraphrase generation. Due to the lack of paraphrase datasets for low-resource languages, we generated monolingual datasets via back-translation and trained mT5 models both monolingually and multilingually.
Our results show that monolingual models performed best in their respective languages (Parascore: 0.89–0.96), while multilingual models balanced performance across languages, improving Slovene at the cost of a slight drop in English accuracy. Human evaluation confirmed that our datasets offer better lexical diversity than Tatoeba but include more noise.
This repository does not collect any new data. Instead, we leverage existing resources: the ParaCrawl dataset, a large collection of parallel sentences across many languages. We use machine translation models from huggingface to turn this translation data into paraphrase data. While other multilingual parallel datasets do contain sentence pairs within a single language (i.e. paraphrases), they offer few if any such pairs for medium-resource languages like Slovene. With our approach we create similarly sized paraphrase datasets for different languages, including medium-resource ones, by leveraging translation data, which is far more widely available than paraphrase data.
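
Concretely, the construction can be sketched as follows: given a ParaCrawl en-de translation pair, machine-translate the German side back into English to obtain an English paraphrase pair. This is a minimal illustration; the Helsinki-NLP model and the helper function below are our choices for the sketch, not necessarily the repo's exact script:

```python
from transformers import pipeline

# Back-translate the German side of a ParaCrawl en-de pair into English,
# yielding an (original, paraphrase) pair. The model choice is illustrative.
de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def to_paraphrase_pair(en_sentence: str, de_sentence: str) -> tuple[str, str]:
    """Return (original English sentence, back-translated English sentence)."""
    back_translation = de_to_en(de_sentence, max_length=256)[0]["translation_text"]
    return en_sentence, back_translation

print(to_paraphrase_pair(
    "The weather is beautiful today.",
    "Das Wetter ist heute wunderschön.",
))
```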
Our generated data can be accessed on huggingface:
- ParaCrawl-enen
- ParaCrawl-dede
- ParaCrawl-slsl
- ParaCrawl-cscs
- ParaCrawl-multi_all
- ParaCrawl-multi_small
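
A minimal loading sketch with the `datasets` library; `<namespace>` is a placeholder for the actual huggingface account, and the field layout of each record depends on the dataset schema:

```python
from datasets import load_dataset

# "<namespace>" is a placeholder; substitute the actual huggingface account.
dataset = load_dataset("<namespace>/ParaCrawl-enen")
print(dataset["train"][0])  # one paraphrase pair (field names depend on the schema)
```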
We evaluate the quality of our monolingual datasets via human evaluation of a sample from each dataset, in direct comparison to other popular paraphrase datasets. We rate semantic similarity and lexical divergence and calculate a score based on their combination. The human evaluation results for the four generated monolingual datasets are shown in the following table:
Language pair | Our dataset | Tatoeba |
---|---|---|
en-en | 0.256 | 0.307 |
de-de | 0.291 | 0.588 |
sl-sl | 0.271 | 0.015 |
cs-cs | 0.189 | 0.210 |
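
The exact combination formula is not spelled out above; as one illustrative choice (our assumption, not necessarily the formula behind the table), the two ratings can simply be multiplied, so that a pair scores high only when it both preserves meaning and rephrases it:

```python
def combined_score(semantic_similarity: float, lexical_divergence: float) -> float:
    """Combine two human ratings, each assumed normalized to [0, 1].

    The product rewards pairs that preserve meaning AND rephrase it:
    a verbatim copy (divergence 0) and an unrelated sentence
    (similarity 0) both score 0.
    """
    return semantic_similarity * lexical_divergence
```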
We train six different mT5 models, one for each of the datasets we created. We refer to these models as mono- and multilingual models, even though they all start from the multilingual mT5 checkpoint, because we fine-tune them on the generated mono- and multilingual datasets.
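
A minimal fine-tuning sketch with the `transformers` Seq2Seq trainer, assuming `google/mt5-small` as the base checkpoint and a `source`/`target` field schema (both assumptions, as are the hyperparameters):

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
dataset = load_dataset("<namespace>/ParaCrawl-enen")  # placeholder path

def preprocess(batch):
    # The field names "source" and "target" are assumptions about the schema.
    inputs = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

train_set = dataset["train"].map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mt5_small-enen",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```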
Our trained models can be accessed on huggingface:
- MT5_small-enen
- MT5_small-dede
- MT5_small-slsl
- MT5_small-cscs
- MT5_small-multi_all
- MT5_small-multi_small
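
A minimal usage sketch for one of the trained models; `<namespace>` is again a placeholder, and whether the input needs a task prefix depends on how the models were trained:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "<namespace>/MT5_small-enen"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Generate a paraphrase of the input sentence with beam search.
inputs = tokenizer("The weather is beautiful today.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```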
We use the Parascore metric to evaluate all models. The Parascore results for the four monolingually trained models are shown in the following table:
Language pair | Parascore |
---|---|
en-en | 0.961 |
de-de | 0.925 |
sl-sl | 0.890 |
cs-cs | 0.922 |
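
For reference, Parascore combines a semantic-similarity term with a lexical-diversity term. The sketch below is a simplified stand-in (BERTScore F1 plus a weighted normalized edit distance), not the official Parascore implementation, and the weight is illustrative:

```python
from bert_score import score as bert_score

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def parascore_like(source: str, candidate: str, weight: float = 0.05) -> float:
    # Semantic similarity: BERTScore F1 between candidate and source.
    _, _, f1 = bert_score([candidate], [source], lang="en")
    similarity = f1.item()
    # Lexical diversity: edit distance normalized by the longer string.
    diversity = edit_distance(source, candidate) / max(len(source), len(candidate), 1)
    return similarity + weight * diversity

print(parascore_like("The weather is beautiful today.",
                     "Today the weather is gorgeous."))
```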
The Parascore results for the two multilingually trained models are shown in the following table, which also reports the average score for each of the four language-specific parts of the test split of the multilingual dataset:
Test set part | Parascore (multi-small) | Parascore (multi-all) |
---|---|---|
whole test set | 0.925 | 0.925 |
English part | 0.938 | 0.939 |
German part | 0.926 | 0.925 |
Slovene part | 0.915 | 0.914 |
Czech part | 0.922 | 0.922 |
- Nikolay Vasilev
- Jannik Weiß (YAWNICK)
- Jan Jenicek (hjeni)