DEFI-COLaF/odt-benchmark-recital2025
ODT - Occitan Dialect Translation Benchmark

Code and data associated with the following paper, presented at CORIA-TALN-RECITAL 2025:

La traduction automatique dialectale: état de l'art et étude préliminaire sur le continuum dialectal de l'occitan (Nédey, 2025) (in English: "Dialectal machine translation: state of the art and a preliminary study on the Occitan dialect continuum")

Please cite it if you reuse code, results or analyses from this repo!

@inproceedings{Nedey:CORIA-TALN:2025,
    author = "N\'edey, Oriane",
    title = "La traduction automatique dialectale: \'etat de l'art et \'etude pr\'eliminaire sur le continuum dialectal de l'occitan",
    booktitle = "Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes des 18e Rencontres Jeunes Chercheurs en RI (RJCRI) et 27\`eme Rencontre des \'Etudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL)",
    month = "6",
    year = "2025",
    address = "Marseille, France",
    publisher = "Association pour le Traitement Automatique des Langues",
    pages = "190-238",
    note = "",
    abstract = "Cet article dresse un \'etat de l'art de la traduction automatique et de son \'evaluation pour les langues \`a variation dialectale, et en particulier pour les continuums dialectaux. Pour illustrer cet \'etat de l'art, nous proposons une s\'erie d'exp\'eriences pr\'eliminaires sur le continuum occitan, afin de dresser un \'etat des performances des syst\`emes existants pour la traduction depuis et vers plusieurs vari\'et\'es d'occitan. Nos r\'esultats indiquent d'une part des performances globalement satisfaisantes pour la traduction vers le fran\c{c}ais et l'anglais. D'autre part, des analyses m\'elang\'ees \`a des outils d'identification de langues sur les pr\'edictions vers l'occitan mettent en lumi\`ere la capacit\'e de la plupart des syst\`emes \'evalu\'es \`a g\'en\'erer des textes dans cette langue (y compris en zero-shot ), mais r\'ev\`elent aussi des limitations en termes d'\'evaluation de la diversit\'e dialectale dans les traductions propos\'ees.",
    keywords = "traduction automatique, occitan, \'evaluation, langues peu dot\'ees, dialectes.",
    url = "https://talnarchives.atala.org/RECITAL/RECITAL-2025/130.pdf"
}

How to reproduce

Installing requirements

Install requirements with the following command:

pip install -r requirements.txt

Prepare data

The experiments were conducted on two datasets: Flores and LoCongresNews.

You can download the Flores dataset using the script at data/flores/dl_flores.py. Beware: you might need to contact the authors of the Aranese version to obtain the version with corrected alignments. In that case, you can use the notebook at data/flores/flores_devtest/replace_aranese_in_flores_odt_devtest.ipynb to merge the realigned Aranese devtest file with the rest of the downloaded test set in TSV format.
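The mapping step performed by that notebook can be sketched as follows; the column names ('variety', 'text_variety') come from the schema documented in this README, while the helper itself is illustrative and not the notebook's exact code:

```python
def replace_aranese(rows, realigned_lines, variety_col="variety", text_col="text_variety"):
    """Replace the Occitan text of each 'aranese' row with the realigned
    line at the same relative position (illustrative, not the notebook's code)."""
    realigned = iter(realigned_lines)
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[variety_col] == "aranese":
            row[text_col] = next(realigned)
        out.append(row)
    return out
```

Rows whose variety is not 'aranese' pass through unchanged, so the rest of the downloaded test set is preserved as-is.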

The prepared Flores test set should be a TSV with the following columns:

  • text_variety: text in Occitan
  • text_par_1: text in French
  • text_par_2: text in English
  • variety: name of the Occitan variety
  • lang_par_1: language code for the text in 'text_par_1' (french)
  • lang_par_2: language code for the text in 'text_par_2' (english)
  • contrastive_sample_id: index to create a subset of contrastive samples (not used)

The corpus LoCongresNews was preprocessed using the notebook data/LoCongresNews/LoCongresNewsMT.ipynb. You should change the path to the original corpus (downloaded here) at the beginning of the notebook to run it.

The prepared LoCongresNews corpus should be a TSV with the following columns:

  • text_variety: text in Occitan
  • text_par_1: text in French
  • variety: name of the Occitan variety
  • lang_par_1: language code for the text in 'text_par_1' (fr)

For few-shot prompting experiments, the few-shots used are stored in data/few_shots.

Run predictions and evaluations with MT metrics

Predictions and evaluation with MT metrics can be run either at once or in multiple calls using the script run_benchmark.py.

The basic command is:

python run_benchmark.py \
  [testset_path] \
  -c odt_benchmark_config.yaml \
  -fs [fewshots_file]

Other options are available, see python run_benchmark.py -h.

In particular, it is recommended to call the script several times to use the right resources (esp. GPUs) for the right models. To that end, the options --model_regex, --pred_only (run only predictions) and --eval_only (run only evaluations) should be particularly useful.
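One way to organise those separate calls is to build the command lines up front and dispatch each to the right machine. A sketch (the flags come from run_benchmark.py -h as described above; the helper and the model-name patterns are invented for illustration):

```python
def benchmark_commands(testset, config, model_groups):
    """Build one prediction-only run_benchmark.py call per model group,
    plus a final evaluation-only call (helper is illustrative)."""
    base = ["python", "run_benchmark.py", testset, "-c", config]
    commands = [base + ["--pred_only", "--model_regex", regex] for regex in model_groups]
    commands.append(base + ["--eval_only"])
    return commands
```

Each command list can then be launched with subprocess.run on the node that has the resources (e.g. GPUs) suited to that model group.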

The only exception is Google Translate, for which you need to compute the predictions using the external scripts (in the folder scripts_google_translate/) before running run_benchmark.py. Note also that these translations are produced with the Google Cloud Translate API, which is not free of charge (unless you get a free trial, as we did when the experiments were run).

  1. Set up the Google Cloud API on your machine
  2. Export the ID of your Google Cloud project in the environment variable GOOGLE_CLOUD_PROJECT
  3. Run translate_google.py (see -h for the options) on an input file with one segment to translate per line.
    • In our experiments, we ran translations for the directions 'oc-fr', 'fr-oc', 'oc-en' and 'fr-en'.

For Flores, as the prepared test set contains duplicates of the French/English segments to alternate between the 'occitan' and 'aranese' varieties, you should pass Google Translate a file without duplicates for the directions fr-en and fr-oc, and re-duplicate the predictions afterwards with the script duplicate_preds.py.
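The de-duplication and re-duplication around the Google Translate call can be sketched as follows (these helpers are illustrative; the repo's duplicate_preds.py may work differently):

```python
def deduplicate(segments):
    """Return the unique segments in first-seen order, plus the index of
    each original segment in the unique list (kept for re-duplication)."""
    unique, index_of, mapping = [], {}, []
    for seg in segments:
        if seg not in index_of:
            index_of[seg] = len(unique)
            unique.append(seg)
        mapping.append(index_of[seg])
    return unique, mapping

def reduplicate(translations, mapping):
    """Expand the translations of the unique segments back to the original order."""
    return [translations[i] for i in mapping]
```

Only the unique segments are sent to the API; the mapping restores the duplicated layout so the predictions line up with the prepared test set.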

Predict and analyze results with language identification (LID) models

You can inspect and/or rerun the notebook data/analyze_preds.ipynb to run predictions with LID models and compute the corresponding corpus-level metrics, such as the LID mean and Mean Squared Error (MSE). The notebook also contains plots to visualize the results and interact with individual samples.
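At corpus level, the metrics named above reduce in essence to averaging the per-sample LID scores and measuring their squared deviation from a reference score. A sketch assuming scores in [0, 1] and a perfect score of 1.0 (the notebook's exact definitions may differ):

```python
def lid_mean(scores):
    """Mean LID confidence for the target language over all candidates."""
    return sum(scores) / len(scores)

def lid_mse(scores, target=1.0):
    """Mean squared error of the LID scores against a reference score
    (target=1.0 is an assumption, not necessarily the notebook's choice)."""
    return sum((s - target) ** 2 for s in scores) / len(scores)
```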

Download predictions

To enable further analysis of the outputs of the benchmarked models without recomputing them (saving time, money (esp. for Google Translate) and energy, and avoiding potential reproducibility issues), we publish the predictions as zip files, each containing one file per model:

  • data/preds: TSV files with columns for candidates, references, sources
  • data/cache: TSV files with candidate predictions only
  • data/lid_preds: DataFrames stored with pickle, containing LID scores for each candidate translation

Contact

Oriane Nédey : oriane (dot) nedey (at) inria (dot) fr

Don't hesitate to reach out to me if you want to discuss matters related to NLP for dialectal continua (esp. machine translation and Occitan NLP)!
