DEFI-COLaF/odt-benchmark-recital2025
ODT - Occitan Dialect Translation Benchmark

Code and data associated with the following paper, presented at CORIA-TALN-RECITAL 2025:

La traduction automatique dialectale: état de l'art et étude préliminaire sur le continuum dialectal de l'occitan (Nédey, 2025) (in English: "Dialectal machine translation: state of the art and a preliminary study on the Occitan dialect continuum")

Please cite it if you reuse code, results or analyses from this repo!

@inproceedings{Nedey:CORIA-TALN:2025,
    author = "N\'edey, Oriane",
    title = "La traduction automatique dialectale: \'etat de l'art et \'etude pr\'eliminaire sur le continuum dialectal de l'occitan",
    booktitle = "Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes des 18e Rencontres Jeunes Chercheurs en RI (RJCRI) et 27\`eme Rencontre des \'Etudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL)",
    month = "6",
    year = "2025",
    address = "Marseille, France",
    publisher = "Association pour le Traitement Automatique des Langues",
    pages = "190-238",
    note = "",
    abstract = "Cet article dresse un \'etat de l'art de la traduction automatique et de son \'evaluation pour les langues \`a variation dialectale, et en particulier pour les continuums dialectaux. Pour illustrer cet \'etat de l'art, nous proposons une s\'erie d'exp\'eriences pr\'eliminaires sur le continuum occitan, afin de dresser un \'etat des performances des syst\`emes existants pour la traduction depuis et vers plusieurs vari\'et\'es d'occitan. Nos r\'esultats indiquent d'une part des performances globalement satisfaisantes pour la traduction vers le fran\c{c}ais et l'anglais. D'autre part, des analyses m\'elang\'ees \`a des outils d'identification de langues sur les pr\'edictions vers l'occitan mettent en lumi\`ere la capacit\'e de la plupart des syst\`emes \'evalu\'es \`a g\'en\'erer des textes dans cette langue (y compris en zero-shot ), mais r\'ev\`elent aussi des limitations en termes d'\'evaluation de la diversit\'e dialectale dans les traductions propos\'ees.",
    keywords = "traduction automatique, occitan, \'evaluation, langues peu dot\'ees, dialectes.",
    url = "https://talnarchives.atala.org/RECITAL/RECITAL-2025/130.pdf"
}

How to reproduce

Installing requirements

Install requirements with the following command:

pip install -r requirements.txt

Prepare data

The experiments were conducted on two datasets: Flores and LoCongresNews.

You can download the Flores dataset using the script at data/flores/dl_flores.py. Beware: you might need to contact the authors of the Aranese version to obtain the version with corrected alignments. In that case, you can use the notebook at data/flores/flores_devtest/replace_aranese_in_flores_odt_devtest.ipynb to merge the realigned Aranese devtest file with the rest of the downloaded test set in TSV format.
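The mapping step performed by that notebook can be sketched as follows; the column names ('variety', 'text_variety') come from the schema documented in this README, while the helper itself is illustrative and not the notebook's exact code:

```python
def replace_aranese(rows, realigned_lines, variety_col="variety", text_col="text_variety"):
    """Replace the Occitan text of each 'aranese' row with the realigned
    line at the same relative position (illustrative, not the notebook's code)."""
    realigned = iter(realigned_lines)
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[variety_col] == "aranese":
            row[text_col] = next(realigned)
        out.append(row)
    return out
```

Rows whose variety is not 'aranese' pass through unchanged, so the rest of the downloaded test set is preserved as-is.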

The prepared Flores test set should be a TSV with the following columns:

  • text_variety: text in Occitan
  • text_par_1: text in French
  • text_par_2: text in English
  • variety: name of the Occitan variety
  • lang_par_1: language code for the text in 'text_par_1' (french)
  • lang_par_2: language code for the text in 'text_par_2' (english)
  • contrastive_sample_id: index to create a subset of contrastive samples (not used)

The corpus LoCongresNews was preprocessed using the notebook data/LoCongresNews/LoCongresNewsMT.ipynb. You should change the path to the original corpus (downloaded here) at the beginning of the notebook to run it.

The prepared LoCongresNews corpus should be a TSV with the following columns:

  • text_variety: text in Occitan
  • text_par_1: text in French
  • variety: name of the Occitan variety
  • lang_par_1: language code for the text in 'text_par_1' (fr)

For few-shot prompting experiments, the few-shots used are stored in data/few_shots.

Run predictions and evaluations with MT metrics

Predictions and evaluation with MT metrics can be run either at once or in multiple calls using the script run_benchmark.py.

The basic command is:

python run_benchmark.py \
  [testset_path] \
  -c odt_benchmark_config.yaml \
  -fs [fewshots_file]

Other options are available, see python run_benchmark.py -h.

In particular, it is recommended to call the script several times to use the right resources (esp. GPUs) for the right models. To that end, the options --model_regex, --pred_only (run only predictions) and --eval_only (run only evaluations) should be particularly useful.
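One way to organise those separate calls is to build the command lines up front and dispatch each to the right machine. A sketch (the flags come from run_benchmark.py -h as described above; the helper and the model-name patterns are invented for illustration):

```python
def benchmark_commands(testset, config, model_groups):
    """Build one prediction-only run_benchmark.py call per model group,
    plus a final evaluation-only call (helper is illustrative)."""
    base = ["python", "run_benchmark.py", testset, "-c", config]
    commands = [base + ["--pred_only", "--model_regex", regex] for regex in model_groups]
    commands.append(base + ["--eval_only"])
    return commands
```

Each command list can then be launched with subprocess.run on the node that has the resources (e.g. GPUs) suited to that model group.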

The only exception is Google Translate, for which you need to compute the predictions using the external scripts (in the folder scripts_google_translate/) before running run_benchmark.py. Note also that these translations are produced with the Google Cloud Translate API, which is not free of charge (unless you get a free trial, as we did when the experiments were run).

  1. Set up the Google Cloud API on your machine
  2. Export the ID of your Google Cloud project in the environment variable GOOGLE_CLOUD_PROJECT
  3. Run translate_google.py (see -h for the options) on an input file with one segment to translate per line.
    • In our experiments, we ran translations for the directions 'oc-fr', 'fr-oc', 'oc-en' and 'fr-en'.

For Flores, as the prepared test set contains duplicates of the French/English segments to alternate between the 'occitan' and 'aranese' varieties, you should pass Google Translate a file without duplicates for the directions fr-en and fr-oc, and re-duplicate the predictions afterwards with the script duplicate_preds.py.
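The de-duplication and re-duplication around the Google Translate call can be sketched as follows (these helpers are illustrative; the repo's duplicate_preds.py may work differently):

```python
def deduplicate(segments):
    """Return the unique segments in first-seen order, plus the index of
    each original segment in the unique list (kept for re-duplication)."""
    unique, index_of, mapping = [], {}, []
    for seg in segments:
        if seg not in index_of:
            index_of[seg] = len(unique)
            unique.append(seg)
        mapping.append(index_of[seg])
    return unique, mapping

def reduplicate(translations, mapping):
    """Expand the translations of the unique segments back to the original order."""
    return [translations[i] for i in mapping]
```

Only the unique segments are sent to the API; the mapping restores the duplicated layout so the predictions line up with the prepared test set.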

Predict and analyze results with language identification (LID) models

You can inspect and/or rerun the notebook data/analyze_preds.ipynb to run predictions with LID models and compute the corresponding corpus-level metrics, such as the LID mean and Mean Squared Error (MSE). The notebook also contains plots to visualize the results and interact with individual samples.
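At corpus level, the metrics named above reduce in essence to averaging the per-sample LID scores and measuring their squared deviation from a reference score. A sketch assuming scores in [0, 1] and a perfect score of 1.0 (the notebook's exact definitions may differ):

```python
def lid_mean(scores):
    """Mean LID confidence for the target language over all candidates."""
    return sum(scores) / len(scores)

def lid_mse(scores, target=1.0):
    """Mean squared error of the LID scores against a reference score
    (target=1.0 is an assumption, not necessarily the notebook's choice)."""
    return sum((s - target) ** 2 for s in scores) / len(scores)
```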

Download predictions

To enable further analysis of the outputs of the benchmarked models without recomputing them (saving time, money (esp. for Google Translate) and energy, and avoiding potential reproducibility issues), we publish the predictions as zip files, each containing one file per model:

  • data/preds: TSV files with columns for candidates, references, sources
  • data/cache: TSV files with candidate predictions only
  • data/lid_preds: DataFrames stored with pickle, containing LID scores for each candidate translation

Contact

Oriane Nédey : oriane (dot) nedey (at) inria (dot) fr

Don't hesitate to reach out to me if you want to discuss matters related to NLP for dialectal continua (esp. machine translation and Occitan NLP)!
