This GitHub repository hosts the data, evaluation results, and codebase for "Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?". It additionally hosts BasqueSumm, the first collection of news documents and their corresponding subheads (often used as a proxy for summaries) for Basque.
BASSE is a multilingual (Basque and Spanish) dataset designed primarily for the meta-evaluation of automatic summarization metrics and LLM-as-a-Judge models. We generated automatic summaries for 90 news documents in these two languages (45 per language) using Anthropic's Claude, OpenAI's GPT-4o, Reka AI's Reka, Meta's Llama 3.1 Instruct, and Cohere's Command R+. For each of these models, we use four different prompts (base, core, 5W1H, tldr; see the paper for more details), with the goal of generating summaries that are diverse in both quality and style. We also include human-generated reference summaries for each news document.
After generating these summaries, we annotated them for Coherence, Consistency, Fluency, Relevance, and 5W1H on a 5-point Likert scale, largely following the annotation protocol from SummEval.
The BASSE dataset is contained in two JSONLines files, one per language:

- data/es/BASSE.es.jsonl
- data/eu/BASSE.eu.jsonl
To load the data, use the following code snippet:
import json
es_data = [json.loads(line) for line in open("data/es/BASSE.es.jsonl")]
eu_data = [json.loads(line) for line in open("data/eu/BASSE.eu.jsonl")]
Each line is a JSON object with the following key-value pairs:
"idx"
(str): A unique identifier defined by the url of the original publication."round"
(int):0
,1
,2
, or3
- Which annotation round this example comes from: rounds0
,1
and2
have three annotations per criteria and three human reference summaries per document, while round3
only has a single annotation and reference summary. See section Inter-annotator agreement."original_document"
(str): The original news document to be summarized."reference_summaries"
(list): The human-generated reference summaries. 3 per document in rounds 1 and 2, and a single reference summary per document for round 3."model_summaries"
(dict): The generated summary and its human annotations on a 5-point Likert scale for the 5 criteria: Coherence, Consistency, Fluency, Relevance, and 5W1H. The model key is a combination of model and prompt, e.g"claude-base"
. We also include annotations for the subhead baseline ("subhead"
) and similarly include the evaluations of the human reference summaries are included in rounds 1 and 2 ("human-ann1"
,"human-ann2"
,"human-ann3"
). Note that these only have 2 annotations per summary.
For example, the first instance in data/es/BASSE.es.jsonl contains the following information (note that long summaries are cut off and only two model summaries are shown; in reality there are 24 entries: 5 models x 4 prompts + 3 human references + 1 subhead):
{
  "idx": "http://elpais.com/deportes/2019/08/17/actualidad/1566005143_044557.html",
  "round": 1,
  "original_document": "El jet lag ante Argentina , que quedó maquillado por el...",
  "reference_summaries": [
    "La selección española pierde 55-74 contra ...",
    "La selección española de baloncesto perdió ante Rusia (55.74) ...",
    "El equipo español de baloncesto perdió..."
  ],
  "model_summaries": {
    "human-ann1": {
      "summ": "La selección española pierde 55-74...",
      "anns": {
        "Coherence": [5.0, 5.0],
        "Consistency": [5.0, 5.0],
        "Fluency": [5.0, 5.0],
        "Relevance": [5.0, 5.0],
        "5W1H": [3.0, 4.0]
      }
    },
    ...
    "claude-base": {
      "summ": "El texto describe un partido de preparación...",
      "anns": {
        "Coherence": [2.0, 3.0, 3.0],
        "Consistency": [4.0, 4.0, 5.0],
        "Fluency": [5.0, 5.0, 5.0],
        "Relevance": [4.0, 4.0, 3.0],
        "5W1H": [5.0, 4.0, 5.0]
      }
    },
    ...
  }
}
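As a quick illustration of how to work with this structure (a minimal sketch, assuming the loading snippet above has been run), the following averages the human annotations per criterion for each system in the Basque portion:

```python
from collections import defaultdict

# Collect all human ratings per system (model-prompt pair) and criterion
ratings = defaultdict(lambda: defaultdict(list))
for example in eu_data:
    for system, entry in example["model_summaries"].items():
        for criterion, scores in entry["anns"].items():
            ratings[system][criterion].extend(scores)

# Print the mean rating per criterion for each system
for system, criteria in sorted(ratings.items()):
    means = {criterion: sum(s) / len(s) for criterion, s in criteria.items()}
    print(system, means)
```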
We base our experimental setup on the codebases of SummEval for automatic metrics and Prometheus-Eval for LLM-as-a-Judge evaluation.
To use this repo, first clone it and install the necessary dependencies:
git clone --recursive https://github.com/hitz-zentroa/summarization.git
cd summarization
pip install -r requirements.txt
The --recursive flag ensures that SummEval is cloned into the lib directory. If you prefer, you can instead attempt to install it directly via pip:
pip install summ-eval
If you're planning to run automatic metrics, you need to set the following environment variable (required by ROUGE):
export ROUGE_HOME=$PWD/lib/SummEval/evaluation/summ_eval/ROUGE-1.5.5
You will also need to download a couple of assets. NLTK's punkt_tab resources are necessary to preprocess BASSE for automatic metric scoring. Download them by running:
import nltk
nltk.download('punkt_tab')
FastText embeddings are used to compute ROUGE-WE scores. Download them into assets:
wget -P assets https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.eu.300.vec.gz
wget -P assets https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz
We measured agreement between human annotators for each evaluation criterion in terms of quadratically weighted Cohen's kappa (κ), Krippendorff's ordinal alpha (α), and percentage agreement.
Use the notebook agreement.ipynb to reproduce the agreement tables and plots included in our article; they are saved under results/agreement.
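For intuition, these statistics can be computed roughly as follows. This is a minimal sketch assuming the scikit-learn and krippendorff packages and toy ratings; the notebook is the authoritative implementation:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Toy 5-point Likert ratings from two annotators for one criterion
ann1 = [5, 4, 3, 5, 2, 4]
ann2 = [5, 4, 4, 5, 2, 3]

# Quadratically weighted Cohen's kappa
kappa = cohen_kappa_score(ann1, ann2, weights="quadratic")

# Krippendorff's alpha for ordinal data (one row per annotator)
alpha = krippendorff.alpha(reliability_data=np.array([ann1, ann2]),
                           level_of_measurement="ordinal")

# Raw percentage agreement
agreement = np.mean(np.array(ann1) == np.array(ann2))

print(f"kappa={kappa:.3f}, alpha={alpha:.3f}, agreement={agreement:.1%}")
```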
Note that agreement was computed over three annotation rounds: rounds 0 and 1 involve annotations on the same set of summaries, before and after agreement discussions between annotators, respectively. Round 0 is provided separately from the BASSE corpus for reproducibility (BASSE.es.round_0.jsonl and BASSE.eu.round_0.jsonl). Round 2 covers a new set of summaries. In the paper, we refer to rounds 0, 1, and 2 as 1, 2, and 3, respectively, as a matter of style.
The notebook corpus.ipynb contains code for both quantitative and qualitative analysis of the BASSE corpus.
The quantitative analysis includes computing the number of documents, tokens, and sentences per document, along with vocabulary size and the number of summaries generated in the target language. These metrics are aggregated by LLM, prompt, and language. Additionally, we use SummEval’s DataStats metric to assess summary compression and novel bigram usage.
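For instance, per-document token, sentence, and vocabulary statistics can be computed along the following lines. This is a rough sketch using NLTK's default (English) tokenizers as an approximation; the notebook contains the actual implementation and the aggregation by LLM, prompt, and language:

```python
from nltk.tokenize import sent_tokenize, word_tokenize

def document_stats(documents):
    # Count documents, tokens, sentences, and vocabulary size over a list of texts
    vocabulary = set()
    n_tokens = n_sentences = 0
    for text in documents:
        tokens = word_tokenize(text)
        vocabulary.update(token.lower() for token in tokens)
        n_tokens += len(tokens)
        n_sentences += len(sent_tokenize(text))
    return {
        "documents": len(documents),
        "tokens": n_tokens,
        "sentences": n_sentences,
        "vocabulary": len(vocabulary),
    }

# Example: statistics over the Basque source documents
print(document_stats([example["original_document"] for example in eu_data]))
```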
The qualitative analysis reports human ratings on the criteria Coherence, Consistency, Fluency, Relevance, and 5W1H broken down by LLM and prompt pair.
All plots and tables generated by the notebook, as included in our paper, are available under results/corpus.
Following the SummEval framework, we apply a range of standard automatic metrics (some with minor adaptations or fixes provided in src/metrics), to later meta-evaluate their correlation with human judgments in Spanish and Basque.
To reproduce the metric-based evaluations, run the script:
python -m src.01_apply_automatic_metrics --language LANGUAGE --metrics METRIC [METRIC ...]
where LANGUAGE is one of {eu,es} and METRIC is one of {rouge,m_rouge_we,m_bert_score,bleu,chrf,m_meteor,cider,stats}, or all to compute all implemented metrics.
The output of this script is located under pred/metrics. It consists of one CSV file per language and metric, with the following columns:
- model (str): the model whose summaries have been evaluated. As in the BASSE corpus, it is a combination of the actual model name and prompt, e.g. "claude-base".
- metric (str): the name of the specific metric; m_bert_score, for instance, produces BERTScore precision, recall, and F1-score.
- score (float): the score obtained by the model's summaries (usually the mean of the individual summaries' scores, but the aggregation depends on the metric).
Each CSV contains one row per model name and prompt pair.
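Such a file can then be inspected with pandas, for example as sketched below (the file name is hypothetical; use whatever the script produced in your run):

```python
import pandas as pd

# Hypothetical output path: the script writes one CSV per language and metric
df = pd.read_csv("pred/metrics/es_rouge.csv")

# One row per model-prompt pair; pivot to compare systems across metric variants
table = df.pivot_table(index="model", columns="metric", values="score")
print(table.sort_index())
```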
Using the Prometheus-Eval codebase, we run a range of open and proprietary LLMs to evaluate summaries across five criteria, to later meta-evaluate their correlation with human judgments in Spanish and Basque.
To reproduce the judge-based evaluations, run the script:
python -m src.02_apply_llm_as_a_judge --model MODEL --language LANGUAGE --criterion CRITERION [--num_gpus NUM_GPUS] [--use_async_lite]
where LANGUAGE is one of {eu,es} and CRITERION is one of {Coherence,Consistency,Fluency,Relevance,5W1H}. NUM_GPUS controls the number of GPUs used by vLLM to load the judge locally. Use the flag --use_async_lite instead when using a commercial LLM API for judging (see the supported providers at https://docs.litellm.ai/docs/providers).
The output of this script is located under pred/judges. It consists of one CSV file per language, judge, and criterion, with the following columns:
- model (str): the model whose summaries have been evaluated. As in the BASSE corpus, it is a combination of the actual model name and prompt, e.g. "claude-base".
- feedback (str): the feedback given by the judge to explain, support, or justify the produced score.
- score (float): the score obtained by the summary.
Each CSV contains one row per summary judged.
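From such a file, per-system mean judge scores can be derived as in the sketch below (the file name is hypothetical; actual names depend on the language, judge, and criterion you ran):

```python
import pandas as pd

# Hypothetical output path: one CSV per language, judge, and criterion
df = pd.read_csv("pred/judges/es_prometheus_Coherence.csv")

# Average the per-summary judge scores for each model-prompt pair
system_scores = df.groupby("model")["score"].mean().sort_values(ascending=False)
print(system_scores)
```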
To reproduce the correlations between the automatic and human evaluations, run the script:
python -m src.03_compute_correlations --language LANGUAGE [--metrics METRIC [METRIC ...]] [--judges JUDGE [JUDGE ...]]
where LANGUAGE is one of {eu,es}, METRIC is the name of a metric in pred/metrics or all, and JUDGE is the name of a judge in pred/judges or all.
The output of this script is located under results/correlation. It consists of one LaTeX table per language, evaluator type (metric or judge LLM), and correlation statistic (Spearman's rank correlation coefficient, ρ, and Kendall's rank correlation coefficient, τ).
Use the notebook correlations.ipynb to reproduce the correlations plot included in our article.
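For intuition, Spearman's ρ and Kendall's τ between two lists of scores can be computed with SciPy as in the toy sketch below; see the script for how automatic scores and human ratings are actually paired and aggregated:

```python
from scipy.stats import spearmanr, kendalltau

# Toy example: automatic scores vs. mean human ratings for the same summaries
automatic_scores = [0.42, 0.37, 0.51, 0.29, 0.45]
human_ratings = [3.7, 3.3, 4.2, 2.8, 3.9]

rho, rho_p = spearmanr(automatic_scores, human_ratings)
tau, tau_p = kendalltau(automatic_scores, human_ratings)
print(f"Spearman rho={rho:.3f} (p={rho_p:.3f}), Kendall tau={tau:.3f} (p={tau_p:.3f})")
```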
BasqueSumm was automatically compiled from www.berria.eus using trafilatura to extract the texts.
Each line is a JSON object with the following key-value pairs:

- "date" (str): When the article was published, formatted as "yyyy-mm-dd".
- "url" (str): The URL of the original publication.
- "category" (str): The article's topic, e.g., economy, society.
- "title" (str): The title of the article.
- "subtitle" (str): The subtitle of the article.
- "summary" (str): The combined title + subtitle, which acts as a proxy for a reference summary.
- "text" (str): The news article.
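A minimal loading sketch, using a hypothetical file path to illustrate the schema (adjust it to wherever BasqueSumm is stored in the repository):

```python
import json

# Hypothetical path to the BasqueSumm JSONLines file
with open("data/BasqueSumm.jsonl") as f:
    basquesumm = [json.loads(line) for line in f]

# Each record pairs a news article with its title + subtitle as a proxy summary
example = basquesumm[0]
print(example["date"], example["url"], example["category"])
print("Proxy summary:", example["summary"])
```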
We release BASSE and BasqueSumm under a CC BY-NC-SA 4.0 license.
Please cite the following paper if you use the BASSE corpus or its associated codebase:
@misc{barnes2025summarizationmetricsspanishbasque,
  title={Summarization Metrics for {S}panish and {B}asque: Do Automatic Scores and {LLM}-Judges Correlate with Humans?},
  author={Jeremy Barnes and Naiara Perez and Alba Bonet-Jover and Begoña Altuna},
  year={2025},
  eprint={2503.17039},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.17039},
}