Data and code for ACL 2025 Findings paper: GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction
Data is also available from GUM
- Download the GUM tsv and xml folders from GUM and put them in `./data`. Place the `train`, `dev`, and `test` documents in their corresponding folders.
- The `./data` folder should look like this:
```
├── data
│   ├── input
│   │   ├── tsv                # gold tsv files from GUM (Note: use _build/src/tsv/ files for running serialize.py now; use /coref/gum/tsv/ for getting mentions)
│   │   │   ├── train
│   │   │   ├── dev
│   │   │   └── test
│   │   └── xml                # gold xml files from GUM
│   │       ├── train
│   │       ├── dev
│   │       └── test
│   ├── output
│   │   ├── xml                # output xml files with multiple generated summaries
│   │   └── tsv                # output tsv files with graded salience information
│   ├── alignment              # alignment results from stanza, LLM, string_match
│   │   ├── stanza             # json files with predicted salient entities using stanza
│   │   ├── LLM                # json files with predicted salient entities using LLM (GPT4o)
│   │   └── string_match       # json files with predicted salient entities using string_match
│   ├── summaries              # human or LLM generated summaries
│   │   ├── train              # LLM generated summaries
│   │   │   └── {model_name} folder
│   │   ├── dev                # human crowdsourced summaries (h1~h5 folders)
│   │   └── test               # human crowdsourced summaries (h1~h5 folders)
│   └── ensemble
│       ├── graded_sal_meta_learner_dev.tsv   # training tsv file for the ensemble logistic regression model
│       ├── train              # prediction tsv files obtained from alignment to run ensemble.py
│       ├── dev
│       └── test
```
- Set up argument parsing for the script
- Load documents (tsv, xml), generate summaries, parse summaries, align mentions, and serialize results
- Define a function get_summary(doc_text, n=4) that interacts with APIs (Huggingface, Anthropic, OpenAI) to generate n summaries (see the sketch after this list)
- Define a function parse(summary_text) that returns a list of noun phrase (NP) strings corresponding to all nominal mention strings (excluding pronouns) using `stanza` (see the stanza-based sketch after this list)
- Define a function align(doc_mentions, summary_text, mention_text) that aligns mentions from the summary with those in the document (a string_match sketch follows this list)
- Use one of these components (LLM, LLM_hf, string_match, stanza) to perform the alignment
- Define a function serialize(tsv, xml, alignments) that takes the alignments and produces:
- A TSV file with new annotations for salience
- An XML file with new summaries embedded in the `<text>` element
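The sketches below are rough illustrations of the functions listed above, not the repo's actual implementations. First, `get_summary` might call a provider API along these lines; the OpenAI client, prompt wording, model name, and temperature are assumptions:

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_summary(doc_text, n=4):
    """Generate n candidate summaries of a document (sketch; prompt and model are placeholders)."""
    summaries = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system", "content": "Summarize the following document in one sentence."},
                {"role": "user", "content": doc_text},
            ],
            temperature=1.0,  # non-zero temperature so repeated calls yield different summaries
        )
        summaries.append(response.choices[0].message.content.strip())
    return summaries
```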
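`parse` could extract nominal mentions with stanza's constituency parser roughly as follows; the pronoun filter and the helper functions `_leaves` / `_collect_nps` are invented here for illustration:

```python
import stanza

# stanza.download("en")  # run once to fetch the English models
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,constituency")

def _leaves(tree):
    """Collect the surface tokens under a constituency node."""
    if not tree.children:
        return [tree.label]
    words = []
    for child in tree.children:
        words.extend(_leaves(child))
    return words

def _collect_nps(tree, nps):
    """Recursively gather NP strings, skipping NPs that are a lone pronoun."""
    if tree.label == "NP":
        only_child = tree.children[0] if len(tree.children) == 1 else None
        is_pronoun = only_child is not None and only_child.label in ("PRP", "PRP$")
        if not is_pronoun:
            nps.append(" ".join(_leaves(tree)))
    for child in tree.children:
        _collect_nps(child, nps)

def parse(summary_text):
    """Return NP strings for nominal mentions in the summary, excluding pronouns."""
    doc = nlp(summary_text)
    nps = []
    for sentence in doc.sentences:
        _collect_nps(sentence.constituency, nps)
    return nps
```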
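And the string_match flavor of `align` might reduce to a normalized string comparison like this sketch; the exact normalization and matching rules the repo uses are not specified here:

```python
def align(doc_mentions, summary_text, mention_text):
    """Align document mentions with summary mentions by exact (case-insensitive) string match.

    doc_mentions: list of mention strings from the source document
    summary_text: raw summary text (unused here; kept for interface parity with the
                  LLM / stanza alignment components)
    mention_text: list of NP strings extracted from the summary (e.g. by parse())
    """
    summary_mentions = {m.lower().strip() for m in mention_text}
    # Map each document mention to True/False: does it surface in the summary?
    return {m: m.lower().strip() in summary_mentions for m in doc_mentions}
```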
- Take alignments from {string_match, stanza, LLM}, train a logistic regression model for predicting salient entities, and write the annotations to tsv files (see the sketch after the example below)
- Example:

```bash
python3 ensemble.py \
  --data_folder ./data \
  --partition train \
  --alignment_components stanza LLM string_match \
  --model_names gold gpt4o claude-3-5-sonnet-20241022 meta-llama/Llama-3.2-3B-Instruct Qwen2.5-7B-Instruct
```
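A minimal sketch of the ensemble step with scikit-learn, assuming the training TSV holds one row per entity with per-component alignment features and a binary `salient` column; all column names and the prediction/output paths below are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical layout: one row per entity, one feature column per alignment
# component, and a binary gold label column named "salient".
feature_cols = ["stanza", "LLM", "string_match"]

train = pd.read_csv("data/ensemble/graded_sal_meta_learner_dev.tsv", sep="\t")
meta_learner = LogisticRegression(max_iter=1000)
meta_learner.fit(train[feature_cols], train["salient"])

# Score another partition, e.g. test (path and column names are hypothetical).
test = pd.read_csv("data/ensemble/test/predictions.tsv", sep="\t")
test["pred_salient"] = meta_learner.predict(test[feature_cols])
test["salience_prob"] = meta_learner.predict_proba(test[feature_cols])[:, 1]
test.to_csv("data/output/tsv/test_salience.tsv", sep="\t", index=False)
```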
- Micro/Macro Precision/Recall/F1 scores of salient entities (not mentions) for each of the alignment component approaches (see the sketch below)
- Defaults to scoring the 'test' set
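For reference, entity-level micro/macro scores could be computed with scikit-learn roughly as follows; how labels are built per entity, and the unit of macro-averaging, are assumptions here:

```python
from sklearn.metrics import precision_recall_fscore_support

def score(gold_labels, pred_labels):
    """gold_labels / pred_labels: parallel 0/1 salience labels, one per entity."""
    results = {}
    for avg in ("micro", "macro"):
        p, r, f1, _ = precision_recall_fscore_support(
            gold_labels, pred_labels, average=avg, zero_division=0
        )
        results[avg] = {"precision": p, "recall": r, "f1": f1}
    return results

# Example: score([1, 0, 1, 1], [1, 0, 0, 1])
```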