
jl908069/gum_sum_salience


gum_sum_salience

Data and code for the ACL 2025 Findings paper "GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction".

Data is also available from GUM

Prepare datasets

  • Download the GUM tsv and xml folders from GUM and put them in ./data. Place the train, dev, and test documents in their corresponding folders.
  • The ./data folder should look like this:
├── data
    └── input
        └── tsv                                   # gold tsv files from GUM (note: currently, use the `_build/src/tsv/` files to run `serialize.py` and the `/coref/gum/tsv/` files to get mentions)
            └── train
            └── dev
            └── test
        └── xml                                   # gold xml files from GUM
            └── train
            └── dev
            └── test
    └── output
        └── xml                                   # output xml files with multiple generated summaries
        └── tsv                                   # output tsv files with graded salience information
        └── alignment                             # alignment results from stanza, LLM, string_match
            └── stanza                            # json files with predicted salient entities using stanza
            └── LLM                               # json files with predicted salient entities using LLM (GPT4o)
            └── string_match                      # json files with predicted salient entities using string_match
        └── summaries                             # human or LLM generated summaries
            └── train                             # LLM generated summaries
                └── {model_name} folder
            └── dev                               # human crowdsourced summaries (h1~h5 folders)
            └── test                              # human crowdsourced summaries (h1~h5 folders)
        └── ensemble
            └── `graded_sal_meta_learner_dev.tsv` # training tsv file for the ensemble logistic regression model
            └── train                             # prediction tsv files obtained from `alignment` to run `ensemble.py`
            └── dev
            └── test 
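
Before running the pipeline, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: `EXPECTED_DIRS`, `missing_dirs`, and `create_layout` are hypothetical names, and the folder list is copied from the tree above (the `{model_name}` and `h1~h5` sub-folders are left out because their names depend on your setup).

```python
from pathlib import Path

# Sub-folders copied from the tree above, relative to the data root (e.g. ./data).
EXPECTED_DIRS = [
    "input/tsv/train", "input/tsv/dev", "input/tsv/test",
    "input/xml/train", "input/xml/dev", "input/xml/test",
    "output/xml", "output/tsv",
    "output/alignment/stanza",
    "output/alignment/LLM",
    "output/alignment/string_match",
    "output/summaries/train", "output/summaries/dev", "output/summaries/test",
    "output/ensemble/train", "output/ensemble/dev", "output/ensemble/test",
]


def missing_dirs(data_root):
    """Return the expected sub-folders that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


def create_layout(data_root):
    """Create every expected sub-folder; folders that already exist are left alone."""
    root = Path(data_root)
    for d in EXPECTED_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
```

Calling `missing_dirs("./data")` returns an empty list once every folder is present.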

main.py

  • Set up argument parsing for the script
  • Load documents (tsv, xml), generate summaries, parse summaries, align mentions, and serialize results

get_summary.py

  • Define a function get_summary(doc_text, n=4) that interacts with APIs (Huggingface, Anthropic, OpenAI) to generate n summaries

parse.py

  • Define a function parse(summary_text) that uses stanza to return a list of noun phrase (NP) strings, one for each nominal mention in the summary (pronouns excluded)

align.py

  • Define a function align(doc_mentions, summary_text, mention_text) that aligns mentions from the summary with those in the document
  • Use one of these components (LLM, LLM_hf, string_match, stanza) to perform the alignment
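
To give a feel for the simplest of these components, here is a minimal sketch of a string-match alignment. It is an illustration under stated assumptions, not the code in `align.py`: the function name `string_match_align`, the `(entity_id, mention_string)` input format, and the case-insensitive whole-word matching rule are all assumptions made for this example.

```python
import re


def string_match_align(doc_mentions, summary_text):
    """Return the IDs of entities with at least one mention string found in the summary.

    doc_mentions: iterable of (entity_id, mention_string) pairs (assumed format).
    Matching is case-insensitive and ignores differences in whitespace.
    """
    summary = " ".join(summary_text.lower().split())
    aligned = set()
    for entity_id, mention in doc_mentions:
        needle = " ".join(mention.lower().split())
        if not needle:
            continue
        # Lookarounds act as word-boundary guards, so "art" does not match inside "party".
        pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
        if re.search(pattern, summary):
            aligned.add(entity_id)
    return aligned
```

An entity counts as aligned as soon as any one of its mention strings occurs in the summary, which is why the function returns entity IDs rather than mentions.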

serialize.py

  • Define a function serialize(tsv, xml, alignments) that takes the alignments and produces:
    • A TSV file with new annotations for salience
    • An XML file with new summaries embedded in the element
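
The salience annotations are graded because each document has several summaries. A minimal sketch of one way to turn per-summary alignments into a graded score follows. It assumes the score of an entity is simply the number of summaries in which that entity was aligned; the function name `graded_salience` and the input format are hypothetical, and the actual TSV serialization is not shown.

```python
from collections import Counter


def graded_salience(entity_ids, alignments_per_summary):
    """Count, for each entity, the number of summaries in which it was aligned.

    entity_ids: all entity IDs in the document.
    alignments_per_summary: one collection of aligned entity IDs per summary
    (assumed format).
    """
    counts = Counter()
    for aligned in alignments_per_summary:
        # set() ensures an entity is counted at most once per summary.
        counts.update(set(aligned))
    return {entity_id: counts[entity_id] for entity_id in entity_ids}
```

With five summaries per document, this yields a score between 0 and 5 for every entity, including entities that no summary mentions.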

ensemble.py

  • Take alignments from {string_match, stanza, LLM}, train a logistic regression model for predicting salient entities, and write the annotations to tsv files

  • Example:

    python3 ensemble.py \
        --data_folder ./data \
        --partition train \
        --alignment_components stanza LLM string_match \
        --model_names gold gpt4o claude-3-5-sonnet-20241022 meta-llama/Llama-3.2-3B-Instruct Qwen2.5-7B-Instruct
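
The idea behind the ensemble can be sketched in a few lines: each alignment component contributes one binary feature per entity, and a logistic regression learns how much to trust each component. The sketch below is a dependency-free illustration trained with plain gradient descent, not the code in `ensemble.py`; the function names, the feature order (stanza, LLM, string_match), and the hyperparameters are assumptions. In practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice.

```python
import math


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights with batch gradient descent.

    X: list of feature rows, e.g. [stanza, LLM, string_match] votes (0 or 1).
    y: list of gold labels (1 = salient, 0 = not salient).
    """
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 if the predicted probability of salience reaches the threshold."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return int(1.0 / (1.0 + math.exp(-z)) >= threshold)
```

The learned weights show how strongly each component's vote pushes an entity toward the salient class, which is the main reason to prefer a trained combiner over a fixed majority vote.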

score.py

  • Compute micro- and macro-averaged precision, recall, and F1 over salient entities (not mentions) for each of the alignment components
  • Score the 'test' set by default
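
The difference between the two averages can be shown with a short sketch. It is not the code in `score.py`: the function names and the `doc_id -> set of entity IDs` input format are hypothetical, and macro-averaging is assumed here to mean averaging the per-document scores, while micro-averaging pools the counts over all documents.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def score_entities(gold_by_doc, pred_by_doc):
    """Score predicted salient entities against gold, per document and pooled.

    gold_by_doc, pred_by_doc: dicts mapping doc_id -> set of salient entity IDs
    (assumed format). Returns {"micro": (P, R, F1), "macro": (P, R, F1)}.
    """
    total_tp = total_fp = total_fn = 0
    per_doc = []
    for doc_id, gold in gold_by_doc.items():
        pred = pred_by_doc.get(doc_id, set())
        tp, fp, fn = len(gold & pred), len(pred - gold), len(gold - pred)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
        per_doc.append(prf(tp, fp, fn))
    n = len(per_doc)
    macro = tuple(sum(s[i] for s in per_doc) / n for i in range(3)) if n else (0.0, 0.0, 0.0)
    return {"micro": prf(total_tp, total_fp, total_fn), "macro": macro}
```

Because the sets hold entity IDs, an entity with many mentions counts once, which matches scoring entities rather than mentions.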
