This repository provides the code implementation of HypoEval: Hypothesis-Guided Evaluation of Natural Language Generation, as well as zero-shot evaluators for summarization and story generation built with training data from the datasets used in the paper. With only 30 human annotations per evaluated aspect, HypoEval first generates hypotheses that decompose the aspect into concrete dimensions, and then uses a checklist-like approach that combines an LLM's Likert scores on each hypothesis into an overall score for the evaluated text. HypoEval provides automated, interpretable evaluation of natural language generation with high alignment with human evaluations.
The hypothesis generation module of HypoEval is built upon HypoRefine and HypoGeniC; check out the repository at ChicagoHAI/hypothesis-generation.
May '25: HypoEval is now incorporated into quotient-ai/judges!
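For intuition, below is a minimal sketch of the checklist-style scoring idea described above: an LLM assigns a Likert score to each decomposed hypothesis, and the per-hypothesis scores are combined into one overall score. The example hypotheses and the plain average are illustrative assumptions, not the repository's actual hypotheses or combination method.

```python
from statistics import mean

# Illustrative checklist hypotheses for "coherence" (invented examples, not the
# hypotheses generated by HypoEval).
hypotheses = [
    "The summary presents events in a logical order.",
    "Sentences in the summary connect without abrupt topic shifts.",
]

# Suppose an LLM assigned these 1-5 Likert scores to the two hypotheses for one summary.
likert_scores = [4, 5]

# Combine the per-hypothesis scores into one overall score (a plain average here;
# the repository's actual combination may differ).
overall_score = mean(likert_scores)
print(f"{len(hypotheses)} hypotheses -> overall coherence score {overall_score}")  # 4.5
```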
We provide 0-shot hypothesis-guided evaluators for summaries and story generations, with the hypotheses generated and selected using training data in `data/`.

To use the evaluator for summaries on an aspect in `["coherence", "consistency", "informativeness", "fluency", "relevance"]`:
```python
from hypoeval.evaluator import SummaryEvaluator

evaluator = SummaryEvaluator(model_name=MODEL_NAME, model_path=MODEL_PATH)  # (optional) specify model path for local models
evaluated_aspect = "coherence"
summary_list = ["...", "..."]
source_text_list = ["...", "..."]
evaluation_scores = evaluator.batched_evaluate(aspect=evaluated_aspect, summaries=summary_list, source_texts=source_text_list)
```
To use the evaluator for stories on an aspect in `["coherence", "cohesiveness", "complexity", "empathy", "engagement", "grammaticality", "likability", "relevance", "surprise"]`:
```python
from hypoeval.evaluator import StoryEvaluator

evaluator = StoryEvaluator(model_name=MODEL_NAME, model_path=MODEL_PATH)  # (optional) specify model path for local models
evaluated_aspect = "coherence"
story_list = ["...", "..."]
story_prompt_list = ["...", "..."]
evaluation_scores = evaluator.batched_evaluate(aspect=evaluated_aspect, stories=story_list, story_prompts=story_prompt_list)
```
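Since `model_path` is optional, the same constructors can point at either a hosted API model or a local checkpoint. A small sketch, assuming the placeholder model names and path below (they are illustrations, not a list of supported backends):

```python
from hypoeval.evaluator import SummaryEvaluator, StoryEvaluator

# Hosted API model: only a model name is needed (the name here is a placeholder).
summary_evaluator = SummaryEvaluator(model_name="gpt-4o-mini")

# Local model: also pass model_path (both values below are placeholders).
story_evaluator = StoryEvaluator(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    model_path="/path/to/local/checkpoint",
)
```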
Adding a new evaluated aspect requires a small-scale corpus of human evaluation scores on that aspect. Follow the steps below:
- Preprocess the human evaluation scores into the same format as `data/summeval/train_continuous_coherence.json` or `data/hanna/train_continuous_coherence.json` (for a worked example that builds this file from a CSV of annotations, see the sketch after this list):
```python
import json

# for summarization
new_human_data = {"candidate_summary": summary_list, "source_text": source_text_list, "label": human_score_list}
# for story generation
new_human_data = {"story": story_list, "prompt": story_prompt_list, "label": human_score_list}

with open(f"./data/{TASK_NAME}/train_continuous_{NEW_ASPECT}.json", "w") as file:
    json.dump(new_human_data, file)
```
- Modify `get_aspect_definition` in `hypoeval_reproduce/utils.py` to add the definition of the new aspect.
- Generating hypotheses. Modify `hypothesis_generation/hyporefine_pipeline.py` to specify the new aspect, and then run `python hyporefine_pipeline.py --model_name MODEL_NAME --task_name TASK_NAME`.
- Hypothesis selection. Modify `hypoeval/summary_evaluate_selection.py` or `hypoeval/story_evaluate_selection.py` to specify the new aspect, then run `python summary_evaluate_selection.py --model_name MODEL_NAME` or `python story_evaluate_selection.py --model_name MODEL_NAME`.
- Evaluation. Follow the same steps as in "Use 0-shot Evaluators for Summarization and Story Generation" above.
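As referenced in the preprocessing step above, here is a hypothetical helper that builds the training JSON for a new summarization aspect from a CSV of human annotations. The CSV column names (`summary`, `source`, `score`), the file names, and the helper itself are assumptions for illustration, not part of the repository.

```python
import csv
import json

def build_training_json(csv_path, task_name, new_aspect):
    """Convert a CSV of human annotations into the training JSON format shown above.
    Assumed CSV columns: 'summary', 'source', 'score' (adapt to your annotation format)."""
    summary_list, source_text_list, human_score_list = [], [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            summary_list.append(row["summary"])
            source_text_list.append(row["source"])
            human_score_list.append(float(row["score"]))
    new_human_data = {
        "candidate_summary": summary_list,
        "source_text": source_text_list,
        "label": human_score_list,
    }
    with open(f"./data/{task_name}/train_continuous_{new_aspect}.json", "w") as file:
        json.dump(new_human_data, file)

# Example call (hypothetical file and aspect name):
# build_training_json("my_annotations.csv", "summeval", "conciseness")
```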
We include all original data for the four datasets in `data_original/` and the training data together with prompts in `data/`.
To reproduce results in the paper:
- Generating hypotheses. Modify `hypothesis_generation/hyporefine_pipeline.py` to specify the evaluated aspects (e.g., coherence for SummEval), and then run `python hyporefine_pipeline.py --model_name MODEL_NAME --task_name TASK_NAME`.
- Hypothesis selection and evaluation. Modify `hypoeval_reproduce/evaluate_pipeline.py` to specify the evaluated aspects and random seeds, then run `python evaluate_pipeline.py --model_name MODEL_NAME --task_name TASK_NAME`, where `TASK_NAME` should be one of `["summeval", "newsroom", "hanna", "writingprompt"]`.
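To run the reproduction over every dataset in one go, a small driver script like the one below can loop over the task names. It is an illustrative convenience, not part of the repository; the model name and working directory are assumptions.

```python
import subprocess

MODEL_NAME = "gpt-4o-mini"  # placeholder; substitute the model you evaluate with

# Run hypothesis selection and evaluation for each of the four supported datasets.
# Assumes this is launched from the repository root, with the pipeline in hypoeval_reproduce/.
for task_name in ["summeval", "newsroom", "hanna", "writingprompt"]:
    subprocess.run(
        ["python", "evaluate_pipeline.py", "--model_name", MODEL_NAME, "--task_name", task_name],
        cwd="hypoeval_reproduce",
        check=True,
    )
```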
Please consider citing our work if it contributes to your research:
```bibtex
@misc{li2025hypoevalhypothesisguidedevaluationnatural,
  title={HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation},
  author={Mingxuan Li and Hanchen Li and Chenhao Tan},
  year={2025},
  eprint={2504.07174},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.07174},
}
```