This repository contains the code and data used for the paper Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation by Joseph Imperial, Gail Forey, and Harish Tayyar Madabushi accepted to EMNLP 2024 (Main).
The paper makes use of the following existing datasets which can be found in the data folder. Please cite the associated papers when using the preprocessed data in this work.
- European Language Grid (ELG) - `elg_data.csv` contains CEFR-labelled narratives used as prompts for Task 1 in Section 6. Data can be downloaded here. Citation found below.

  Breuker, M. (2022). CEFR Labelling and Assessment Services. In European Language Grid: A Language Technology Platform for Multilingual Europe (pp. 277-282). Cham: Springer International Publishing.
- Cambridge Exams - CEFR-labelled exam narratives used as the gold-standard reference from which linguistic features are extracted for the Standardize framework, particularly through Standardize-L. The raw text dataset can be downloaded here. To get the linguistic features, use LFTK. Citation found below.

  Menglin Xia, Ekaterina Kochmar, and Ted Briscoe. 2016. Text Readability Assessment for Second Language Learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 12–22, San Diego, CA. Association for Computational Linguistics.
- Corpus of Contemporary American English (COCA) - `coca_data.csv` contains a sample of the large COCA dataset used for Task 2 in Section 6. These are keywords used to prompt the LLM to generate narratives with the keyword as the main topic. Citation found below.

  Davies, Mark. "The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights." International Journal of Corpus Linguistics 14.2 (2009): 159-190.
- CCS Exemplars - CCS-labelled stories used as the gold-standard reference from which linguistic features are extracted for the Standardize framework, particularly through Standardize-L. To get the raw text data, please contact the authors of the citation below. As with the Cambridge Exams data, the linguistic features can be extracted with LFTK (see the sketch after this list). Citation found below.

  Michael Flor, Beata Beigman Klebanov, and Kathleen M. Sheehan. 2013. Lexical Tightness and Text Complexity. In Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility, pages 29–38, Atlanta, Georgia. Association for Computational Linguistics.
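Below is a minimal sketch of how feature extraction with LFTK might look for either corpus, following the library's documented usage. The specific feature keys and the example sentence are illustrative placeholders, not the exact feature set used in the paper.

```python
# Minimal sketch: extracting linguistic features with LFTK (pip install lftk spacy).
# The feature keys below are illustrative; the paper's exact feature set may differ.
import spacy
import lftk

nlp = spacy.load("en_core_web_sm")

def extract_features(text: str) -> dict:
    """Run spaCy on the text, then pull a handful of handcrafted features via LFTK."""
    doc = nlp(text)
    extractor = lftk.Extractor(docs=doc)
    # e.g., average words per sentence, Kuperman age-of-acquisition, noun count
    return extractor.extract(features=["a_word_ps", "a_kup_pw", "n_noun"])

print(extract_features("The cat sat quietly on the warm windowsill."))
```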
The paper makes use of the four models listed below. Please make sure to cite the corresponding model papers when using them.

Open-Weight Models
- Llama2-Chat (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- OpenChat (https://huggingface.co/openchat/openchat-3.5-0106)
- LongForm (https://huggingface.co/akoksal/LongForm-OPT-2.7B)
Closed Models
- GPT-4 - The study uses the paid OpenAI API to access GPT-4. However, GPT-4o is now the default model, and we advise using this newer version instead for its higher capabilities and lower cost.
For all Hugging Face models (Llama2-Chat, LongForm, OpenChat), you first need to create a Hugging Face account, request access to these models, and use your own user access token in the code.
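A minimal sketch of loading one of these gated models with your access token is shown below. The `token` keyword follows current transformers versions (older versions use `use_auth_token`); the token string itself is a placeholder.

```python
# Sketch: loading a gated Hugging Face model with a user access token.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
AUTH_TOKEN = "hf_..."  # replace with your own read/write access token

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=AUTH_TOKEN)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, token=AUTH_TOKEN)
```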
Some parts of the code may mention the word Specgem; this is the old name of the framework before we changed it to Standardize.
The code makes use of the following arguments and their definitions:
- `spec` - choose either 'cefr' or 'ccs'
- `dataset_path` - path of the prompt data used as input
- `file_output_name` - desired filename of the CSV output
- `knowledge_base_path` - specifications from either CEFR or CCS, found in their respective files
- `classifier_features_path` - linguistic features from the gold-standard CEFR or CCS data
- `model_api_url` - Hugging Face-style model identifier (e.g., "meta-llama/Llama-2-7b-chat-hf")
- `max_length` / `min_length` - target maximum and minimum length of the generated content; default to 300 and 30, respectively
- `top_p` - nucleus sampling value; defaults to 0.95
- `auth_token` - your Hugging Face read/write access token
- `method` - see below
- `icralm_type` - see below
The `method` argument depends on the type of model you want to use, whether it comes from Hugging Face or OpenAI. It can take the following values:
- `simple-prompt-hf` - use this for the teacher-style method of prompting with HF models
- `simple-prompt-openai` - same as above but using OpenAI models
- `ic-ralm-hf` - use this for Standardize-A and Standardize-E with HF models
- `ic-ralm-openai` - same as above but using OpenAI models
- `specgem-hf` - use this for Standardize-L and Standardize-★ (all artifacts) with HF models
- `specgem-openai` - same as above but using OpenAI models
The `icralm_type` argument identifies which knowledge artifact you want to use with respect to the Standardize framework. See Section 5 of the paper. It can take the following values:
- `standard` - used for Standardize-A (aspect-based knowledge artifact)
- `exemplar` - used for Standardize-E (exemplar-based knowledge artifact)
- `all` - combines the `standard` and `exemplar` artifacts; use this value for Standardize-L or Standardize-★
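As a minimal sketch of how these arguments fit together, the parser below mirrors the documented interface; it assumes `script.py` uses argparse, and the exact defaults, required flags, and help strings in the repo may differ.

```python
# Hypothetical sketch of the command-line interface described above.
import argparse

parser = argparse.ArgumentParser(description="Standardize content generation")
parser.add_argument("--spec", choices=["cefr", "ccs"], required=True)
parser.add_argument("--dataset_path", required=True)
parser.add_argument("--file_output_name", required=True)
parser.add_argument("--knowledge_base_path")
parser.add_argument("--classifier_features_path")
parser.add_argument("--model_api_url")
parser.add_argument("--max_length", type=int, default=300)
parser.add_argument("--min_length", type=int, default=30)
parser.add_argument("--top_p", type=float, default=0.95)
parser.add_argument("--auth_token")
parser.add_argument("--method", choices=["simple-prompt-hf", "simple-prompt-openai",
                                         "ic-ralm-hf", "ic-ralm-openai",
                                         "specgem-hf", "specgem-openai"])
parser.add_argument("--icralm_type", choices=["standard", "exemplar", "all"])
args = parser.parse_args()
```

Example invocations are listed below.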
- Generate using teacher-style prompting, CCS data, and the Llama2-Chat model:
python script.py --dataset_path "CCS/coca_data.csv" --model_api_url "meta-llama/Llama-2-7b-chat-hf" --method "simple-prompt-hf" --file_output_name "coca_llama_7b_simpleprompt.csv" --auth_token AUTH_TOKEN --max_length 300 --knowledge_base_path "CCS/ccs_specs_finegrained.csv" --spec "ccs"
- Generate using Standardize-A (aspect knowledge artifact), CEFR data, and the Llama2-Chat model:
python script.py --dataset_path "cefr/elg_data.csv" --model_api_url "meta-llama/Llama-2-7b-chat-hf" --method "ic-ralm-hf" --file_output_name "elg_llama2_aspect.csv" --auth_token AUTH_TOKEN --max_length 300 --knowledge_base_path "cefr/cefr_specs.csv" --spec "cefr" --classifier_features_path "cefr/cambridge_all_features.csv" --icralm_type "standard"
- Generate using Standardize-E (exemplar knowledge artifact), CEFR data, and the Llama2-Chat model:
python script.py --dataset_path "cefr/elg_data.csv" --model_api_url "meta-llama/Llama-2-7b-chat-hf" --method "ic-ralm-hf" --file_output_name "elg_llama2_exemplar.csv" --auth_token AUTH_TOKEN --max_length 300 --knowledge_base_path "cefr/cefr_specs.csv" --spec "cefr" --classifier_features_path "cefr/cambridge_all_features.csv" --icralm_type "exemplar"
- Generate using Standardize-★ (all knowledge artifacts), CEFR data, and the Llama2-Chat model:
python script.py --dataset_path "cefr/elg_data.csv" --model_api_url "meta-llama/Llama-2-7b-chat-hf" --method "specgem-hf" --file_output_name "elg_llama2_standardize.csv" --auth_token AUTH_TOKEN --max_length 300 --knowledge_base_path "cefr/cefr_specs.csv" --spec "cefr" --classifier_features_path "cefr/cambridge_all_features.csv" --icralm_type "all"
- Generate using Standardize-★ (all knowledge artifacts), CEFR data, and GPT-4:
python script.py --dataset_path "CEFR/elg_data.csv" --model_api_url "gpt" --method "specgem-openai" --file_output_name "elg_gpt4_standardize.csv" --auth_token AUTH_TOKEN --max_length 300 --knowledge_base_path "CEFR/cefr_specs_finegrained.csv" --spec "cefr" --classifier_features_path "CEFR/cambridge_all_features.csv" --icralm_type "all"
You can easily switch from CEFR to CCS data and choose whichever HF model you want to use. Most models are handled by the AutoTokenizer and AutoModelForCausalLM classes in model_utils.py. If not, just add the specific Auto model class.
The eval folder contains eval_script.ipynb, a Python notebook with both the automatic model-based evaluation and the fluency/diversity evaluation described in Section 6.3.
For model-based evaluation, you need cambridge_features.csv for CEFR and commoncore_10_features_bin_with_sbert.csv for CCS. These files contain the extracted linguistic features used to train the model-based classifiers (Random Forest and XGBoost for CEFR and CCS, respectively) described in Section 6.3. You will also need to provide elg_data.csv or coca_data.csv, both present in the repo, as well as a CSV file for generation_file_name containing the model generations you want to evaluate.
For precise accuracy, the script outputs a classification_report based on the model classifier's predictions, from which you can read off the values.
For adjacent accuracy, the code after the classification report computes this. Note that adjacent accuracy should only be used for CEFR and not CCS, as it requires ordinal labels. A rough sketch of both metrics is given below.
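The sketch below assumes the labels have already been mapped to integers (e.g., A1=0 through C2=5) and that the features of the generations have been extracted into a CSV; the file and column names here are placeholders rather than the exact ones used in the notebook.

```python
# Sketch: model-based precise and adjacent accuracy (file/column names are placeholders).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

gold = pd.read_csv("cambridge_features.csv")      # gold-standard linguistic features + numeric label
gen = pd.read_csv("generation_features.csv")      # features extracted from model generations

feature_cols = [c for c in gold.columns if c != "label"]
clf = RandomForestClassifier(random_state=42).fit(gold[feature_cols], gold["label"])

pred = clf.predict(gen[feature_cols])
print(classification_report(gen["label"], pred))  # precise accuracy per level

# Adjacent accuracy: prediction within one level of the target (CEFR only, ordinal labels).
adjacent = np.mean(np.abs(pred - gen["label"].to_numpy()) <= 1)
print(f"Adjacent accuracy: {adjacent:.3f}")
```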
For evaluating fluency and diversity, frugalscore and distinct-n are used. The formula for distinct-n is already in the script; just make sure that frugalscore.py is in the same folder as eval_script.ipynb.
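For reference, distinct-n is simply the ratio of unique n-grams to total n-grams over the generated texts. A minimal stand-alone version is sketched below; the notebook's own implementation may differ slightly in tokenization.

```python
# Sketch: distinct-n = unique n-grams / total n-grams over a set of generations.
def distinct_n(texts, n=2):
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()  # simple whitespace tokenization for illustration
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

print(distinct_n(["the cat sat on the mat", "the dog sat on the rug"], n=2))
```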
If you use any resource from this repository, please cite the Standardize paper as referenced below:
@inproceedings{imperial-etal-2024-standardize,
title = "Standardize: {A}ligning {L}anguage {M}odels with {E}xpert-{D}efined {S}tandards for {C}ontent {G}eneration",
author = "Imperial, Joseph Marvin and Forey, Gail and Tayyar Madabushi, Harish",
editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida",
publisher = "Association for Computational Linguistics"
}