A simple tool for analyzing how well language models handle "None of the other Answers" (NOTA) options in medical question answering, especially under Chain-of-Thought (CoT) reasoning.
This project investigates whether large language models (LLMs) such as GPT, Claude, and DeepSeek-R1 can reliably identify when none of the answer choices in a medical multiple-choice question is correct. It compares model performance on the same questions with and without the requirement to recognize a NOTA option.
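For illustration, the NOTA condition replaces the correct choice with a "None of the other answers" option, so the model can only score the point by recognizing that every listed answer is wrong. The item below is made up for this README, not drawn from the evaluation data:

```python
# Made-up example of the two conditions; actual dataset items differ.
standard = {
    "question": "Which electrolyte disturbance most commonly causes peaked T waves?",
    "options": {"A": "Hyperkalemia", "B": "Hyponatremia",
                "C": "Hypocalcemia", "D": "Hypomagnesemia"},
    "answer": "A",
}
nota = {
    "question": standard["question"],
    # The correct choice is removed, so NOTA becomes the right answer.
    "options": {"A": "Hyponatremia", "B": "Hypocalcemia",
                "C": "Hypomagnesemia", "D": "None of the other answers"},
    "answer": "D",
}
```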
```bash
conda env create -f environment.yaml
conda activate cot-eval
```
Before running any experiments, add your API key to the config file at `scripts/config.py`.
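A minimal sketch of what that file might contain; the variable names below are assumptions for illustration, not the repository's actual ones:

```python
# scripts/config.py -- hypothetical layout; match the names the scripts expect.
import os

# Read keys from the environment rather than hard-coding secrets.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")
```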
Then register the model endpoints in `scripts/src/medqa_nato.py`.
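One plausible shape for that registration, assuming a simple name-to-URL mapping; the dictionary name and entries here are guesses, not the file's actual contents:

```python
# Hypothetical endpoint table; the real file may organize this differently.
MODEL_ENDPOINTS = {
    "gpt-4o": "https://api.openai.com/v1",
    "deepseek-r1": "https://api.deepseek.com/v1",
}
```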
```bash
cd scripts/data
python3 load_data.py            # download and prepare the dataset
cd ../src
python3 medqa_nato.py           # run the CoT / NOTA experiments
python3 nota_accuracy_stats.py  # compute the accuracy statistics
```
`nota_accuracy_stats.py` reports:

- Accuracy comparisons between the regular CoT and NOTA conditions
- Confidence intervals for model performance
- P-values for statistical significance testing (see the sketch after this list)
- Question-level insights: which questions showed the largest drops in accuracy
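A minimal sketch of how such statistics can be computed, assuming simple normal-approximation methods; the function names and example counts are illustrative, not taken from the script:

```python
# Illustrative statistics; nota_accuracy_stats.py may compute these differently.
from math import sqrt
from statistics import NormalDist

def proportion_ci(correct: int, total: int, level: float = 0.95):
    """Normal-approximation (Wald) confidence interval for an accuracy."""
    p = correct / total
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)

def two_proportion_z_test(c1: int, n1: int, c2: int, n2: int) -> float:
    """Two-sided p-value for the difference between two accuracies."""
    pooled = (c1 + c2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (c1 / n1 - c2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 412/500 correct under regular CoT vs. 318/500 under NOTA.
print(proportion_ci(412, 500))                    # CI for the CoT condition
print(two_proportion_z_test(412, 500, 318, 500))  # p-value for the gap
```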