This repository contains the dataset and code for the paper "Benchmarking Query-conditioned Natural Language Inference" (Canby et al., 2025).

*Figure: Natural language inference (NLI). (a) Sentence-level NLI assigns a label ℓ indicating the semantic relationship between a premise sentence s<sub>p</sub> and a hypothesis sentence s<sub>h</sub>. (b) Document-level NLI conditions ℓ on a premise document d<sub>p</sub> and a hypothesis document d<sub>h</sub>. (c) Query-conditioned NLI conditions each label ℓ<sub>i</sub> on the premise document d<sub>p</sub>, the hypothesis document d<sub>h</sub>, and a query q<sub>i</sub>, which indicates the aspect of the documents on which the semantic relationship should be judged.*
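To make the query-conditioned setting concrete, here is a toy instance (the documents, query, label, and field names below are invented for illustration and do not reflect the exact dataset schema; see the files in `data/` for the real format):

```python
# A toy QC-NLI instance (illustrative only).
example = {
    "premise_document": "The museum opened in 1902. Admission is free on Sundays.",
    "hypothesis_document": "The museum, founded in 1902, charges admission every day.",
    "query": "When is admission free?",
    # Conditioned on this query, the hypothesis contradicts the premise:
    # the premise says Sundays are free; the hypothesis says every day is charged.
    "label": "contradiction",
}
```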
## Prerequisites

- Python 3.8+
- Required API keys (OpenAI, Google AI)

## Installation

- Clone this repository:

  ```bash
  git clone https://github.com/amazon-science/Query-Conditioned-NLI.git
  cd Query-Conditioned-NLI
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set up API keys:

  ```bash
  export OPENAI_API_KEY="your-openai-key"
  export GOOGLE_API_KEY="your-google-key"
  ```
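Before running any experiments, you can sanity-check that both keys are visible to Python (a small convenience snippet, not part of this repository; the variable names are the ones exported above):

```python
import os

# Fail fast if either provider key from the setup step above is missing.
for var in ("OPENAI_API_KEY", "GOOGLE_API_KEY"):
    if not os.environ.get(var):
        raise SystemExit(f"Missing environment variable: {var}")
print("All API keys are set.")
```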
## Dataset

The QC-NLI dataset is located in the `data/` folder and includes adaptations of four existing datasets:

| Dataset | Task | Size | Label Set |
|---|---|---|---|
| SNLI (Bowman et al., 2015) | Image descriptions | 4,452 | `entailment`, `not_entailment` |
| RobustQA (Han et al., 2023) | Inconsistent document detection | 2,578 | `contradiction`, `not_contradiction` |
| RAGTruth (Niu et al., 2024) | Hallucination detection | 829 | `entailment`, `not_entailment` |
| FactScore (Min et al., 2023) | Fact verification | 13,796 | `entailment`, `not_entailment` |
## Evaluating Models

Use `src/perform_task.py` to evaluate models on QC-NLI data:

```bash
python src/perform_task.py \
    --dataset robustqa \
    --prompt-type zero \
    --do-merge True \
    --use-query True \
    --start-num 0 \
    --model gpro
```
Parameters:

- `--dataset`: Dataset to use. Options: `snli`, `ragtruth`, `robustqa`, `factscore_chatgpt`, `factscore_instructgpt`, `factscore_perplexityai`
- `--prompt-type`: Prompting strategy
  - `zero`: Zero-shot prompting
  - `few`: Few-shot prompting
  - `qanli`: QA+NLI (question answering followed by NLI)
- `--do-merge`: Merge `neutral` and `contradiction` into `not_entailment` (set to `True` for the experiments in the paper)
- `--use-query`: Include the query during inference (`True`/`False`)
- `--start-num`: Starting index in the dataset (typically `0`)
- `--model`: Model to use
  - `gpt`: GPT-4o
  - `gpt3`: GPT-3.5-turbo-0125
  - `gpt4`: GPT-4-0613
  - `gflash`: Gemini 1.5 Flash
  - `gpro`: Gemini 1.5 Pro
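To run a grid of evaluations, you can wrap the script in a small driver. The sketch below is a convenience script, not part of this repository; it only uses the flags documented above and sweeps the three prompting strategies for one dataset and model:

```python
import subprocess

DATASET = "robustqa"
MODEL = "gpro"  # Gemini 1.5 Pro

# Run each prompting strategy with the paper's settings (--do-merge True).
for prompt_type in ("zero", "few", "qanli"):
    subprocess.run(
        [
            "python", "src/perform_task.py",
            "--dataset", DATASET,
            "--prompt-type", prompt_type,
            "--do-merge", "True",
            "--use-query", "True",
            "--start-num", "0",
            "--model", MODEL,
        ],
        check=True,
    )
```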
## Generating QC-NLI Data

Use `src/perform_generations.py` to convert existing datasets into QC-NLI format:

```bash
python src/perform_generations.py \
    --dataset snli \
    --partition train \
    --start-num 0 \
    --model gpt
```
Parameters:

- `--dataset`: Source dataset. Options: `snli`, `ragtruth`, `robustqa`, `factscore`
- `--partition`: Data partition to convert (valid partitions depend on the dataset)
  - SNLI: `train`, `val`, `test`
  - RobustQA: `all`
  - RAGTruth: `train`, `test`
  - FactScore: `chatgpt`, `instructgpt`, `perplexityai`
- `--start-num`: Starting index in the dataset (typically `0`)
- `--model`: Model for generation (same options as above)
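For example, to convert every FactScore partition in one go (again a convenience wrapper around the documented flags, not part of this repository):

```python
import subprocess

# The three FactScore partitions listed above.
for partition in ("chatgpt", "instructgpt", "perplexityai"):
    subprocess.run(
        [
            "python", "src/perform_generations.py",
            "--dataset", "factscore",
            "--partition", partition,
            "--start-num", "0",
            "--model", "gpt",
        ],
        check=True,
    )
```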
## Adapting a New Dataset

To adapt a new dataset to QC-NLI format:

- Create a class extending `ExampleGenerator` in `src/generator.py`
- Implement the required methods:
  - `read_data(self)`: Load your dataset
  - `generate(self, idx)`: Convert the `idx`-th data example to QC-NLI format

Example structure:

```python
class YourDatasetGenerator(ExampleGenerator):
    def __init__(self, **kwargs):
        self.dname = 'your-dataset-name'
        super().__init__(**kwargs)

    def read_data(self):
        # Load your dataset
        pass

    def generate(self, idx):
        # Convert the idx-th example to QC-NLI format
        pass
```
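As a concrete (hypothetical) illustration, a generator for a JSONL file might look like the sketch below. The file path, field names, and the dictionary returned by `generate` are assumptions; match them to what `ExampleGenerator` and the existing generators in `src/generator.py` actually expect:

```python
# Lives in src/generator.py, where ExampleGenerator is defined.
import json

class JsonlDatasetGenerator(ExampleGenerator):
    def __init__(self, **kwargs):
        self.dname = 'my-jsonl-dataset'
        super().__init__(**kwargs)

    def read_data(self):
        # Hypothetical path and schema: one JSON object per line with
        # "premise", "hypothesis", and "query" fields.
        with open('data/my-jsonl-dataset.jsonl') as f:
            self.examples = [json.loads(line) for line in f]

    def generate(self, idx):
        # Return the idx-th example in QC-NLI form. The keys below are
        # illustrative; mirror the output format of the existing generators.
        ex = self.examples[idx]
        return {
            'premise_document': ex['premise'],
            'hypothesis_document': ex['hypothesis'],
            'query': ex['query'],
        }
```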
Coming soon!
## License

This library is licensed under the CC-BY-4.0 License.

## Contributing

See CONTRIBUTING for more information.

## Contact

For questions or issues, please contact marc.canby@gmail.com or open an issue on GitHub.