Foundational Papers
Paper | Tags | Venue/Source | Year | Code | Description |
---|---|---|---|---|---|
Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity | bias robustness reliability toxicity | arXiv | 2023 | N/A | A qualitative approach to red-teaming the ethical risks of ChatGPT. |
TrustLLM: Trustworthiness in Large Language Models | truthfulness safety fairness robustness privacy machine ethics | ICML | 2024 | Official, HuggingFace | A comprehensive framework for studying the trustworthiness of LLMs. |
LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users | bias alignment robustness | NeurIPS | 2024 | Poster | Shows that LLM answer quality degrades depending on user traits such as education level, English proficiency, and country of origin, disproportionately harming vulnerable users. |
When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour | robustness | arXiv | 2025 | N/A | Investigates sycophancy: the tendency of LLMs to generate responses that align with a user's viewpoint even when that viewpoint is factually incorrect. |
Measuring and Improving Consistency in Pretrained Language Models | factual-knowledge robustness consistency | TACL | 2021 | N/A | Pretrained language models are factually inconsistent, giving different answers to paraphrases of the same question; the paper proposes a method to improve the consistency of their factual knowledge. |
Locating and Editing Factual Associations in GPT | model-editing knowledge-localization | NeurIPS | 2022 | Official, Datasets | Shows that factual knowledge in transformers is stored in localized, mid-layer feed-forward modules and introduces ROME, a method for surgically updating these facts directly within the model. |
Self-Consistency Improves Chain of Thought Reasoning in Language Models | chain-of-thought reasoning ensembling | ICLR | 2023 | Slides | Introduces self-consistency, a decoding strategy that boosts chain-of-thought performance by sampling multiple reasoning paths and choosing the most consistent final answer (a minimal sketch follows this table). |
Large Language Models Are Human-Level Prompt Engineers | prompt-engineering instruction-generation | arXiv | 2023 | Official | Introduces Automatic Prompt Engineer (APE), a method that uses an LLM to automatically generate and select instructions that often outperform human-crafted prompts. |
Evaluating the Moral Beliefs Encoded in LLMs | alignment model-probing | NeurIPS | 2023 | Official | A survey-based statistical method for probing the moral beliefs of LLMs, revealing that models handle clear-cut cases well but are often uncertain or inconsistent in ambiguous scenarios. |
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting | prompt-engineering sensitivity robustness | ICML | 2024 | N/A | Minor prompt-formatting changes cause large, unpredictable performance swings in LLMs, undermining the reliability of model comparisons that rely on a single, fixed format. |
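
As a rough illustration of the self-consistency idea summarized in the table above, the sketch below samples several chain-of-thought completions and majority-votes over the extracted final answers. This is a minimal sketch under stated assumptions, not the paper's implementation: `sample_completion` is a hypothetical stand-in for a temperature-sampled LLM call, and the answer-extraction regex assumes completions end with a "The answer is ..." phrase.

```python
import re
import random
from collections import Counter


def sample_completion(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for a sampled LLM call.

    In practice this would query a model with temperature > 0 so that
    each call can return a different chain-of-thought completion.
    """
    # Toy behaviour for demonstration: most samples reason to 42, one goes astray.
    return random.choice([
        "Step 1: ... Step 2: ... The answer is 42",
        "Step 1: ... Step 2: ... The answer is 42",
        "Step 1: ... The answer is 40",
    ])


def extract_answer(completion: str) -> str | None:
    """Pull the final answer out of a completion (assumed 'The answer is ...' format)."""
    match = re.search(r"The answer is\s+([^\s.]+)", completion)
    return match.group(1) if match else None


def self_consistent_answer(prompt: str, num_samples: int = 10) -> str:
    """Sample several reasoning paths and return the most common final answer."""
    answers = []
    for _ in range(num_samples):
        answer = extract_answer(sample_completion(prompt))
        if answer is not None:
            answers.append(answer)
    # Majority vote over final answers; the differing reasoning paths are discarded.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(self_consistent_answer("Q: What is six times seven? A: Let's think step by step."))
```

Only the final answers are aggregated, so the vote stays robust to errors in any individual reasoning path, which is the core intuition behind the decoding strategy.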
Curated by: Dane Williamson & Karolina Naranjo