#

ai-safety

Here are 284 public repositories matching this topic...

jphall663 / awesome-machine-learning-interpretability

A curated list of awesome responsible machine learning resources.

Updated Sep 3, 2025

PKU-Alignment / safe-rlhf

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

Updated Sep 8, 2025
Python

OpenLMLab / MOSS-RLHF

Secrets of RLHF in Large Language Models Part I: PPO

alignment ai-safety rlhf

Updated Mar 3, 2024
Python

cvs-health / uqlm

UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection

uncertainty-quantification uncertainty-estimation ai-safety confidence-score hallucination confidence-estimation ai-evaluation llm llm-evaluation llm-safety hallucination-evaluation hallucination-detection hallucination-mitigation llm-hallucination

Updated Sep 15, 2025
Python

Pacific-AI-Corp / langtest

Deliver safe & effective language models

nlp artificial-intelligence benchmarks benchmark-framework model-assessment ai-safety mlops responsible-ai ml-safety trustworthy-ai ethics-in-ai ml-testing large-language-models llm ai-testing llm-test llm-evaluation-toolkit llm-as-evaluator llm-testing

Updated Sep 15, 2025
Python

agencyenterprise / PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Feb 26, 2024
Python

tigerlab-ai / tiger

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

classification data-augmentation ai-safety fine-tuning aisafety rag large-language-models llm llm-training

Updated Dec 2, 2023
Jupyter Notebook

hendrycks / ethics

Aligning AI With Shared Human Values (ICLR 2021)

ai-safety machine-ethics ml-safety ethical-ai gpt-3

Updated Apr 21, 2023
Python

ShengranHu / Thought-Cloning

[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

reinforcement-learning deep-learning pytorch artificial-intelligence imitation-learning ai-safety

Updated Jun 28, 2024
Python

Jiaqi-Chen-00 / ImBD

[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection

ai-safety ai-content-detector llm-detection mechine-text-detection

Updated Apr 2, 2025
Python

normster / llm_rules

RuLES: a benchmark for evaluating rule-following in language models

ai-safety ai-security gpt-4

Updated Feb 24, 2025
Python

cvs-health / langfair

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

python ai artificial-intelligence bias fairness ai-safety fairness-testing bias-detection fairness-ai fairness-ml responsible-ai ethical-ai large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Sep 11, 2025
Python

WindVChen / DiffAttack

An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.

ai-safety diffusion-models unrestricted-attacks adverarial-attacks transferable-attacks diffusion-adversarial-attack imperceptible-attacks

Updated Oct 20, 2024
Python

Giskard-AI / awesome-ai-safety

📚 A curated list of papers & technical articles on AI Quality & Safety

Updated Apr 14, 2025

Govcraft / rust-docs-mcp-server

🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.

Updated Jun 23, 2025
Rust

edwinkys / phantasm

Toolkits to create a human-in-the-loop approval layer to monitor and guide AI agents workflow in real-time.

rust open-source monitoring dashboard control-flow human-computer-interaction ai-safety human-in-the-loop ai-agents automation-tools ai-security approval-workflow llm llmops llm-security

Updated Nov 28, 2024
Svelte

tomekkorbak / pretraining-with-human-feedback

Code accompanying the paper Pretraining Language Models with Human Preferences

reinforcement-learning gpt language-models ai-safety ai-alignment pretraining decision-transformers rlhf

Updated Feb 13, 2024
Python

lets-make-safe-ai / make-safe-ai

How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚

ai agi artificial-intelligence artificial-general-intelligence ai-safety ai-alignment

Updated Mar 29, 2023

ryoungj / ToolEmu

[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use

agent language-model ai-safety large-language-models prompt-engineering language-agent

Updated Mar 22, 2024
Python

PKU-Alignment / beavertails

BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).

safety llama gpt datasets language-model beaver ai-safety human-feedback-data llm llms human-feedback rlhf large-language-model safe-rlhf

Updated Oct 27, 2023
Makefile

Improve this page

Add a description, image, and links to the ai-safety topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the ai-safety topic, visit your repo's landing page and select "manage topics."