A comprehensive implementation of safety techniques for Large Language Models (LLMs), focusing on red teaming, safety classification, and mitigation strategies.
This project explores how Large Language Models can be prompted to generate unsafe, harmful, or biased outputs and implements techniques to detect and mitigate such behaviors. The project is structured in three main phases:
- Red Teaming: Generate adversarial prompts and collect potentially harmful responses
- Safety Classification: Train a classifier to detect unsafe outputs
- Mitigation: Implement and evaluate various safety mitigation techniques
aims_r2_task_llmsafety/
βββ main.py # Main execution script
βββ requirements.txt # Python dependencies
βββ README.md # This file
βββ data/ # Generated datasets and results
βββ models/ # Trained models
βββ results/ # Evaluation results and reports
βββ visualizations/ # Generated plots and charts
βββ src/
βββ red_teaming/
β βββ prompt_generator.py # Adversarial prompt generation
β βββ response_collector.py # LLM response collection
β βββ labeling_system.py # Safety labeling system
βββ safety_filter_classifier/
β βββ classifier.py # Safety classifier model
β βββ training.py # Training pipeline
β βββ evaluation.py # Evaluation utilities
βββ mitigation/
β βββ rejection_sampling.py # Rejection sampling mitigation
β βββ chain_of_thought_moderator.py # CoT-based moderation
β βββ prompt_updater.py # Adversarial prompt transformation
βββ utils/
βββ data_processing.py # Data utilities and processing
- Python 3.8+
- CUDA-compatible GPU (recommended)
- 8GB+ RAM
- Clone the repository:
git clone <repository-url>
cd aims_r2_task_llmsafety
- Install dependencies:
pip install -r requirements.txt
Execute the entire pipeline:
python main.py --phase all --num_prompts 500
Or run individual phases:
# Red teaming only
python main.py --phase red_team --num_prompts 1000
# Training only (requires existing dataset)
python main.py --phase train
# Mitigation testing only (requires trained classifier)
python main.py --phase mitigate
# Generate report only
python main.py --phase report
The red teaming module generates adversarial prompts designed to elicit unsafe responses:
- Bias Prompts: Target demographic stereotyping and unfair generalizations
- Harm Prompts: Seek instructions for harmful or dangerous activities
- Jailbreak Prompts: Attempt to bypass safety guidelines through roleplay or manipulation
Key features:
- 15+ adversarial prompt categories
- Automated response collection from open-source LLMs
- Rule-based and heuristic safety labeling
- Export capabilities for manual annotation
A transformer-based classifier to detect unsafe LLM outputs:
- Architecture: DistilBERT-based classifier with 3 safety categories
- Categories: Safe, Biased/Harmful, Unsafe/Abusive
- Training: Automated training pipeline with validation and early stopping
- Evaluation: Comprehensive metrics including confusion matrix and per-class analysis
Four different mitigation strategies are implemented and evaluated:
- Generates multiple response candidates
- Selects the safest response using the safety classifier
- Adaptive version adjusts generation parameters based on prompt risk
- Step-by-step reasoning about safety concerns
- Explicit assessment of potential harms
- Transparent decision-making process
- Identifies and neutralizes adversarial prompt patterns
- Rule-based transformations for common attack vectors
- Neural paraphrasing for sophisticated prompt rewriting
- Combines multiple mitigation approaches
- Adaptive selection based on prompt characteristics
- Configurable safety thresholds
# Different base models can be used
GENERATION_MODEL = "microsoft/DialoGPT-medium" # or "gpt2", "facebook/blenderbot-400M-distill"
CLASSIFIER_MODEL = "distilbert-base-uncased" # or "roberta-base", "bert-base-uncased"
SAFETY_THRESHOLD = 0.7 # Minimum confidence for safe classification
MAX_GENERATION_ATTEMPTS = 10 # Maximum attempts in rejection sampling
The project generates comprehensive evaluation reports including:
- Dataset Statistics: Distribution of prompt categories and safety labels
- Classifier Performance: Accuracy, precision, recall, F1 scores per class
- Mitigation Effectiveness: Success rates and safety improvements for each technique
- Comparative Analysis: Side-by-side evaluation of different mitigation strategies
Results are saved in multiple formats:
- JSON files for programmatic analysis
- Visualizations (confusion matrices, training curves, distribution plots)
- Markdown reports for human review
This project is designed for research purposes to improve LLM safety. Key safety measures:
- Responsible Disclosure: Adversarial prompts use placeholders and safe substitutions
- No Real Harm: All examples avoid actual harmful content
- Educational Focus: Emphasis on understanding and prevention
- Ethical Guidelines: Following responsible AI research practices
This codebase supports research in:
- Adversarial ML: Understanding prompt-based attacks on LLMs
- Safety Alignment: Developing robust safety measures for AI systems
- Red Teaming: Systematic vulnerability assessment
- Mitigation Strategies: Comparative evaluation of safety techniques
from src.red_teaming.prompt_generator import AdversarialPromptGenerator
from src.red_teaming.response_collector import LLMResponseCollector
generator = AdversarialPromptGenerator()
prompts = generator.generate_bias_prompts(num_prompts=100)
collector = LLMResponseCollector()
responses = collector.collect_responses(prompts)
from src.safety_filter_classifier.classifier import SafetyClassifier
classifier = SafetyClassifier.load_model('models/safety_classifier.pth')
label, confidence = classifier.predict("Your prompt here")
from src.mitigation.rejection_sampling import RejectionSampler
sampler = RejectionSampler(safety_classifier_path='models/safety_classifier.pth')
result = sampler.rejection_sample("Potentially harmful prompt")
safe_response = result['selected_response']
- Hugging Face Transformers library
- OpenAI's work on AI safety
- Research community contributions to responsible AI