Description
Dinomaly Integration Task - CVPR 2025 Multi-Class Framework
Implementation Overview
Integration task for Dinomaly, a CVPR 2025 minimalistic Transformer-based framework for multi-class unsupervised anomaly detection. Dinomaly achieves state-of-the-art performance with a "less is more" philosophy using pure Transformer architectures.
Paper Details
- Title: Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection
- Venue: CVPR 2025
- Repository: https://github.com/guojiajeremy/Dinomaly
- ArXiv: https://arxiv.org/abs/2405.14325
- Authors: Jia Guo et al.
Key Features & Technical Innovation
Core Philosophy: "Less Is More"
- Pure Transformer Architecture: Only Attentions and MLPs, no complex modules
- Multi-Class Unified Model: Single model for all categories vs. one-class-one-model
- Minimalistic Design: No specialized tricks or additional components
- Foundation Model Leverage: Uses DINOv2 with registers as backbone
Four Essential Components
- Foundation Transformers: DINOv2 for universal and discriminative features
- Noisy Bottleneck: Pre-existing Dropouts for noise injection
- Linear Attention: Naturally unfocused attention to prevent identity mapping
- Loose Reconstruction: Grouped layers, no point-by-point reconstruction
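Of these, the Noisy Bottleneck is the easiest to make concrete. A minimal sketch, assuming a simple expand-contract MLP (layer widths and dropout placement here are illustrative, not the paper's exact configuration):

import torch.nn as nn

class NoisyBottleneck(nn.Module):
    """MLP bottleneck whose ordinary Dropout layers double as noise injection."""

    def __init__(self, dim: int = 768, dropout_rate: float = 0.1) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout_rate),  # the "noise": random feature dropping
            nn.Linear(dim * 4, dim),
            nn.Dropout(dropout_rate),
        )

    def forward(self, x):
        # Dropout is active only during training, so the decoder learns to
        # restore clean features from corrupted ones.
        return self.mlp(x)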
Implementation Tasks
1. Core Architecture Implementation
- DinomalyModel (src/anomalib/models/image/dinomaly/)
- Pure Transformer encoder-decoder architecture
- DINOv2 foundation model integration
- Linear attention mechanism implementation
- Noisy bottleneck with strategic dropout placement
2. Foundation Model Integration
- DINOv2 Backbone Support

import torch
import torch.nn as nn

class DINOv2Encoder(nn.Module):
    def __init__(self, model_name="dinov2_vitb14", freeze=True):
        super().__init__()
        # Load pre-trained DINOv2 with registers via torch.hub
        self.backbone = torch.hub.load('facebookresearch/dinov2', model_name)
        if freeze:
            self.backbone.requires_grad_(False)
- Multiple DINOv2 Variants
- ViT-S/14 (small)
- ViT-B/14 (base) - default
- ViT-L/14 (large)
- ViT-g/14 (giant)
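These variants map to fixed torch.hub identifiers (register-trained variants append a "_reg" suffix), e.g.:

# torch.hub model names for the DINOv2 variants listed above
DINOV2_VARIANTS = {
    "small": "dinov2_vits14",
    "base": "dinov2_vitb14",   # default
    "large": "dinov2_vitl14",
    "giant": "dinov2_vitg14",
}

encoder = DINOv2Encoder(model_name=DINOV2_VARIANTS["base"])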
3. Attention Mechanisms
- Linear Attention Implementation

import torch.nn as nn

class LinearAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        # Linear attention that naturally cannot focus,
        # preventing identity mapping in reconstruction
- Attention Variants
- Standard Multi-Head Attention (baseline)
- Linear Attention (main contribution)
- Ablation study support for different attention types
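For orientation, a self-contained softmax-free attention sketch in the style of kernelized linear attention. The elu + 1 feature map and the normalization are assumptions; the paper's exact formulation may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionSketch(nn.Module):
    """Kernelized attention without softmax: mixing stays diffuse by
    construction, so the decoder cannot learn a sharp token-to-token
    identity mapping."""

    def __init__(self, dim: int, heads: int = 8) -> None:
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, heads, N, head_dim)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        # Positive feature map in place of softmax
        q, k = F.elu(q) + 1, F.elu(k) + 1
        # Associativity: (Q K^T) V == Q (K^T V), giving O(N) cost
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))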
4. Reconstruction Strategy
- Loose Reconstruction Framework
- Group multiple encoder layers for reconstruction
- Discard well-reconstructed regions during training
- Layer-wise reconstruction loss weighting
- Region-aware loss masking
- Loss Functions

import torch.nn as nn

class LooseReconstructionLoss(nn.Module):
    def __init__(self, layer_groups=3, discard_ratio=0.1):
        super().__init__()
        # Multi-layer grouped reconstruction
        # Dynamic region discarding mechanism
        self.layer_groups = layer_groups
        self.discard_ratio = discard_ratio
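A hedged sketch of the discard mechanism this class would implement, following the bullets above (per-token cosine error; the exact discard rule is an assumption):

import torch
import torch.nn.functional as F

def loose_reconstruction_loss(recon_groups, target_groups, discard_ratio=0.1):
    """recon_groups / target_groups: lists of (B, N, D) features, one per layer group."""
    total = 0.0
    for recon, target in zip(recon_groups, target_groups):
        # Per-token cosine distance as the point-wise reconstruction error
        err = 1.0 - F.cosine_similarity(recon, target, dim=-1)  # (B, N)
        # Discard the best-reconstructed fraction of tokens so the decoder
        # is not pushed all the way to an identity mapping
        keep = max(1, int(err.shape[1] * (1.0 - discard_ratio)))
        hardest, _ = torch.topk(err, keep, dim=1)
        total = total + hardest.mean()
    return total / len(recon_groups)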
5. Multi-Class Training Pipeline
- Unified Training Strategy
- Single model for all categories simultaneously
- Cross-category feature learning
- Balanced sampling across classes
- Multi-class validation metrics
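Balanced sampling across classes can be sketched with a standard weighted sampler (a generic recipe, not Dinomaly-specific):

from collections import Counter

from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(category_labels):
    """Weight each training sample inversely to its category frequency."""
    counts = Counter(category_labels)
    weights = [1.0 / counts[c] for c in category_labels]
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)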
- Training Modes

# Multi-class unified setting (main)
mode: unified
classes: all

# Conventional class-separated (comparison)
mode: separated
classes: per_category
6. Dataset Integration
- Multi-Class Dataset Support
- MVTec-AD (15 classes unified)
- VisA (12 classes unified)
- Real-IAD (30 classes unified)
- Data Processing Pipeline
- Center-crop preprocessing (following PatchCore)
- Multiple input resolution support (224, 256, 512)
- GT mask binarization: gt[gt > 0] = 1 (Dinomaly approach)
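A possible torchvision pipeline matching these bullets; the pre-crop resize value is an assumption:

from torchvision import transforms

# Resize slightly larger, then center-crop, in the spirit of PatchCore preprocessing.
# Note: ViT-/14 backbones typically expect sizes divisible by 14, so the
# effective crop may need adjusting.
preprocess = transforms.Compose([
    transforms.Resize(292),          # assumption: ~1.14x the crop size
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])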
7. Performance Optimization
- Memory Efficiency
- Gradient checkpointing for large models
- Mixed precision training support
- Efficient attention computation
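Gradient checkpointing for the decoder is straightforward in plain PyTorch, assuming the decoder exposes an iterable of blocks:

import torch
from torch.utils.checkpoint import checkpoint

def run_blocks_checkpointed(blocks, x: torch.Tensor) -> torch.Tensor:
    """Recompute each block's activations during backward instead of storing them."""
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x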
- Training Stability
- Loss spike detection and handling
- Robust optimization with multiple seeds
- Learning rate scheduling
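Loss spike handling could be as simple as skipping optimizer steps whose loss far exceeds a running average; a heuristic sketch, not the paper's mechanism:

class LossSpikeGuard:
    """Flag training steps whose loss far exceeds an exponential moving average."""

    def __init__(self, factor=5.0, momentum=0.99):
        self.factor = factor
        self.momentum = momentum
        self.avg = None  # running EMA of the loss

    def is_spike(self, loss_value: float) -> bool:
        if self.avg is None:
            self.avg = loss_value
            return False
        if loss_value > self.factor * self.avg:
            return True  # spike: caller should skip this update
        self.avg = self.momentum * self.avg + (1 - self.momentum) * loss_value
        return False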
Technical Specifications
Model Architecture
import torch.nn as nn

class DinomalyTorchModel(nn.Module):
    def __init__(
        self,
        backbone: str = "dinov2_vitb14",
        attention_type: str = "linear",  # linear, standard
        layer_groups: int = 3,
        dropout_rate: float = 0.1,
        feature_dim: int = 768,
    ):
        super().__init__()
        # DINOv2 encoder (frozen)
        self.encoder = DINOv2Encoder(backbone)
        # Transformer decoder with linear attention
        self.decoder = TransformerDecoder(
            attention_type=attention_type,
            dropout_rate=dropout_rate,
        )
        # Reconstruction head
        self.reconstruction_head = nn.Linear(feature_dim, feature_dim)
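A forward pass would wire these pieces together roughly as follows (a sketch; the ordering of decoder and head, and the omitted layer grouping, are assumptions):

import torch

# Sketch of DinomalyTorchModel.forward (simplified)
def forward(self, images: torch.Tensor):
    # Frozen DINOv2 patch features; no gradients flow into the backbone
    with torch.no_grad():
        features = self.encoder(images)
    # The decoder must rebuild the features through the noisy bottleneck;
    # regions it fails to reconstruct are what gets scored as anomalous
    reconstruction = self.reconstruction_head(self.decoder(features))
    return features, reconstruction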
Configuration Structure
model:
class_path: anomalib.models.Dinomaly
init_args:
backbone: dinov2_vitb14
attention_type: linear
layer_groups: 3
dropout_rate: 0.1
reconstruction_loss_weight: 1.0
data:
class_path: anomalib.data.MVTec
init_args:
mode: unified # unified or separated
center_crop: true
image_size: [256, 256]
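Once integrated, training with this config should reduce to the usual anomalib entry points (sketch assuming the v1 Engine API; the Dinomaly model class is what this task adds):

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import Dinomaly  # hypothetical: the class this task adds

datamodule = MVTec()
model = Dinomaly(backbone="dinov2_vitb14", attention_type="linear")
engine = Engine()
engine.fit(model=model, datamodule=datamodule)
engine.test(model=model, datamodule=datamodule)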
Training Parameters (from paper)
- Optimizer: AdamW
- Learning Rate: 1e-4 with cosine scheduling
- Batch Size: 16-32 (depending on GPU memory)
- Epochs: 100-200 (early stopping based on validation)
- Input Size: 256×256 (default), supports 224-512
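In Lightning terms, these translate to something like the following configure_optimizers (the weight decay and T_max values are assumptions):

import torch

def configure_optimizers(self):
    # AdamW at 1e-4 with cosine annealing, matching the settings above
    optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
    return [optimizer], [scheduler]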
Performance Targets & Validation
Expected Results (CVPR 2025 paper)
- MVTec-AD Unified: AUROC 99.6% (image), 98.2% (pixel)
- VisA Unified: AUROC 98.7% (image), 97.5% (pixel)
- Real-IAD Unified: AUROC 89.3% (image), 85.1% (pixel)
Comparison Metrics
- vs. Class-Separated: Should match or exceed individual model performance
- vs. Multi-Class SOTA: Significant improvement over existing unified methods
- Memory Efficiency: Competitive with single-class methods
Ablation Studies Required
- Foundation model comparison (DINOv2 vs. ImageNet pre-trained)
- Attention mechanism ablation (Linear vs. Standard)
- Reconstruction strategy impact (Loose vs. Strict)
- Input resolution sensitivity analysis
Integration with Anomalib Framework
Lightning Module Integration
class Dinomaly(AnomalyModule):
    def __init__(self, ...):
        super().__init__()
        self.model = DinomalyTorchModel(...)

    def training_step(self, batch, batch_idx):
        # Multi-class unified training
        features = self.model.encode(batch["image"])
        reconstruction = self.model.decode(features)
        loss = self.loose_reconstruction_loss(reconstruction, features)
        return loss

    def validation_step(self, batch, batch_idx):
        # Multi-class evaluation
        anomaly_score = self.model.predict(batch["image"])
        self.log_metrics(anomaly_score, batch["label"])
Code Structure
src/anomalib/models/image/dinomaly/
├── __init__.py
├── lightning_model.py # Main Dinomaly Lightning module
├── torch_model.py # Core PyTorch implementation
├── attention.py # Linear attention mechanisms
├── dinov2_backbone.py # DINOv2 integration
├── loss.py # Loose reconstruction loss
├── anomaly_map.py # Anomaly scoring
└── utils.py # Helper functions
configs/model/
├── dinomaly.yaml # Default unified config
├── dinomaly_separated.yaml # Class-separated comparison
└── dinomaly_large.yaml # Large model variant
tests/unit/models/image/dinomaly/
├── test_lightning_model.py
├── test_torch_model.py
├── test_attention.py
└── test_dinov2_backbone.py
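For anomaly_map.py, one plausible reconstruction-error scoring scheme, consistent with the cosine-based loss above (the function name and square-grid assumption are illustrative):

import torch
import torch.nn.functional as F

def compute_anomaly_map(enc_feats, dec_feats, image_size=(256, 256)):
    """enc_feats / dec_feats: lists of (B, N, D) patch features, one per layer group."""
    maps = []
    for e, d in zip(enc_feats, dec_feats):
        b, n, _ = e.shape
        side = int(n ** 0.5)  # assumes a square patch grid, CLS/register tokens removed
        err = 1.0 - F.cosine_similarity(e, d, dim=-1)  # (B, N) per-patch error
        err = err.view(b, 1, side, side)
        maps.append(F.interpolate(err, size=image_size, mode="bilinear", align_corners=False))
    # Average the per-group maps into a single pixel-level score map
    return torch.stack(maps).mean(dim=0)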
Implementation Priority & Timeline
Phase 1: Core Implementation (High Priority)
- DINOv2 backbone integration
- Linear attention mechanism
- Basic reconstruction framework
- MVTec-AD unified training
Phase 2: Advanced Features (Medium Priority)
- Loose reconstruction optimization
- VisA and Real-IAD support
- Comprehensive ablation studies
- Performance optimization
Phase 3: Enhancement (Low Priority)
- Multiple DINOv2 model variants
- Advanced visualization tools
- Deployment optimization
- Extended documentation
Dependencies & Compatibility
- DINOv2: torch.hub.load('facebookresearch/dinov2', model_name)
- PyTorch: >= 2 (current anomalib requirement)
- Transformers: For attention mechanisms (already in anomalib)
- No Additional Dependencies: Pure PyTorch implementation
Key Implementation Challenges
- Identity Mapping Prevention: Linear attention and loose reconstruction
- Multi-Class Balance: Ensuring fair representation across categories
- Memory Management: DINOv2 models can be large (ViT-g/14 especially)
- Training Stability: Handling loss spikes and optimization challenges
Expected Benefits for Anomalib
- SOTA Multi-Class Performance: Best-in-class unified anomaly detection
- Foundation Model Integration: Modern self-supervised backbone
- Simplified Architecture: Clean Transformer-only design
- Research Impact: CVPR 2025 accepted method with strong results
Next Steps
- Environment Setup: Clone Dinomaly repo and analyze implementation
- DINOv2 Integration: Start with backbone integration and feature extraction
- MVP Implementation: Basic encoder-decoder with linear attention
- Validation: Reproduce paper results on MVTec-AD unified setting
- Full Integration: Complete anomalib framework integration with configs and tests