📋 [TASK] Implement Dinomaly - CVPR 2025 #2782

@samet-akcay

Description

Dinomaly Integration Task - CVPR 2025 Multi-Class Framework

Implementation Overview

Integration task for Dinomaly, a CVPR 2025 minimalistic Transformer-based framework for multi-class unsupervised anomaly detection. Dinomaly achieves state-of-the-art performance with a "less is more" philosophy using pure Transformer architectures.

Paper Details

  • Title: Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection (CVPR 2025)

Key Features & Technical Innovation

Core Philosophy: "Less Is More"

  • Pure Transformer Architecture: only attention and MLP blocks, no complex modules
  • Multi-Class Unified Model: Single model for all categories vs. one-class-one-model
  • Minimalistic Design: No specialized tricks or additional components
  • Foundation Model Leverage: Uses DINOv2 with registers as backbone

Four Essential Components

  1. Foundation Transformers: DINOv2 for universal and discriminative features
  2. Noisy Bottleneck: reuses pre-existing Dropout layers for noise injection
  3. Linear Attention: Naturally unfocused attention to prevent identity mapping
  4. Loose Reconstruction: Grouped layers, no point-by-point reconstruction

Implementation Tasks

1. Core Architecture Implementation

  • DinomalyModel (src/anomalib/models/image/dinomaly/)
    • Pure Transformer encoder-decoder architecture
    • DINOv2 foundation model integration
    • Linear attention mechanism implementation
    • Noisy bottleneck with strategic dropout placement

2. Foundation Model Integration

  • DINOv2 Backbone Support

    import torch
    from torch import nn

    class DINOv2Encoder(nn.Module):
        def __init__(self, model_name: str = "dinov2_vitb14", freeze: bool = True) -> None:
            super().__init__()
            # Load pre-trained DINOv2 from torch.hub; the register variants used in
            # the paper carry a "_reg" suffix (e.g. "dinov2_vitb14_reg").
            self.backbone = torch.hub.load('facebookresearch/dinov2', model_name)
            if freeze:
                self.backbone.requires_grad_(False)
  • Multiple DINOv2 Variants

    • ViT-S/14 (small)
    • ViT-B/14 (base) - default
    • ViT-L/14 (large)
    • ViT-g/14 (giant)
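
As a quick sanity check of the backbone API, the sketch below extracts patch-token features from a frozen DINOv2 model loaded via torch.hub. The layer count, batch size, and input resolution are illustrative, not values from the paper.

    import torch

    # Frozen DINOv2 backbone from torch.hub (register variants use the "_reg" suffix).
    backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
    backbone.eval().requires_grad_(False)

    # Spatial size must be a multiple of the 14-pixel patch size (e.g. 224, 448).
    images = torch.randn(2, 3, 224, 224)

    with torch.no_grad():
        # Patch-token features of the last 4 blocks, reshaped to (B, C, H/14, W/14).
        features = backbone.get_intermediate_layers(images, n=4, reshape=True)

    print([tuple(f.shape) for f in features])  # 4 x (2, 768, 16, 16) for ViT-B/14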

3. Attention Mechanisms

  • Linear Attention Implementation

    class LinearAttention(nn.Module):
        def __init__(self, dim: int, heads: int = 8) -> None:
            super().__init__()
            # Linear attention that naturally cannot focus on individual tokens,
            # preventing identity mapping in reconstruction (runnable sketch below).
            ...
  • Attention Variants

    • Standard Multi-Head Attention (baseline)
    • Linear Attention (main contribution)
    • Ablation study support for different attention types
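
To make the "naturally unfocused" behaviour concrete, the sketch below is a minimal, runnable linear attention module using the common elu(x) + 1 kernel feature map. It illustrates the general technique and is not claimed to match Dinomaly's exact formulation; the dimensions and the 1e-6 stabiliser are illustrative choices.

    import torch
    from torch import nn
    from torch.nn import functional as F


    class LinearAttention(nn.Module):
        def __init__(self, dim: int, heads: int = 8) -> None:
            super().__init__()
            self.heads = heads
            self.head_dim = dim // heads
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, n, _ = x.shape
            qkv = self.qkv(x).reshape(b, n, 3, self.heads, self.head_dim)
            q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)

            # Kernel feature map instead of softmax: attention weights can never
            # collapse onto single tokens, which discourages identity shortcuts.
            q, k = F.elu(q) + 1.0, F.elu(k) + 1.0

            kv = torch.einsum("bhnd,bhne->bhde", k, v)
            z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
            out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
            out = out.transpose(1, 2).reshape(b, n, self.heads * self.head_dim)
            return self.proj(out)

Because no softmax is applied, the mixing weights stay diffuse across tokens, which is exactly the property the reconstruction decoder needs to avoid trivially copying its input.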

4. Reconstruction Strategy

  • Loose Reconstruction Framework

    • Group multiple encoder layers for reconstruction
    • Discard well-reconstructed regions during training
    • Layer-wise reconstruction loss weighting
    • Region-aware loss masking
  • Loss Functions

    class LooseReconstructionLoss(nn.Module):
        def __init__(self, layer_groups: int = 3, discard_ratio: float = 0.1) -> None:
            super().__init__()
            # Multi-layer grouped reconstruction with dynamic discarding of
            # well-reconstructed regions (runnable sketch below).
            self.layer_groups = layer_groups
            self.discard_ratio = discard_ratio
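
A minimal sketch of what such a loss could look like, assuming per-location cosine distance between decoder outputs and frozen encoder features, with the best-reconstructed fraction of locations discarded. The exact grouping and discard schedule used in the paper may differ.

    import torch
    from torch import nn
    from torch.nn import functional as F


    class LooseReconstructionLoss(nn.Module):
        """Grouped cosine reconstruction loss that ignores easy regions (sketch)."""

        def __init__(self, discard_ratio: float = 0.1) -> None:
            super().__init__()
            self.discard_ratio = discard_ratio

        def forward(self, recon: list[torch.Tensor], target: list[torch.Tensor]) -> torch.Tensor:
            # `recon` / `target` are lists of grouped feature maps, each (B, C, H, W).
            losses = []
            for rec, tgt in zip(recon, target):
                # Per-location cosine distance to the frozen encoder features.
                dist = 1.0 - F.cosine_similarity(rec, tgt.detach(), dim=1)  # (B, H, W)
                dist = dist.flatten(1)                                      # (B, H*W)
                # Keep only the hardest locations; the best-reconstructed ones are
                # discarded so supervision stays "loose" rather than point-by-point.
                keep = int(dist.shape[1] * (1.0 - self.discard_ratio))
                hard, _ = torch.topk(dist, k=keep, dim=1)
                losses.append(hard.mean())
            return torch.stack(losses).mean()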

5. Multi-Class Training Pipeline

  • Unified Training Strategy

    • Single model for all categories simultaneously
    • Cross-category feature learning
    • Balanced sampling across classes (see the sampler sketch after this list)
    • Multi-class validation metrics
  • Training Modes

    # Multi-class unified setting (main)
    mode: unified
    classes: all
    
    # Conventional class-separated (comparison)
    mode: separated  
    classes: per_category
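
One way to realise the balanced-sampling point above is a class-weighted sampler. The snippet is a generic PyTorch sketch; the `labels` tensor of per-image category indices is an assumed input, not part of anomalib's current API.

    import torch
    from torch.utils.data import WeightedRandomSampler

    # One category index per training image (hypothetical example input).
    labels = torch.tensor([0, 0, 0, 1, 1, 2])

    # Inverse-frequency weights give every category the same expected share per epoch.
    counts = torch.bincount(labels).float()
    weights = (1.0 / counts)[labels]

    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    # Pass `sampler=sampler` to the unified multi-class DataLoader.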

6. Dataset Integration

  • Multi-Class Dataset Support

    • MVTec-AD (15 classes unified)
    • VisA (12 classes unified)
    • Real-IAD (30 classes unified)
  • Data Processing Pipeline

    • Center-crop preprocessing (following PatchCore)
    • Multiple input resolution support (224, 256, 512)
    • GT mask binarization: gt[gt>0]=1 (Dinomaly approach)
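
The preprocessing bullets translate to only a few lines; the torchvision sketch below shows a resize + center-crop pipeline and the gt[gt>0]=1 mask binarization. The concrete sizes are illustrative.

    import torch
    from torchvision import transforms

    # Resize followed by center-crop, in the spirit of the PatchCore-style preprocessing above.
    image_transform = transforms.Compose([
        transforms.Resize(292),
        transforms.CenterCrop(256),
        transforms.ToTensor(),
    ])

    # Ground-truth mask binarization: any positive value counts as anomalous.
    def binarize_mask(gt: torch.Tensor) -> torch.Tensor:
        return (gt > 0).to(gt.dtype)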

7. Performance Optimization

  • Memory Efficiency

    • Gradient checkpointing for large models
    • Mixed precision training support
    • Efficient attention computation
  • Training Stability

    • Loss spike detection and handling (see the sketch after this list)
    • Robust optimization with multiple seeds
    • Learning rate scheduling
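
As a sketch of the loss-spike handling mentioned above, one simple option is to skip optimiser updates whose loss is far above a recent running mean. The window and factor below are illustrative choices, not values from the paper.

    from collections import deque


    class LossSpikeGuard:
        """Skip updates whose loss exceeds `factor` x the recent running mean (sketch)."""

        def __init__(self, window: int = 100, factor: float = 5.0) -> None:
            self.history: deque[float] = deque(maxlen=window)
            self.factor = factor

        def should_skip(self, loss: float) -> bool:
            is_spike = (
                len(self.history) == self.history.maxlen
                and loss > self.factor * (sum(self.history) / len(self.history))
            )
            if not is_spike:
                self.history.append(loss)
            return is_spike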

Technical Specifications

Model Architecture

from torch import nn


class DinomalyTorchModel(nn.Module):
    def __init__(
        self,
        backbone: str = "dinov2_vitb14",
        attention_type: str = "linear",  # "linear" or "standard"
        layer_groups: int = 3,
        dropout_rate: float = 0.1,
        feature_dim: int = 768,
    ) -> None:
        super().__init__()

        # DINOv2 encoder (frozen)
        self.encoder = DINOv2Encoder(backbone)

        # Transformer decoder with linear attention and noisy-bottleneck dropout
        self.decoder = TransformerDecoder(
            attention_type=attention_type,
            dropout_rate=dropout_rate,
        )

        # Reconstruction head
        self.reconstruction_head = nn.Linear(feature_dim, feature_dim)
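
At inference time, models in this family typically score anomalies by the per-location discrepancy between encoder and decoder features, upsampled to image resolution. The helper below follows that generic recipe and is not claimed to match Dinomaly's exact scoring; it is the kind of logic anomaly_map.py in the layout further below would hold.

    import torch
    from torch.nn import functional as F


    def compute_anomaly_map(
        encoder_feats: list[torch.Tensor],
        decoder_feats: list[torch.Tensor],
        image_size: tuple[int, int] = (256, 256),
    ) -> torch.Tensor:
        """Average per-location cosine distance across feature groups (sketch)."""
        maps = []
        for enc, dec in zip(encoder_feats, decoder_feats):
            # (B, H, W) cosine distance between corresponding feature maps.
            dist = 1.0 - F.cosine_similarity(enc, dec, dim=1)
            maps.append(F.interpolate(dist.unsqueeze(1), size=image_size, mode="bilinear"))
        return torch.stack(maps).mean(dim=0).squeeze(1)  # (B, H_img, W_img)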

Configuration Structure

model:
  class_path: anomalib.models.Dinomaly
  init_args:
    backbone: dinov2_vitb14
    attention_type: linear
    layer_groups: 3
    dropout_rate: 0.1
    reconstruction_loss_weight: 1.0
    
data:
  class_path: anomalib.data.MVTec
  init_args:
    mode: unified  # unified or separated
    center_crop: true
    image_size: [256, 256]

Training Parameters (from paper)

  • Optimizer: AdamW
  • Learning Rate: 1e-4 with cosine scheduling
  • Batch Size: 16-32 (depending on GPU memory)
  • Epochs: 100-200 (early stopping based on validation)
  • Input Size: 256×256 (default), supports 224-512
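
Mapped onto plain PyTorch, those parameters look roughly as follows; the epoch count is a placeholder within the range listed, and the linear stand-in only exists to make the snippet runnable.

    import torch
    from torch import nn

    model = nn.Linear(768, 768)  # stand-in for the trainable Dinomaly decoder parameters

    max_epochs = 160  # placeholder within the 100-200 range above
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

    for epoch in range(max_epochs):
        # ... forward pass, loss, backward pass, optimizer.step() per batch ...
        scheduler.step()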

Performance Targets & Validation

Expected Results (CVPR 2025 paper)

  • MVTec-AD Unified: AUROC 99.6% (image), 98.2% (pixel)
  • VisA Unified: AUROC 98.7% (image), 97.5% (pixel)
  • Real-IAD Unified: AUROC 89.3% (image), 85.1% (pixel)

Comparison Metrics

  • vs. Class-Separated: Should match or exceed individual model performance
  • vs. Multi-Class SOTA: Significant improvement over existing unified methods
  • Memory Efficiency: Competitive with single-class methods

Ablation Studies Required

  • Foundation model comparison (DINOv2 vs. ImageNet pre-trained)
  • Attention mechanism ablation (Linear vs. Standard)
  • Reconstruction strategy impact (Loose vs. Strict)
  • Input resolution sensitivity analysis

Integration with Anomalib Framework

Lightning Module Integration

class Dinomaly(AnomalyModule):
    def __init__(self, ...):
        super().__init__()
        self.model = DinomalyTorchModel(...)
        
    def training_step(self, batch, batch_idx):
        # Multi-class unified training
        features = self.model.encode(batch["image"])
        reconstruction = self.model.decode(features)
        loss = self.loose_reconstruction_loss(reconstruction, features)
        return loss
        
    def validation_step(self, batch, batch_idx):
        # Multi-class evaluation
        anomaly_score = self.model.predict(batch["image"])
        self.log_metrics(anomaly_score, batch["label"])
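
Once the module is exported, training through anomalib's standard entry points would look roughly like the snippet below. The Dinomaly import and its constructor arguments are assumptions until this task lands, and the example uses the existing single-category MVTec datamodule because the unified multi-class mode does not exist in anomalib yet.

    from anomalib.data import MVTec
    from anomalib.engine import Engine
    from anomalib.models import Dinomaly  # hypothetical export added by this task

    datamodule = MVTec(category="bottle")
    model = Dinomaly(backbone="dinov2_vitb14", attention_type="linear")
    engine = Engine()

    engine.fit(model=model, datamodule=datamodule)
    engine.test(model=model, datamodule=datamodule)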

Code Structure

src/anomalib/models/image/dinomaly/
├── __init__.py
├── lightning_model.py      # Main Dinomaly Lightning module
├── torch_model.py          # Core PyTorch implementation  
├── attention.py            # Linear attention mechanisms
├── dinov2_backbone.py      # DINOv2 integration
├── loss.py                 # Loose reconstruction loss
├── anomaly_map.py          # Anomaly scoring
└── utils.py                # Helper functions

configs/model/
├── dinomaly.yaml           # Default unified config
├── dinomaly_separated.yaml # Class-separated comparison
└── dinomaly_large.yaml     # Large model variant

tests/unit/models/image/dinomaly/
├── test_lightning_model.py
├── test_torch_model.py
├── test_attention.py
└── test_dinov2_backbone.py

Implementation Priority & Timeline

Phase 1: Core Implementation (High Priority)

  • DINOv2 backbone integration
  • Linear attention mechanism
  • Basic reconstruction framework
  • MVTec-AD unified training

Phase 2: Advanced Features (Medium Priority)

  • Loose reconstruction optimization
  • VisA and Real-IAD support
  • Comprehensive ablation studies
  • Performance optimization

Phase 3: Enhancement (Low Priority)

  • Multiple DINOv2 model variants
  • Advanced visualization tools
  • Deployment optimization
  • Extended documentation

Dependencies & Compatibility

  • DINOv2: torch.hub.load('facebookresearch/dinov2', model_name)
  • PyTorch: >= 2.0 (current anomalib requirement)
  • Transformers: For attention mechanisms (already in anomalib)
  • No Additional Dependencies: Pure PyTorch implementation

Key Implementation Challenges

  1. Identity Mapping Prevention: Linear attention and loose reconstruction
  2. Multi-Class Balance: Ensuring fair representation across categories
  3. Memory Management: DINOv2 models can be large (ViT-g/14 especially)
  4. Training Stability: Handling loss spikes and optimization challenges

Expected Benefits for Anomalib

  • SOTA Multi-Class Performance: Best-in-class unified anomaly detection
  • Foundation Model Integration: Modern self-supervised backbone
  • Simplified Architecture: Clean Transformer-only design
  • Research Impact: CVPR 2025 accepted method with strong results

Next Steps

  1. Environment Setup: Clone Dinomaly repo and analyze implementation
  2. DINOv2 Integration: Start with backbone integration and feature extraction
  3. MVP Implementation: Basic encoder-decoder with linear attention
  4. Validation: Reproduce paper results on MVTec-AD unified setting
  5. Full Integration: Complete anomalib framework integration with configs and tests
