feat(model): Add Dinomaly Model #2835
Open: rajeshgangireddy wants to merge 39 commits into open-edge-platform:main from rajeshgangireddy:dinomaly_workspace.
+2,605 −0

Changes from all commits (39 commits):
All 39 commits are by rajeshgangireddy:

- `25b1b42` Rebuilt again
- `5c7355b` feat(ViTill): enhance model initialization and validation, improve fe…
- `049c7c5` feat(Dinomaly): enhance model documentation and improve training/vali…
- `d9ddea3` fix block's mem attention giving only one output in return
- `cb85343` feat(Dinomaly): Working model. update model initialization and optimi…
- `f77de9f` Refactor DINOv2 training code: remove deprecated training scripts and…
- `b7760bf` feat(DINOmaly): Start cleaning up and adding doc strings
- `4d2c62e` feat(Dinomaly): start adding doc strings
- `a7990d9` feat(ModelLoader): simplify class design, improve API, and enhance er…
- `fbfe346` refactor: remove model loader test script and improvement summary
- `6684f85` feat(Dinomaly): add StableAdamW optimizer and WarmCosineScheduler cla…
- `b5891ab` feat(Dinomaly): implement WarmCosineScheduler and refactor model load…
- `1c4bfa8` Merge remote-tracking branch 'upstream/main' into dinomaly_workspace
- `510802c` Refactor and optimize code across multiple modules
- `a0003f6` docs: update README and module docstrings for Dinomaly model; improve…
- `b9ac935` Remove files not used by dinov2
- `e442e1b` fix: update import paths for model components and adjust README table…
- `1938628` refactor: remove xFormers dependency checks from attention and block …
- `cc07edd` refactor: remove SwiGLUFFN and related xFormers logic from swiglu_ffn.py
- `5c9c9b9` refactor: remove unused NestedTensorBlock and SwiGLUFFN imports from …
- `d8212ec` refactor: clean up imports and remove unused code in dinov2 components
- `600e8aa` feat: add utility functions for Dinomaly model and benchmark configur…
- `69113ab` feat: implement DinomalyMLP class and update model loader for DINOv2 …
- `f0482da` refactor: replace Mlp with DinomalyMLP in model layers and update ref…
- `6aa9c24` feat: implement global cosine hard mining loss function and refactor …
- `9ee0123` refactor: replace custom DropPath and LayerScale implementations with…
- `1fbc37a` refactor: reorganize Dinomaly model components and update imports for…
- `f95baf5` feat: add layer implementations and training utilities for Dinomaly m…
- `af8511c` refactor: reorganize Dinomaly model components and update imports for…
- `a3391b5` refactor: clean up code formatting and improve import organization ac…
- `f45bfbe` refactor: improve readability by formatting parameters in patch embed…
- `1af6c76` Remove workspace from Git tracking
- `279699b` Refactor Dinomaly model components for improved type safety and error…
- `8c24fc2` fix: update error message for sparse gradients in StableAdamW optimiz…
- `254c2a5` feat: add training utilities and update Dinomaly model for enhanced l…
- `5280841` refactor: standardize weight downloading process and improve cache di…
- `b81b065` refactor: update image transformation methods and enhance training st…
- `cdf8640` refactor: remove example usage from ViTill class docstrings for clarity
- `87927d5` docs: enhance README.md with detailed architecture and key components…
New file (+15 lines): a sample training configuration for the Dinomaly model (the file's path is not shown on this page):

```yaml
model:
  class_path: anomalib.models.Dinomaly
  init_args:
    encoder_name: dinov2reg_vit_base_14
    bottleneck_dropout: 0.2
    decoder_depth: 8

trainer:
  max_steps: 5000
  callbacks:
    - class_path: lightning.pytorch.callbacks.EarlyStopping
      init_args:
        patience: 20
        monitor: image_AUROC
        mode: max
```
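Assuming the file is saved locally, it can be handed to the anomalib CLI directly, e.g. `anomalib train --config dinomaly.yaml` (the filename here is illustrative; the PR page does not show the file's actual path).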
New file (+81 lines): `README.md` for the Dinomaly model:
# Dinomaly: Vision Transformer-based Anomaly Detection with Feature Reconstruction

This is the implementation of the Dinomaly model based on the [original implementation](https://github.com/guojiajeremy/Dinomaly).

Model Type: Segmentation

## Description
Dinomaly is a Vision Transformer-based anomaly detection model that uses an encoder-decoder architecture
for feature reconstruction. The model leverages pre-trained DINOv2 Vision Transformer features and employs
a reconstruction-based approach to detect anomalies by comparing encoder and decoder features.

### Architecture

The Dinomaly model consists of three main components:
1. DINOv2 Encoder: A pre-trained Vision Transformer (ViT) that extracts multi-scale feature maps.
2. Bottleneck MLP: A simple feed-forward network that collects features from the encoder's middle layers
   (e.g., 8 out of 12 layers for ViT-Base).
3. Vision Transformer Decoder: A stack of Transformer layers (typically 8) that learns to reconstruct the
   compressed middle-level features by maximising cosine similarity with the encoder's features.

Only the parameters of the bottleneck MLP and the decoder are trained; the encoder stays frozen.
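To make the data flow concrete, here is a minimal, hedged sketch of the three components above. The class name is illustrative, the encoder is a stand-in for the frozen DINOv2 ViT, and the decoder uses stock softmax-attention Transformer layers rather than the Linear Attention described below; this is not the implementation added by this PR.

```python
import torch
from torch import nn


class DinomalySketch(nn.Module):
    """Encoder -> noisy bottleneck -> decoder, as described above (illustrative)."""

    def __init__(self, dim: int = 768, decoder_depth: int = 8, dropout: float = 0.2) -> None:
        super().__init__()
        # Stand-in for the frozen, pre-trained DINOv2 encoder (ViT-Base width 768).
        self.encoder = nn.Identity()
        # Noisy bottleneck: Dropout injects "pseudo feature anomalies" during training.
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * 4, dim),
        )
        # Stand-in decoder; the real model uses Linear Attention blocks instead.
        self.decoder = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True) for _ in range(decoder_depth)]
        )

    def forward(self, tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # `tokens` stands in for fused middle-layer encoder features of shape (B, N, C).
        encoder_features = self.encoder(tokens)
        x = self.bottleneck(encoder_features)
        for block in self.decoder:
            x = block(x)
        # Training maximises cosine similarity between the two returned tensors;
        # only bottleneck and decoder parameters receive gradients.
        return encoder_features, x


# Example: reconstruct 196 tokens (a 14x14 patch grid) for a batch of 2 images.
model = DinomalySketch()
enc, dec = model(torch.randn(2, 196, 768))
```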

#### Key Components

1. Foundation Transformer Models: Dinomaly leverages pre-trained ViTs (such as DINOv2), which provide universal,
   discriminative features. Building on foundation models enables strong performance across diverse image patterns.
2. Noisy Bottleneck: This component activates the built-in Dropout within the MLP bottleneck.
   By randomly discarding neural activations, Dropout acts as a "pseudo feature anomaly" that forces the decoder
   to restore only normal features. This helps prevent the decoder from becoming too adept at reconstructing
   anomalous patterns it has not been trained on.
3. Linear Attention: The decoder uses Linear Attention instead of traditional Softmax Attention.
   Linear Attention's inherent inability to focus sharply on local regions, a characteristic sometimes seen as a
   "side effect" in supervised tasks, is exploited here: it encourages attention to spread across
   the entire image, reducing the likelihood of the decoder simply forwarding identical information
   from unexpected or anomalous patterns. It also contributes to computational efficiency.
4. Loose Reconstruction:
   1. Loose Constraint: Rather than enforcing rigid layer-to-layer reconstruction, Dinomaly groups multiple
      encoder layers as a whole for reconstruction (e.g., into low-semantic and high-semantic groups).
      This gives the decoder more degrees of freedom, allowing it to behave more distinctly from the
      encoder when encountering unseen patterns.
   2. Loose Loss: The point-by-point reconstruction loss is loosened by employing a hard-mining
      global cosine loss (sketched after this list). This loss detaches the gradients of feature points that are
      already well reconstructed during training, preventing the model from becoming overly proficient at
      reconstructing all features, including those that might correspond to anomalies.
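A hedged sketch of the hard-mining global cosine loss described in 4.2 (this PR exports it as `CosineHardMiningLoss`). The function name, the `detach_fraction` parameter, and the per-scale averaging are illustrative assumptions, not the PR's exact implementation:

```python
import torch
import torch.nn.functional as F


def global_cosine_hard_mining_loss(
    encoder_features: list[torch.Tensor],
    decoder_features: list[torch.Tensor],
    detach_fraction: float = 0.9,
) -> torch.Tensor:
    """Cosine reconstruction loss that detaches well-reconstructed points.

    Each feature tensor is (B, N, C): batch, tokens, channels. The easiest
    `detach_fraction` of points (lowest cosine distance) stop contributing
    gradients, so training keeps focusing on points that are still hard.
    """
    total = torch.zeros((), device=encoder_features[0].device)
    for enc, dec in zip(encoder_features, decoder_features):
        # Point-wise cosine distance between the frozen encoder target and the reconstruction.
        distance = 1 - F.cosine_similarity(enc.detach(), dec, dim=-1)  # (B, N)
        flat = distance.flatten()
        k = max(1, int(flat.numel() * detach_fraction))
        threshold = flat.kthvalue(k).values
        # Gradients flow only through the hard (poorly reconstructed) points.
        hard = (distance >= threshold).float()
        total = total + (distance * hard + distance.detach() * (1 - hard)).mean()
    return total / len(encoder_features)


# Usage with the sketch above: loss = global_cosine_hard_mining_loss([enc], [dec])
```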

### Anomaly Detection

Anomaly detection is performed by computing cosine similarity between encoder and decoder features at multiple scales.
The model generates anomaly maps by analyzing the reconstruction quality of features, where poor reconstruction
indicates anomalous regions. Both anomaly detection (image-level) and localization (pixel-level) are supported.
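A hedged sketch of this scoring scheme, assuming token features of shape `(B, N, C)` on a square patch grid; the function name and the final averaging across scales are illustrative assumptions:

```python
import torch
import torch.nn.functional as F


def compute_anomaly_map(
    encoder_features: list[torch.Tensor],
    decoder_features: list[torch.Tensor],
    image_size: int = 392,
) -> torch.Tensor:
    """Average multi-scale cosine-distance maps into one pixel-level anomaly map."""
    maps = []
    for enc, dec in zip(encoder_features, decoder_features):
        batch, num_tokens, _ = enc.shape
        side = int(num_tokens**0.5)  # patch tokens assumed to form a square grid
        # Low similarity => poor reconstruction => likely anomalous.
        similarity = F.cosine_similarity(enc, dec, dim=-1).reshape(batch, 1, side, side)
        maps.append(
            F.interpolate(1 - similarity, size=image_size, mode="bilinear", align_corners=False)
        )
    return torch.stack(maps).mean(dim=0)  # (B, 1, H, W) pixel-level map
```

An image-level score can then be derived from the map, e.g. its maximum over pixels.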

## Usage

`anomalib train --model Dinomaly --data MVTecAD --data.category <category>`
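The same run can be expressed through the Python API. This is a hedged sketch assuming anomalib's `Engine`/datamodule interface; `MVTecAD` and `Dinomaly` match the CLI flags and the sample config above, while passing `max_steps` through `Engine` is an assumption:

```python
from anomalib.data import MVTecAD
from anomalib.engine import Engine
from anomalib.models import Dinomaly

datamodule = MVTecAD(category="bottle")  # any MVTec AD category
model = Dinomaly()
engine = Engine(max_steps=5000)  # assumed to forward trainer arguments to Lightning

engine.fit(model=model, datamodule=datamodule)   # train
engine.test(model=model, datamodule=datamodule)  # evaluate
```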

## Benchmark

All results gathered with seed `42`. The `max_steps` parameter is set to `5000` for training.

## [MVTec AD Dataset](https://www.mvtec.com/company/research/datasets/mvtec-ad)

### Image-Level AUC

|          |  Avg  | Carpet | Grid  | Leather | Tile  | Wood  | Bottle | Cable | Capsule | Hazelnut | Metal Nut | Pill  | Screw | Toothbrush | Transistor |
| -------- | :---: | :----: | :---: | :-----: | :---: | :---: | :----: | :---: | :-----: | :------: | :-------: | :---: | :---: | :--------: | :--------: |
| Dinomaly | 0.995 | 0.998  | 0.999 |  1.000  | 1.000 | 0.993 | 1.000  | 1.000 |  0.988  |  1.000   |   1.000   | 0.993 | 0.985 |   1.000    |   0.997    |

### Pixel-Level AUC

|          |  Avg  | Carpet | Grid  | Leather | Tile  | Wood  | Bottle | Cable | Capsule | Hazelnut | Metal Nut | Pill  | Screw | Toothbrush | Transistor |
| -------- | :---: | :----: | :---: | :-----: | :---: | :---: | :----: | :---: | :-----: | :------: | :-------: | :---: | :---: | :--------: | :--------: |
| Dinomaly | 0.981 | 0.993  | 0.993 |  0.993  | 0.975 | 0.975 | 0.990  | 0.981 |  0.986  |  0.994   |   0.969   | 0.977 | 0.997 |   0.988    |   0.950    |

### Image F1 Score

|          |  Avg  | Carpet | Grid  | Leather | Tile  | Wood  | Bottle | Cable | Capsule | Hazelnut | Metal Nut | Pill  | Screw | Toothbrush | Transistor |
| -------- | :---: | :----: | :---: | :-----: | :---: | :---: | :----: | :---: | :-----: | :------: | :-------: | :---: | :---: | :--------: | :--------: |
| Dinomaly | 0.985 | 0.983  | 0.991 |  0.995  | 0.994 | 0.975 | 1.000  | 0.995 |  0.982  |  1.000   |   1.000   | 0.986 | 0.957 |   0.983    |   0.976    |
New file (+38 lines): the Dinomaly package `__init__.py` (module `anomalib.models.image.dinomaly`, inferred from its imports):

```python
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

"""Dinomaly: Vision Transformer-based Anomaly Detection with Feature Reconstruction.

The Dinomaly model implements a Vision Transformer encoder-decoder architecture for
anomaly detection using pre-trained DINOv2 features. The model extracts features from
multiple intermediate layers of a DINOv2 encoder, compresses them through a bottleneck
MLP, and reconstructs them using a Vision Transformer decoder.

Anomaly detection is performed by computing cosine similarity between encoder and decoder
features at multiple scales. The model is particularly effective for visual anomaly
detection tasks where the goal is to identify regions or images that deviate from
normal patterns learned during training.

Example:
    >>> from anomalib.models.image import Dinomaly
    >>> model = Dinomaly()

The model can be used with any of the supported datasets and task modes in
anomalib. It leverages the powerful feature representations from DINOv2 Vision
Transformers combined with a reconstruction-based approach for robust anomaly detection.

Notes:
    - Uses DINOv2 Vision Transformer as the backbone encoder
    - Features are extracted from intermediate layers for multi-scale analysis
    - Employs feature reconstruction loss for unsupervised learning
    - Supports both anomaly detection and localization tasks
    - Requires significant GPU memory due to Vision Transformer architecture

See Also:
    :class:`anomalib.models.image.dinomaly.lightning_model.Dinomaly`:
        Lightning implementation of the Dinomaly model.
"""

from anomalib.models.image.dinomaly.lightning_model import Dinomaly

__all__ = ["Dinomaly"]
```
New file (+50 lines): the components package `__init__.py`:

```python
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

"""Components module for Dinomaly model.

This module provides all the necessary components for the Dinomaly Vision Transformer
architecture, including layers, the model loader, training utilities, and the vision
transformer implementation.
"""

# Layer components
from .layers import (
    Attention,
    Block,
    DinomalyMLP,
    LinearAttention,
    MemEffAttention,
)

# Model loader
from .model_loader import DinoV2Loader, load

# Utility functions and classes
from .training_utils import (
    CosineHardMiningLoss,
    StableAdamW,
    WarmCosineScheduler,
)

# Vision transformer components
from .vision_transformer import (
    DinoVisionTransformer,
)

__all__ = [
    # Layers
    "Attention",
    "Block",
    "DinomalyMLP",
    "LinearAttention",
    "MemEffAttention",
    # Model loader
    "DinoV2Loader",
    "load",
    # Utils
    "StableAdamW",
    "WarmCosineScheduler",
    "CosineHardMiningLoss",
    # Vision transformer
    "DinoVisionTransformer",
]
```