ViT-KNN: Semi-Supervised Pseudo-Labeling with Vision Transformers and KNN

This repository contains the codebase developed by the CUDA_Libre team for the Neural Wave Hackathon 2024, where our solution earned 1st place. The project automates the verification of steel bar alignment in a rolling mill using state-of-the-art Computer Vision models, combining semi-supervised Vision Transformers (ViT) and KNN-based pseudo-labeling. By enhancing operational efficiency and reducing human error, this system offers a scalable solution to modernize steel bar manufacturing processes.

Problem Context

Fig. 1 depicts a sequence of steel bars moving towards a stopper on a rolling table. The goal is to assess whether the bars are properly aligned. Currently, this alignment check is performed manually by human operators who rely solely on visual inspection of real-time images. Determining alignment can be challenging due to uncertainties caused by various factors, including perspective distortions, vibrations, shadows, and inconsistent lighting conditions.
Manual inspection of steel bar alignment is a labor-intensive task that can lead to errors due to operator fatigue. Our solution automates this verification, allowing plant operators to focus on more critical aspects of the production process. The workflow of our approach can be divided into two key stages:

  1. Semi-Supervised Labeling Workflow
  2. Model Training and Inference

Steel bar alignment process

Fig. 1 Sample images showing a sequence of aligned and misaligned bars on a rolling table approaching the stopper.

Data Labeling Pipeline

Here is a diagram illustrating the data labeling workflow, which integrates human labeling and pseudo-labeling by leveraging DINOv2 model embeddings and KNN label assignment through similarity search.

Data Labeling Workflow

Methodology

1. DINOv2 KNN-based Pseudo-Labeling Workflow

Given the large, mostly unlabeled dataset of 15,630 images, we adopted an efficient labeling approach that combines human labeling and pseudo-labeling.

  • Human Labeling: We manually labeled an initial subset of 5,000 images, creating a foundation of reliable training and test data.

  • DINOv2 for Embeddings: We used DINOv2, a self-supervised vision transformer model, to generate high-dimensional embeddings of the images. These embeddings capture rich semantic features without requiring any fine-tuning, making it possible to measure image similarity effectively.

  • K-Nearest Neighbors (KNN) with FAISS: We used FAISS for fast, scalable similarity searches within the embedding space. For each unlabeled image, we identified its K-nearest neighbors and assigned a label based on a majority vote of their known labels, taken from the manually labeled dataset.

  • Cosine Similarity: To ensure robust label assignment, we employed cosine similarity to compare image features and to compute "distances" in the KNN embedding space, using the following similarity function $m$:

$$m(s, r) = \text{cosine-similarity} (f(s), f(r)) = \frac{f(s) \cdot f(r)}{\|f(s)\|_2 \|f(r)\|_2}$$

where $s$ and $r$ are a pair of images to compare and $f$ is the model generating the features. This method enabled us to expand the labeled dataset efficiently without manual effort for each image. A minimal sketch of this pipeline is shown below; to run the pseudo-labeling, check out the documentation: DINOv2 KNN-based Pseudo-Labeling.
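The snippet below is a minimal sketch of this pseudo-labeling step, assuming the `dinov2_vits14` backbone from torch.hub and FAISS for the similarity search. File paths, function names, batch sizes, and the value of K are illustrative, not the exact implementation in this repository.

```python
# Illustrative sketch of the DINOv2 + FAISS pseudo-labeling step.
# Paths, batch sizes, and K are hypothetical; the repository's own script may differ.
import numpy as np
import torch
import faiss
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# DINOv2 backbone from torch.hub; its forward pass returns the CLS-token embedding f(x).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths, batch_size=64):
    """Compute float32 DINOv2 embeddings for a list of image paths."""
    feats = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in paths[i:i + batch_size]])
        feats.append(model(batch.to(device)).cpu().numpy())
    return np.concatenate(feats).astype("float32")

def pseudo_label(labeled_paths, labels, unlabeled_paths, k=5):
    """Assign each unlabeled image the majority label of its k nearest labeled
    neighbors under cosine similarity (inner product on L2-normalized embeddings)."""
    ref, query = embed(labeled_paths), embed(unlabeled_paths)
    faiss.normalize_L2(ref)
    faiss.normalize_L2(query)
    index = faiss.IndexFlatIP(ref.shape[1])  # inner product == cosine similarity here
    index.add(ref)
    _, neighbors = index.search(query, k)    # k nearest labeled images per query
    labels = np.asarray(labels)
    return (labels[neighbors].mean(axis=1) >= 0.5).astype(int)  # binary majority vote
```

Normalizing the embeddings and using an inner-product index makes the FAISS search equivalent to ranking by the cosine-similarity function $m$ defined above.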

2. Model Training, Inference and Results

The expanded dataset was used to train an EfficientNet-B0 model, chosen for its balance of accuracy and computational efficiency. We trained EfficientNet-B0 starting from the original pretrained weights, adapting the classification layer for binary classification of the alignment status (see the sketch below). EfficientNet-B0 was also compared against the MobileNetV2 model.
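For reference, here is a minimal sketch of how the classification head can be adapted with torchvision; the two-logit head and the ImageNet-pretrained weights are assumptions, not necessarily the exact setup used in train.py.

```python
# Minimal sketch: EfficientNet-B0 with its classifier adapted for binary alignment
# classification. The two-logit head and ImageNet weights are assumptions.
import torch.nn as nn
from torchvision import models

def build_model(num_classes: int = 2) -> nn.Module:
    # Start from the published (ImageNet-pretrained) EfficientNet-B0 weights.
    model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
    # Replace only the final linear layer; the backbone is kept and fine-tuned.
    in_features = model.classifier[1].in_features
    model.classifier[1] = nn.Linear(in_features, num_classes)
    return model
```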

  • Training Details: The model was trained for 30 epochs, with the peak validation performance observed at epoch 10. Key performance metrics included:

    • Accuracy: 93.40%
    • Precision: 94.37%
    • Recall: 95.82%
    • F1 Score: 95.09%
  • The model demonstrated reliable classification capabilities, with a mean inference time on the test set of 0.0298 seconds per image, meeting the real-time requirement of under 0.5 seconds per image (see the table and timing sketch below).

| Inference Time Statistic | Time (seconds) |
|--------------------------|----------------|
| Mean Time                | 0.0298         |
| 25th Percentile          | 0.0111         |
| Median (50th Percentile) | 0.0117         |
| 75th Percentile          | 0.0128         |
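A hypothetical snippet like the following can produce this kind of per-image latency summary; the data-loader interface and the batch size of 1 are assumptions, not necessarily how test.py measures it.

```python
# Hypothetical per-image latency measurement (batch_size=1), not the exact test.py logic.
import time
import numpy as np
import torch

@torch.no_grad()
def latency_stats(model, loader, device="cuda"):
    model.eval().to(device)
    times = []
    for image, _ in loader:                      # loader assumed to yield (image, label)
        image = image.to(device)
        if device == "cuda":
            torch.cuda.synchronize()             # exclude queued GPU work from the timing
        start = time.perf_counter()
        model(image)
        if device == "cuda":
            torch.cuda.synchronize()             # wait for the forward pass to finish
        times.append(time.perf_counter() - start)
    times = np.asarray(times)
    return {
        "mean": times.mean(),
        "p25": np.percentile(times, 25),
        "median": np.percentile(times, 50),
        "p75": np.percentile(times, 75),
    }
```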

Run the Code

Installation

Install the required packages with:

pip install -r requirements.txt

Training

To perform the pseudo-labeling, check out the documentation: DINOv2 KNN-based Pseudo-Labeling. To train the EfficientNet-B0 model, run the training script:

python train.py \
    --data_config_path "dataset/augmented_split.json" \
    --batch_size 32 \
    --num_epochs 30 \
    --learning_rate 0.0001 \
    --checkpoint_path "checkpoints/efficient_net"

Testing

Evaluate the model performance on the test set using:

python test.py \
    --data_config_path "dataset/split.json" \
    --batch_size 16 \
    --model_path "checkpoints/efficient_net/20241027_083453/model_epoch_10.pt"

License

This project is licensed under the MIT License - see the LICENSE file for details.
