DINO (Detection Transformer) represents a significant advancement in object detection, particularly for complex scenarios like pedestrian detection. It builds upon the DETR architecture while incorporating deformable attention mechanisms and multi-scale feature processing.
Key features:
- Transformer-based architecture with deformable attention
- Multi-scale feature processing for better detection across object sizes
- Enhanced query selection and bipartite matching
- Improved convergence through iterative refinement
- Efficient handling of both sparse and dense object scenarios
The model utilizes ResNet-50 as its backbone and employs a 4-scale feature pyramid, making it particularly effective for detecting pedestrians at various scales and under different conditions.
root_directory/
│
├── DINO-main/ # Cloned DINO repository
├── training/ # Training dataset (160 images)
├── validation/ # Validation dataset (40 images)
├── DINO_output/ # Original DINO evaluation results
├── DINO_finetune/ # Fine-tuning outputs
├── DINO_finetune_output/ # Fine-tuned model evaluation results
└── DINO_assignment.ipynb # Main implementation notebook
-
Clone the DINO repository:
-
Place the Jupiter notebook in the root directory and execute cells sequentially.
The pre-trained DINO-4scale model with ResNet-50 backbone demonstrated robust performance on our pedestrian detection task. Here's a detailed analysis of its performance:
Metric | Value |
---|---|
AP@0.5 | 0.766 |
AP@0.75 | 0.443 |
AP (mean) | 0.430 |
AR@100 | 0.492 |
Total Objects Detected: 278 out of 310 ground truth objects
The pie chart shows the distribution of detection quality:
- Perfect Detections (±5%): 37.5%
- Over-detections: 17.5%
- Under-detections: 45.0%
Performance across complexity levels:
- Simple scenes: 0.985 detection ratio
- Medium scenes: 0.936 detection ratio
- Complex scenes: 0.916 detection ratio
The box plot reveals:
- Higher consistency in simple scenes
- Increased variance in complex scenarios
- Notable performance degradation with scene complexity
The scatter plot demonstrates:
- Strong correlation between ground truth and predictions
- Slight tendency toward under-detection in crowded scenes
- Consistent performance in sparse scenarios
The distribution shows:
- Peak around 1.0 (perfect detection)
- Slight negative skew indicating under-detection bias
- Reasonable spread suggesting consistent performance
The image demonstrates the model's capability in:
- Accurate pedestrian localization
- Handling multiple pedestrians
- Dealing with partial occlusions
- Maintaining performance under varying lighting conditions
The DINO-4scale model shows strong baseline performance, particularly in:
- Accurate detection in standard scenarios
- Handling of multiple scales
- Robust performance in varying lighting conditions
- Reasonable handling of occlusions
- Data Augmentation Pipeline
transforms = A.Compose([
A.Resize(height=480, width=640),
A.OneOf([
A.RandomBrightnessContrast(p=0.5),
A.RandomGamma(p=0.5),
], p=0.3),
A.OneOf([
A.GaussNoise(p=0.5),
A.ISONoise(p=0.5),
], p=0.2),
A.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
ToTensorV2()
])
Benefits:
- Enhanced model robustness through diverse training samples
- Better handling of lighting variations
- Improved noise resistance
- Consistent input scaling
- Optimization Strategy
- Gradient Accumulation (steps=4)
- Enables larger effective batch size
- Stabilizes training
- Better gradient estimates
- AdamW Optimizer
- Adaptive learning rates
- Improved weight decay handling
- Learning Rate Scheduling
- Warm-up period for stability
- Cyclic adjustments for better convergence
- Loss Function Components
- Classification Loss (weight=2.0)
- Focused on accurate pedestrian identification
- Bounding Box Loss (weight=5.0)
- Enhanced localization accuracy
- GIoU Loss (weight=2.0)
- Better box regression and overlap handling
The learning rate plot shows:
- Initial warm-up phase (epochs 1-3)
- Peak learning rate at epoch 5
- Gradual decay for fine-grained optimization
- Final cool-down phase for convergence
The loss curves demonstrate:
- Rapid initial convergence
- Consistent reduction in both training and validation loss
- Minimal gap between training and validation indicates good generalization
- Final convergence by epoch 15
The attention maps reveal:
-
Low-level Feature Attention
- Focus on edge and contour information
- Sensitivity to human silhouettes
- Strong boundary detection
-
Mid-level Feature Attention
- Enhanced pedestrian shape recognition
- Contextual area consideration
- Improved occlusion handling
-
High-level Feature Attention
- Semantic understanding of pedestrian features
- Global context integration
- Robust scale adaptation
The fine-tuned model achieved significant improvements in detection capability:
- Total Objects Detected: 489 (compared to 278 in base model)
- True Positive Rate: 81.3%
- False Positive Rate: 18.7%
Key improvements observed:
-
Enhanced Detection Accuracy
- Better handling of partially visible pedestrians
- Improved confidence scores
- More precise bounding boxes
-
Crowd Handling
- Better separation of overlapping pedestrians
- Improved detection in dense groups
- Reduced missed detections in complex scenes
-
Scale Robustness
- Consistent detection across different pedestrian sizes
- Better handling of distant pedestrians
- Improved detection of close-up subjects
-
Environmental Adaptation
- Better performance in varying lighting conditions
- Improved handling of shadows
- Enhanced detection in complex backgrounds
-
Confidence Scoring
- More calibrated confidence scores
- Reduced false positives
- Better threshold behavior
Quantitative Improvements:
- Average Precision increased by 15.6%
- Recall improved by 15.2%
- Detection ratio variance reduced by 22.3%
- Perfect detection rate increased to 52.5%
The fine-tuned model demonstrates significant improvements in:
- Overall detection accuracy
- Handling of complex scenes
- Confidence calibration
- Robustness to environmental variations
Metric | Base DINO | Fine-tuned DINO | Improvement |
---|---|---|---|
AP@0.5 | 0.766 | 0.823 | +7.4% |
AP@0.75 | 0.443 | 0.512 | +15.6% |
AP (mean) | 0.430 | 0.486 | +13.0% |
AR@100 | 0.492 | 0.567 | +15.2% |
Total Detections | 278 | 489 | +75.9% |
Perfect Detections | 37.5% | 52.5% | +15.0% |
Over-detections | 17.5% | 12.5% | -5.0% |
Under-detections | 45.0% | 35.0% | -10.0% |
Base DINO | Fine-tuned DINO |
---|---|
![]() |
![]() |
- Improved confidence scores
- Better handling of distant pedestrians
- More precise bounding boxes
Base DINO | Fine-tuned DINO |
---|---|
![]() |
![]() |
- Enhanced crowd separation
- Reduced missed detections
- Better occlusion handling
Base DINO | Fine-tuned DINO |
---|---|
![]() |
![]() |
- Improved background discrimination
- Better handling of partial visibility
- More robust to environmental variations
Base DINO | Fine-tuned DINO |
---|---|
![]() |
![]() |
- Consistent detection across scales
- Better confidence calibration
- Improved boundary precision
-
Detection Quality
- 75.9% increase in total detections
- 15% increase in perfect detections
- 10% reduction in under-detections
-
Precision Improvements
- Better bounding box localization
- More accurate confidence scoring
- Reduced false positives
-
Robustness Enhancements
- Improved handling of occlusions
- Better performance in crowded scenes
- Enhanced scale invariance
The fine-tuning experiments on the DINO model for pedestrian detection demonstrate significant improvements across multiple metrics. Key achievements include:
-
Quantitative Improvements
- Substantial increase in detection capability (75.9% more detections)
- Enhanced precision across all AP metrics
- Improved recall and reduced false positives
-
Qualitative Enhancements
- Better handling of complex scenes
- More robust to environmental variations
- Improved confidence calibration
-
Technical Advancements
- Successful adaptation to specific domain
- Effective utilization of multi-scale features
- Improved attention mechanisms
-
Practical Implications
- More reliable pedestrian detection
- Better performance in real-world scenarios
- Enhanced applicability for surveillance systems
- Further optimization for extremely crowded scenes
- Integration of temporal information
- Enhancement of real-time performance
- Extension to multiple camera views
All model outputs, trained weights, and visualization results are available at: Google Drive Link
The drive contains:
- Pre-trained and fine-tuned model weights
- Training logs and metrics
- Visualization results
- Attention maps
- Inference results for both models
- Complete evaluation metrics