A comprehensive Gradio application that combines Vision-Language Models (VLLMs) with computer vision techniques to analyze traffic scenes and detect license plates with high accuracy.
This application integrates multiple state-of-the-art AI models to provide:
- Traffic Scene Description - Using LLaVA-NeXT for comprehensive scene understanding
- License Plate Detection - Using YOLOv11 for accurate plate localization
- Text Extraction - Using PaddleOCR with advanced preprocessing for text recognition
- Structured Output - JSON format combining all analysis results
- π Multi-modal Analysis: Combines vision and language understanding
- π― Accurate Detection: YOLOv11-based license plate detection
- π Robust OCR: PaddleOCR with preprocessing and confidence filtering
- βοΈ Parameter Control: Adjustable thresholds and generation parameters
- π Optimized Performance: Memory-efficient model loading with quantization
- π Structured Output: Task-compliant JSON format
- YOLO Confidence Threshold (0.1-1.0): Controls detection sensitivity
- OCR Confidence Threshold (0.0-1.0): Filters low-quality text recognition
- VLLM Temperature (0.1-1.0): Controls creativity/randomness in descriptions
- VLLM Top-p (0.1-1.0): Controls diversity vs focus in language generation
- Platform: Kaggle
- Python: 3.8+
- CUDA: Compatible GPU for optimal performance (Kaggle typically provides a T4 or P100 GPU)
# Install packages
!pip install gradio ultralytics paddlepaddle paddleocr transformers torch torchvision accelerate bitsandbytes opencv-python pillow numpy
Upload custom YOLOv11 model trained for license plate detection
/kaggle/input/license_plate_detect_yolo11/pytorch/default/1/best.pt
Kaggle model link: https://www.kaggle.com/models/suhailaaboubakr/license_plate_detect_yolo11/
- Model:
llava-hf/llava-v1.6-mistral-7b-hf
- Quantization: 4-bit with BitsAndBytesConfig
- Purpose: Generate comprehensive traffic scene descriptions
- Optimization: Memory-efficient loading with device mapping
- Model: Custom trained
best.pt
- Input Size: 640x640 optimized
- Purpose: Detect and localize license plates in images
- Output: Bounding boxes with confidence scores
- Language: English optimized
- Features: Angle classification, text detection & recognition
- Preprocessing: CLAHE enhancement, noise reduction, thresholding
- Purpose: Extract text from detected license plate regions
- Upload Image: Select a traffic scene image
- Adjust Parameters (optional):
- YOLO Confidence: 0.5 (default)
- OCR Confidence: 0.1 (default)
- Temperature: 0.7 (default)
- Top-p: 0.9 (default)
- Click Submit: Process the image
- Review Results: Scene description, plate details, and JSON output



- Low (0.1-0.3): Detects more plates, higher false positive rate
- Medium (0.4-0.6): Balanced detection, recommended for most cases
- High (0.7-1.0): Only high-confidence detections, may miss some plates
- Very Low (0.0-0.1): Accept all OCR results, may include noise
- Low (0.1-0.3): Accept most readable text, some false readings
- Medium (0.3-0.6): Good balance of accuracy and recall
- High (0.6-1.0): Only high-quality text, may miss valid plates
- Low (0.1-0.3): More focused, factual descriptions
- Medium (0.4-0.6): Balanced creativity and accuracy
- High (0.7-1.0): More creative, potentially less accurate
- Low (0.1-0.5): Conservative vocabulary, more predictable
- Medium (0.6-0.8): Balanced diversity
- High (0.9-1.0): Maximum vocabulary diversity
The application outputs a structured JSON following the task specifications:
{
"scene_description": "",
"total_plates_detected": 1,
"license_plates": [
{
"bbox": [98, 82, 249, 140],
"detection_confidence": 0.5,
"plate_text": "",
"ocr_confidence": 0.9
}
],
"parameters_used": {
"yolo_confidence_threshold": 0.5,
"ocr_confidence_threshold": 0.5,
"vllm_temperature": 0.7,
"vllm_top_p": 0.5
}
}
The VLLM prompt is carefully crafted to extract maximum traffic-relevant information:
Analyze this traffic scene in detail. Describe:
1. Types of vehicles present (cars, trucks, motorcycles, etc.)
2. Traffic signs, signals, and road markings visible
3. Road conditions and infrastructure
4. Weather and lighting conditions
5. Overall traffic flow and density
6. Any notable safety considerations or hazards
Rationale:
- Structured approach: Numbered points ensure comprehensive coverage
- Traffic-focused: Specifically targets transportation elements
- Safety-oriented: Includes hazard identification
- Detailed yet concise: Balances thoroughness with readability
Default Temperature (0.7):
- Balances factual accuracy with descriptive richness
- Avoids overly repetitive descriptions
- Maintains focus on observable elements
Default Top-p (0.9):
- Allows diverse vocabulary while maintaining coherence
- Prevents overly conservative language choices
- Enables detailed technical descriptions
-
Memory Management:
- Use model quantization (4-bit enabled by default)
- Clear GPU cache between runs if needed
- Monitor VRAM usage
-
Speed Optimization:
- Resize large images before processing
- Use appropriate batch sizes
- Enable half-precision when supported
- Load models before inference
-
Accuracy Improvement:
- Use high-quality input images
- Adjust confidence thresholds based on use case
- Consider image preprocessing for difficult lighting
- Performance depends on image quality and lighting
- OCR accuracy varies with plate condition and angle
- Complex scenes may require parameter adjustment
- GPU memory limits maximum image resolution