State-of-the-art implementation of the CVPR 2016 paper:
A Hierarchical Deep Temporal Model for Group Activity Recognition
| Improvement | Impact |
|---|---|
| Modern Backbone | Replaced AlexNet with ResNet-50 for superior feature extraction |
| Framework Upgrade | Full PyTorch implementation (the original was Caffe) |
| Performance Boost | Achieved 92.3% accuracy |
The dataset consists of 4,830 annotated frames extracted from 55 YouTube volleyball videos. It includes:
- 8 team activity labels (e.g., "Left Spike", "Right Winpoint")
- 9 player action labels (e.g., "Blocking", "Setting")
- Player bounding boxes with action annotations
*Sample frame showing team activity and player bounding boxes.*
- Total Videos: 55 (IDs 0-54)
- Splits:
- Train: Videos 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38-42, 48, 50, 52-54
- Validation: Videos 0, 2, 8, 12, 17, 19, 24, 26-28, 30, 33, 46, 49, 51
- Test: Videos 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43-45, 47
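
For convenience, the same split can be written down in code. The following is only a sketch; the variable names are illustrative and not part of the repository:

```python
# Video IDs per split, exactly as listed above (names are illustrative).
TRAIN_VIDEOS = [1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36,
                38, 39, 40, 41, 42, 48, 50, 52, 53, 54]
VAL_VIDEOS = [0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51]
TEST_VIDEOS = [4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47]

# Sanity check: the three splits cover all 55 videos exactly once.
assert sorted(TRAIN_VIDEOS + VAL_VIDEOS + TEST_VIDEOS) == list(range(55))
```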
```
volleyball/
└── video_{ID}/                  # Each of the 55 videos (0-54)
    ├── frame_{timestamp_A}/     # First key moment (e.g. 29885)
    │   ├── 00001.jpg            # -20 frames
    │   ├── ...                  # ...
    │   ├── 00021.jpg            # Target frame (timestamp_A)
    │   ├── ...                  # ...
    │   └── 00041.jpg            # +20 frames
    ├── frame_{timestamp_B}/     # Second key moment (e.g. 29886)
    │   ├── 00001.jpg            # -20 frames
    │   └── ...                  # Same structure
    ├── ...                      # More frame directories
    └── annotations.txt          # Lists ALL key moments
```
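
A minimal loading sketch under the layout above. The helper names are hypothetical, and the `annotations.txt` parsing assumes each line starts with `<target frame>.jpg <group activity>` followed by per-player boxes and actions, as in the original dataset release:

```python
from pathlib import Path

def load_clip(dataset_root, video_id, timestamp):
    """Collect the 41 frame paths of one key moment (hypothetical helper)."""
    clip_dir = Path(dataset_root) / f"video_{video_id}" / f"frame_{timestamp}"
    # Frames are named 00001.jpg .. 00041.jpg; 00021.jpg is the target frame.
    frames = sorted(clip_dir.glob("*.jpg"))
    assert len(frames) == 41, f"expected 41 frames, got {len(frames)}"
    return frames

def read_activity_labels(dataset_root, video_id):
    """Map target-frame id -> group-activity label (hypothetical helper).

    Assumes each annotation line starts with '<frame>.jpg <activity>'; the
    remaining tokens describe player boxes and actions and are ignored here.
    """
    labels = {}
    ann_file = Path(dataset_root) / f"video_{video_id}" / "annotations.txt"
    for line in ann_file.read_text().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        frame_name, activity = tokens[0], tokens[1]
        # '29885.jpg' maps to the directory 'frame_29885'.
        labels[frame_name.replace(".jpg", "")] = activity
    return labels
```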
See the Original Dataset Repository for further information.
This section outlines the baselines, based on the CVPR 2016 paper *A Hierarchical Deep Temporal Model for Group Activity Recognition* by Ibrahim et al.
**B1**
- Architecture: Single-frame ResNet-50
- Description: A basic image-level classifier that processes the entire scene using a CNN (ResNet-50). A minimal sketch follows.
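
A minimal sketch of such a single-frame classifier in PyTorch, assuming torchvision >= 0.13 for the `weights` argument; the class name and the 8-class output size follow the dataset description above:

```python
import torch.nn as nn
from torchvision import models

NUM_ACTIVITIES = 8  # volleyball group-activity classes

class SingleFrameClassifier(nn.Module):
    """B1 sketch: classify the group activity from the whole frame."""
    def __init__(self, num_classes=NUM_ACTIVITIES):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.model = backbone

    def forward(self, images):     # images: (B, 3, H, W)
        return self.model(images)  # logits: (B, num_classes)
```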
**B2**
- Architecture: ResNet-50 per player → feature pooling → FC
- Description: CNN applied to each detected person individually. The extracted features are pooled across people and passed to a softmax classifier (see the sketch below).
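
A pooling sketch for this baseline; the use of max pooling across players and the tensor layout are assumptions for illustration:

```python
import torch.nn as nn
from torchvision import models

class PersonPooledClassifier(nn.Module):
    """B2 sketch: per-player ResNet-50 features -> pool over players -> FC."""
    def __init__(self, num_classes=8):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()  # keep the 2048-d pooled features
        self.backbone = backbone
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, player_crops):
        # player_crops: (B, P, 3, H, W) -- P cropped player images per frame
        B, P = player_crops.shape[:2]
        feats = self.backbone(player_crops.flatten(0, 1))  # (B*P, 2048)
        feats = feats.view(B, P, -1).amax(dim=1)           # max-pool over players
        return self.classifier(feats)                      # (B, num_classes)
```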
**B3**
- Architecture: ResNet-50 (fine-tuned for person actions) per player → feature pooling → FC
- Description: Similar to B2, but the CNN is fine-tuned for person-level action classification.
**B4**
- Architecture: ResNet-50 on the full image → LSTM → FC
- Description: Temporal extension of B1. Whole-image features are extracted and passed through an LSTM for sequence modeling.
**B5**
- Architecture: ResNet-50 per player → feature pooling per frame → LSTM → FC
- Description: Temporal extension of B2. Pooled person features over time are fed to an LSTM to model group activity sequences (see the sketch below).
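
A temporal sketch for this baseline, assuming the per-frame pooled player features have already been extracted by the CNN; the hidden size and last-time-step readout are illustrative choices:

```python
import torch.nn as nn

class PooledPersonLSTM(nn.Module):
    """B5 sketch: per-frame pooled player features fed to an LSTM over time."""
    def __init__(self, feat_dim=2048, hidden=512, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, pooled_feats):
        # pooled_feats: (B, T, feat_dim) -- CNN features pooled across players
        # for each of the T frames in the clip.
        out, _ = self.lstm(pooled_feats)
        return self.classifier(out[:, -1])  # classify from the last time step
```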
**B6**
- Architecture: ResNet-50 per player (fine-tuned) → pooled → Group-LSTM → FC
- Description: Similar to the full model, but removes the first LSTM, which models individual person dynamics; only a group-level LSTM is used.
**B7**
- Architecture: ResNet-50 per player (fine-tuned) → Player-LSTM → pooled → FC
- Description: Omits the group-level LSTM. Temporal modeling is applied only at the player level, followed by feature pooling and final classification.
**B8**
- Architecture: ResNet-50 (fine-tuned per player) → Player-LSTM → pooling → Group-LSTM → FC
- Description: The complete two-stage model proposed in the paper. It captures both individual temporal actions and group-level temporal dynamics (see the sketch below).
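
A compact sketch of the two-stage idea, with CNN features for the player crops assumed to be precomputed; the dimensions and the max-pooling step are illustrative, not the repository's exact settings:

```python
import torch.nn as nn

class HierarchicalTemporalModel(nn.Module):
    """B8 sketch: Player-LSTM -> pool over players -> Group-LSTM -> FC."""
    def __init__(self, feat_dim=2048, person_hidden=512, group_hidden=512,
                 num_classes=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, person_hidden, batch_first=True)
        self.group_lstm = nn.LSTM(person_hidden, group_hidden, batch_first=True)
        self.classifier = nn.Linear(group_hidden, num_classes)

    def forward(self, player_feats):
        # player_feats: (B, T, P, D) -- CNN features for P players over T frames
        B, T, P, D = player_feats.shape
        # Stage 1: model each player's temporal dynamics independently.
        person_in = player_feats.permute(0, 2, 1, 3).reshape(B * P, T, D)
        person_out, _ = self.person_lstm(person_in)      # (B*P, T, person_hidden)
        person_out = person_out.reshape(B, P, T, -1)
        # Pool across players at every time step.
        group_in = person_out.amax(dim=1)                # (B, T, person_hidden)
        # Stage 2: model group-level dynamics.
        group_out, _ = self.group_lstm(group_in)
        return self.classifier(group_out[:, -1])         # (B, num_classes)
```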
1. **Temporal modeling is essential.** Comparing B3 and B6 shows that:
   - adding an LSTM improves accuracy
   - temporal dynamics are critical for activity understanding
*Comparison figures: B3 (without temporal modeling) vs. B6 (with Player-LSTM).*
2. **Team-aware pooling.**
   - Processing each team's features independently (the hierarchical two-stage temporal model) improves accuracy and reduces confusion between the Left Winpoint and Right Winpoint classes. A minimal pooling sketch follows the comparison below.
*Comparison figures: B7 (without team-aware pooling) vs. B8 (with team-aware pooling).*
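
A minimal sketch of the team-aware pooling idea: pool each team's player features separately, then concatenate the two descriptors. The index-based team split and the max pooling are assumptions for illustration:

```python
import torch

def team_aware_pool(player_feats, left_ids, right_ids):
    """Pool each team's player features separately, then concatenate (sketch).

    player_feats: (B, P, D) features for the P players in a frame;
    left_ids / right_ids: index lists splitting the players into the two teams.
    """
    left = player_feats[:, left_ids].amax(dim=1)    # (B, D) left-team descriptor
    right = player_feats[:, right_ids].amax(dim=1)  # (B, D) right-team descriptor
    return torch.cat([left, right], dim=-1)         # (B, 2*D) team-aware descriptor
```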