State-of-the-art implementation of the CVPR 2016 paper:
A Hierarchical Deep Temporal Model for Group Activity Recognition
| Improvement | Impact |
|---|---|
| Modern Backbone | Replaced AlexNet with ResNet-50 for superior feature extraction |
| Framework Upgrade | Full PyTorch implementation (the original was Caffe) |
| Performance Boost | Achieved 92.3% accuracy |
The dataset consists of 4,830 annotated frames extracted from 55 YouTube volleyball videos. It includes:
- 8 team activity labels (e.g., "Left Spike", "Right Winpoint")
- 9 player action labels (e.g., "Blocking", "Setting")
- Player bounding boxes with action annotations
*Sample frame showing team activity and player bounding boxes.*
- Total Videos: 55 (IDs 0-54)
- Splits:
- Train: Videos 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38-42, 48, 50, 52-54
- Validation: Videos 0, 2, 8, 12, 17, 19, 24, 26-28, 30, 33, 46, 49, 51
- Test: Videos 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43-45, 47
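
For convenience, the same split can be written down in code. The following is only a sketch; the variable names are illustrative and not part of the repository:

```python
# Video IDs per split, exactly as listed above (names are illustrative).
TRAIN_VIDEOS = [1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36,
                38, 39, 40, 41, 42, 48, 50, 52, 53, 54]
VAL_VIDEOS = [0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51]
TEST_VIDEOS = [4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47]

# Sanity check: the three splits cover all 55 videos exactly once.
assert sorted(TRAIN_VIDEOS + VAL_VIDEOS + TEST_VIDEOS) == list(range(55))
```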
```
volleyball/
└── video_{ID}/                  # Each of the 55 videos (0-54)
    ├── frame_{timestamp_A}/     # First key moment (e.g. 29885)
    │   ├── 00001.jpg            # -20 frames
    │   ├── ...                  # ...
    │   ├── 00021.jpg            # Target frame (timestamp_A)
    │   ├── ...                  # ...
    │   └── 00041.jpg            # +20 frames
    ├── frame_{timestamp_B}/     # Second key moment (e.g. 29886)
    │   ├── 00001.jpg            # -20 frames
    │   └── ...                  # Same structure
    ├── ...                      # More frame directories
    └── annotations.txt          # Lists ALL key moments
```
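
A minimal loading sketch under the layout above. The helper names are hypothetical, and the `annotations.txt` parsing assumes each line starts with `<target frame>.jpg <group activity>` followed by per-player boxes and actions, as in the original dataset release:

```python
from pathlib import Path

def load_clip(dataset_root, video_id, timestamp):
    """Collect the 41 frame paths of one key moment (hypothetical helper)."""
    clip_dir = Path(dataset_root) / f"video_{video_id}" / f"frame_{timestamp}"
    # Frames are named 00001.jpg .. 00041.jpg; 00021.jpg is the target frame.
    frames = sorted(clip_dir.glob("*.jpg"))
    assert len(frames) == 41, f"expected 41 frames, got {len(frames)}"
    return frames

def read_activity_labels(dataset_root, video_id):
    """Map target-frame id -> group-activity label (hypothetical helper).

    Assumes each annotation line starts with '<frame>.jpg <activity>'; the
    remaining tokens describe player boxes and actions and are ignored here.
    """
    labels = {}
    ann_file = Path(dataset_root) / f"video_{video_id}" / "annotations.txt"
    for line in ann_file.read_text().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        frame_name, activity = tokens[0], tokens[1]
        # '29885.jpg' maps to the directory 'frame_29885'.
        labels[frame_name.replace(".jpg", "")] = activity
    return labels
```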
See the Original Dataset Repository for further information.
This section outlines the baselines, based on the CVPR 2016 paper *A Hierarchical Deep Temporal Model for Group Activity Recognition* by Ibrahim et al.
**B1**
- Architecture: Single-frame ResNet-50
- Description: A basic image-level classifier that processes the entire scene using a CNN (ResNet-50). A minimal sketch follows.
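
A minimal sketch of such a single-frame classifier in PyTorch, assuming torchvision >= 0.13 for the `weights` argument; the class name and the 8-class output size follow the dataset description above:

```python
import torch.nn as nn
from torchvision import models

NUM_ACTIVITIES = 8  # volleyball group-activity classes

class SingleFrameClassifier(nn.Module):
    """B1 sketch: classify the group activity from the whole frame."""
    def __init__(self, num_classes=NUM_ACTIVITIES):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.model = backbone

    def forward(self, images):     # images: (B, 3, H, W)
        return self.model(images)  # logits: (B, num_classes)
```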
**B2**
- Architecture: ResNet-50 per player → feature pooling → FC
- Description: CNN applied to each detected person individually. The extracted features are pooled across people and passed to a softmax classifier (see the sketch below).
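
A pooling sketch for this baseline; the use of max pooling across players and the tensor layout are assumptions for illustration:

```python
import torch.nn as nn
from torchvision import models

class PersonPooledClassifier(nn.Module):
    """B2 sketch: per-player ResNet-50 features -> pool over players -> FC."""
    def __init__(self, num_classes=8):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()  # keep the 2048-d pooled features
        self.backbone = backbone
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, player_crops):
        # player_crops: (B, P, 3, H, W) -- P cropped player images per frame
        B, P = player_crops.shape[:2]
        feats = self.backbone(player_crops.flatten(0, 1))  # (B*P, 2048)
        feats = feats.view(B, P, -1).amax(dim=1)           # max-pool over players
        return self.classifier(feats)                      # (B, num_classes)
```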
**B3**
- Architecture: ResNet-50 (fine-tuned for person actions) per player → feature pooling → FC
- Description: Similar to B2, but the CNN is fine-tuned for person-level action classification.
**B4**
- Architecture: ResNet-50 on the full image → LSTM → FC
- Description: Temporal extension of B1. Whole-image features are extracted and passed through an LSTM for sequence modeling.
**B5**
- Architecture: ResNet-50 per player → feature pooling per frame → LSTM → FC
- Description: Temporal extension of B2. Pooled person features over time are fed to an LSTM to model group activity sequences (see the sketch below).
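
A temporal sketch for this baseline, assuming the per-frame pooled player features have already been extracted by the CNN; the hidden size and last-time-step readout are illustrative choices:

```python
import torch.nn as nn

class PooledPersonLSTM(nn.Module):
    """B5 sketch: per-frame pooled player features fed to an LSTM over time."""
    def __init__(self, feat_dim=2048, hidden=512, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, pooled_feats):
        # pooled_feats: (B, T, feat_dim) -- CNN features pooled across players
        # for each of the T frames in the clip.
        out, _ = self.lstm(pooled_feats)
        return self.classifier(out[:, -1])  # classify from the last time step
```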
**B6**
- Architecture: ResNet-50 per player (fine-tuned) → pooled → Group-LSTM → FC
- Description: Similar to the full model, but removes the first LSTM, which models individual person dynamics; only a group-level LSTM is used.
**B7**
- Architecture: ResNet-50 per player (fine-tuned) → Player-LSTM → pooled → FC
- Description: Omits the group-level LSTM. Temporal modeling is applied only at the player level, followed by feature pooling and final classification.
**B8**
- Architecture: ResNet-50 (fine-tuned per player) → Player-LSTM → pooling → Group-LSTM → FC
- Description: The complete two-stage model proposed in the paper. It captures both individual temporal actions and group-level temporal dynamics (see the sketch below).
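
A compact sketch of the two-stage idea, with CNN features for the player crops assumed to be precomputed; the dimensions and the max-pooling step are illustrative, not the repository's exact settings:

```python
import torch.nn as nn

class HierarchicalTemporalModel(nn.Module):
    """B8 sketch: Player-LSTM -> pool over players -> Group-LSTM -> FC."""
    def __init__(self, feat_dim=2048, person_hidden=512, group_hidden=512,
                 num_classes=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, person_hidden, batch_first=True)
        self.group_lstm = nn.LSTM(person_hidden, group_hidden, batch_first=True)
        self.classifier = nn.Linear(group_hidden, num_classes)

    def forward(self, player_feats):
        # player_feats: (B, T, P, D) -- CNN features for P players over T frames
        B, T, P, D = player_feats.shape
        # Stage 1: model each player's temporal dynamics independently.
        person_in = player_feats.permute(0, 2, 1, 3).reshape(B * P, T, D)
        person_out, _ = self.person_lstm(person_in)      # (B*P, T, person_hidden)
        person_out = person_out.reshape(B, P, T, -1)
        # Pool across players at every time step.
        group_in = person_out.amax(dim=1)                # (B, T, person_hidden)
        # Stage 2: model group-level dynamics.
        group_out, _ = self.group_lstm(group_in)
        return self.classifier(group_out[:, -1])         # (B, num_classes)
```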
1. **Temporal modeling is essential.** Comparing B3 and B6 shows that:
   - adding an LSTM improves accuracy
   - temporal dynamics are critical for activity understanding
*Comparison figures: B3 (without temporal modeling) vs. B6 (with Player-LSTM).*
2. **Team-aware pooling.**
   - Processing each team's features independently (the hierarchical two-stage temporal model) improves accuracy and reduces confusion between the Left Winpoint and Right Winpoint classes. A minimal pooling sketch follows the comparison below.
*Comparison figures: B7 (without team-aware pooling) vs. B8 (with team-aware pooling).*
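
A minimal sketch of the team-aware pooling idea: pool each team's player features separately, then concatenate the two descriptors. The index-based team split and the max pooling are assumptions for illustration:

```python
import torch

def team_aware_pool(player_feats, left_ids, right_ids):
    """Pool each team's player features separately, then concatenate (sketch).

    player_feats: (B, P, D) features for the P players in a frame;
    left_ids / right_ids: index lists splitting the players into the two teams.
    """
    left = player_feats[:, left_ids].amax(dim=1)    # (B, D) left-team descriptor
    right = player_feats[:, right_ids].amax(dim=1)  # (B, D) right-team descriptor
    return torch.cat([left, right], dim=-1)         # (B, 2*D) team-aware descriptor
```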