ISAS 2025: Abnormal Behavior Recognition using Pose-Based Feature Engineering and Deep Ensemble Learning
We propose an abnormal activity recognition system for individuals with developmental disabilities using 2D pose keypoint data and deep learning. The solution consists of:
- A feature engineering pipeline crafted from raw keypoints
- A dual deep learning ensemble combining Bi-LSTM and Transformer, optimized for detecting abrupt behaviors like "Attacking" or "Throwing things"
Authors: The Hoang Nguyen, Gia Huy Ly, and Duy Khanh Dinh Hoang, all students at VNU-HCM, Ho Chi Minh City University of Technology (HCMUT).
The dataset is provided by the ISAS 2025 challenge:
- 4 subjects for training and 1 for testing in each LOSO (Leave-One-Subject-Out) fold; see the split sketch after this list
- 8 labeled activities: 4 normal (e.g., Sitting, Walking) and 4 abnormal (e.g., Biting nails, Attacking)
- Pose keypoints extracted via YOLOv7 at 30 FPS
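The LOSO protocol itself is standard; a minimal sketch of how the splits can be generated with scikit-learn's `LeaveOneGroupOut` (the placeholder data and variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: X = per-frame features, y = activity labels,
# groups = the subject ID (1..5) that produced each frame.
rng = np.random.default_rng(0)
X = rng.random((1000, 70))
y = rng.integers(0, 8, size=1000)
groups = rng.integers(1, 6, size=1000)

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups), 1):
    print(f"Fold {fold}: train on subjects {np.unique(groups[train_idx])}, "
          f"test on subject {np.unique(groups[test_idx])[0]}")
```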
Main challenges:
- Data imbalance: more normal than abnormal frames
- Temporal variability between activity types
- Subject-specific differences in motion styles
- Short and unpredictable unusual behaviors (e.g., Attacking)
We designed over 70 continuous features per frame from keypoints to capture motion, geometry, asymmetry, and temporal-frequency characteristics:
| Feature Group | Description |
|---|---|
| Motion | Velocity, acceleration, jerk of hands and nose |
| Geometric | Euclidean distances (e.g., hand–nose), joint angles (elbow, knee), torso angle, hand-above-shoulder flag |
| Asymmetry | Speed and position differences between left/right hands |
| Temporal statistics | Rolling mean, std, and max (1.5 s window ≈ 45 frames) |
| Frequency & regularity | Dominant frequency (FFT), zero-crossing rate (ZCR), movement regularity |
All features are computed on interpolated and smoothed keypoints to reduce noise.
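As a hedged illustration of this pipeline, the sketch below computes a small subset of these features with pandas; the keypoint column names (`rwrist_x`, `nose_x`, ...) are assumptions, not the project's actual schema:

```python
import numpy as np
import pandas as pd

FPS, WIN = 30, 45  # 30 FPS; 1.5 s rolling window = 45 frames

def frame_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative subset of the per-frame features. `df` holds the
    interpolated, smoothed keypoints, e.g. rwrist_x/rwrist_y, nose_x/nose_y."""
    out = pd.DataFrame(index=df.index)

    # Motion: velocity, acceleration, jerk of the right wrist.
    vx = df["rwrist_x"].diff() * FPS
    vy = df["rwrist_y"].diff() * FPS
    speed = pd.Series(np.hypot(vx, vy), index=df.index)
    out["rwrist_speed"] = speed
    out["rwrist_accel"] = speed.diff() * FPS
    out["rwrist_jerk"] = out["rwrist_accel"].diff() * FPS

    # Geometry: Euclidean hand-nose distance.
    out["rhand_nose_dist"] = np.hypot(df["rwrist_x"] - df["nose_x"],
                                      df["rwrist_y"] - df["nose_y"])

    # Temporal statistics over the 1.5 s window.
    out["rwrist_speed_mean"] = speed.rolling(WIN).mean()
    out["rwrist_speed_std"] = speed.rolling(WIN).std()
    out["rwrist_speed_max"] = speed.rolling(WIN).max()

    # Regularity: zero-crossing rate of the mean-removed x-velocity.
    centered = vx - vx.rolling(WIN).mean()
    out["rwrist_vx_zcr"] = np.sign(centered).diff().abs().rolling(WIN).mean() / 2
    return out
```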
Bi-LSTM model:
- Two Bi-LSTM layers (128 and 64 units)
- BatchNorm and Dropout for generalization
- Effective for repetitive behaviors (Walking, Sitting)
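A minimal Keras sketch of this architecture, assuming the 60-frame input windows listed in the configuration table below; the dropout rate and optimizer are illustrative choices, not confirmed hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(seq_len=60, n_features=70, n_classes=8):
    # Two stacked Bi-LSTM layers (128 and 64 units) with
    # BatchNorm + Dropout between them, as described above.
    inputs = keras.Input(shape=(seq_len, n_features))
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```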
Hybrid model:
- Bi-LSTM for short-term motion encoding
- Transformer encoder for long-range, non-linear dependencies
- Effective for bursty behaviors (Attacking, Throwing things)
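A corresponding sketch of the hybrid model; the attention head count, feed-forward width, and pooling choice are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_hybrid(seq_len=60, n_features=70, n_classes=8,
                 heads=4, key_dim=32):
    # Bi-LSTM front end for short-term motion, followed by a
    # Transformer encoder block for long-range dependencies.
    inputs = keras.Input(shape=(seq_len, n_features))
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(inputs)

    # Transformer encoder block: self-attention + feed-forward,
    # each with a residual connection and layer normalization.
    attn = layers.MultiHeadAttention(num_heads=heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(128, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    x = layers.LayerNormalization()(x + ff)

    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```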
The ensemble takes a weighted average of the two models' softmax probabilities:
- Bi-LSTM: 52%
- Hybrid: 48%
Weights tuned via LOSO cross-validation.
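In code the fusion is a one-liner; a sketch assuming both models output per-window softmax probabilities:

```python
import numpy as np

def ensemble_predict(p_lstm: np.ndarray, p_hybrid: np.ndarray,
                     w_lstm: float = 0.52) -> np.ndarray:
    """Weighted average of softmax probabilities, then argmax.
    p_lstm and p_hybrid have shape (n_windows, n_classes)."""
    probs = w_lstm * p_lstm + (1.0 - w_lstm) * p_hybrid
    return probs.argmax(axis=1)
```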
| Component | Value |
|---|---|
| Frame rate | 30 FPS |
| Input sequence length | 60 frames (≈ 2 seconds) |
| Feature rolling window | 45 frames (≈ 1.5 seconds) |
| Overlap rate | ~90% |
| Subjects (rotated through LOSO folds) | 1, 2, 3, 4, 5 |
Sliding window segmentation ensures dense sampling for short-duration activities.
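A minimal sketch of this segmentation; a stride of 6 frames over 60-frame windows yields the ~90% overlap quoted above, and the majority-label rule is an assumption:

```python
import numpy as np

def sliding_windows(features: np.ndarray, labels: np.ndarray,
                    seq_len: int = 60, stride: int = 6):
    """Segment a per-frame feature matrix (n_frames, n_features) into
    overlapping windows; stride=6 on 60-frame windows gives ~90% overlap.
    Each window is labeled by the majority label of its frames."""
    X, y = [], []
    for start in range(0, len(features) - seq_len + 1, stride):
        end = start + seq_len
        X.append(features[start:end])
        vals, counts = np.unique(labels[start:end], return_counts=True)
        y.append(vals[counts.argmax()])
    return np.stack(X), np.array(y)
```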
- Input: unlabeled pose sequences
- Output: `participant_id, timestamp, predicted_label`
- Metrics: Accuracy, Abnormal F1-score, Precision, Recall
- Evaluate model generalization on the unseen subject
- Submit a LOSO-specific evaluation report
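These metrics map directly onto scikit-learn; a sketch in which the abnormal F1 is computed as a macro average over the four abnormal classes (an assumption about the challenge's exact definition):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

ABNORMAL = ["Attacking", "Biting nails", "Head banging", "Throwing things"]

def report(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    # Macro precision/recall/F1 restricted to the four abnormal classes.
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=ABNORMAL, average="macro", zero_division=0)
    print(f"Accuracy {acc:.3f} | abnormal F1 {f1:.3f} "
          f"| precision {prec:.3f} | recall {rec:.3f}")
```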
Below is the average per-class F1-score across all LOSO folds using the Ensemble model (Bi-LSTM + Hybrid):
| Abnormal Behavior | Average F1-score |
|---|---|
| Attacking | 76.58% |
| Biting nails | 71.86% |
| Head banging | 78.42% |
| Throwing things | 77.20% |

| Normal Behavior | Average F1-score |
|---|---|
| Eating snacks | 61.32% |
| Sitting quietly | 45.40% |
| Using phone | 37.40% |
| Walking | 94.54% |
Submission file: `[team_name]_test.csv` with format `participant_id, timestamp, predicted_label`
- Engineered 70+ temporal and geometric features from 2D keypoints
- Designed a hybrid deep model suitable for both smooth and irregular behaviors
- Applied ensemble fusion for improved recognition accuracy
- Tuned temporal parameters using rolling-window analysis and LLM-guided prompting
- Achieved high performance on short, bursty, and challenging behaviors
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Step 1:
- Script: `process_data.py`
- Input: `./data/keypointlabel/keypoints_with_labels_<id>.csv` for IDs 1, 2, 3, 5
- Output: `features_continuous_unfiltered.csv`
- Extracts 70+ handcrafted features (motion, geometry, asymmetry, temporal-frequency)
Step 2:
- Script: `train_hybrid_tuned.py`
- Output: `saved_models/hybrid_tuned/` containing `best_model_fold_<id>.keras`, `scaler_fold_<id>.joblib`, `label_encoder.joblib`, and `feature_cols.json`
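The saved artifacts can be reloaded for evaluation or inference; a minimal sketch using the file layout listed above (the fold ID is illustrative):

```python
import json
import joblib
from tensorflow import keras

fold = 1  # illustrative fold ID
base = "saved_models/hybrid_tuned"

model = keras.models.load_model(f"{base}/best_model_fold_{fold}.keras")
scaler = joblib.load(f"{base}/scaler_fold_{fold}.joblib")
encoder = joblib.load(f"{base}/label_encoder.joblib")
with open(f"{base}/feature_cols.json") as f:
    feature_cols = json.load(f)

# Typical use: select feature_cols from the feature DataFrame, apply the
# scaler, window the result, then decode predictions with
# encoder.inverse_transform(model.predict(X).argmax(axis=1)).
```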
Step 3:
- Script: `train_lstm_tuned.py`
- Output: `saved_models/lstm_tuned/` containing `best_model_fold_<id>.keras`, `scaler_fold_<id>.joblib`, `label_encoder.joblib`, and `feature_cols.json`
Step 4:
- Script: `ensemble_loso.py`
- Input: trained models from Steps 2 and 3
- Output: prints accuracy, F1-score, and per-fold classification reports
Step 5:
- Script: `train_final_lstm_tuned_model.py`
- Output: `final_lstm_tuned_model_artifacts/`
- Includes the full model, scaler, label encoder, and selected features
Step 6:
- Script: `train_final_hybrid_model.py`
- Output: `final_hybrid_model_artifacts/`
- Similar structure; includes the Transformer block for long-range dependencies
Step 7:
- Script: `process_data_test.py`
- Input: `test data_keypoint.csv`
- Output: `features_test.csv`
- Same features as training, built from interpolated keypoints
Step 8:
- Script: `create_submission.py`
- Combines: `final_lstm_tuned_model` + `final_hybrid_model` (soft voting: 52/48)
- Output: `Binary_Phoenix_test.csv` with `participant_id`, `timestamp`, `predicted_label`
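A sketch of this fusion-and-export step; `probs_lstm`, `probs_hybrid`, `meta`, and `encoder` are assumed inputs (per-window softmax outputs, window metadata, and the saved label encoder):

```python
import pandas as pd

def write_submission(probs_lstm, probs_hybrid, meta, encoder,
                     path="Binary_Phoenix_test.csv", w_lstm=0.52):
    """meta is a DataFrame with one row per window carrying
    participant_id and timestamp; encoder is the saved LabelEncoder."""
    probs = w_lstm * probs_lstm + (1.0 - w_lstm) * probs_hybrid
    pd.DataFrame({
        "participant_id": meta["participant_id"].values,
        "timestamp": meta["timestamp"].values,
        "predicted_label": encoder.inverse_transform(probs.argmax(axis=1)),
    }).to_csv(path, index=False)
```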
Task 2, Step 1:
- Script: `task_2_processdata.py`
- Input: `./data/keypointlabel/keypoints_with_labels_<id>.csv` for IDs 1, 2, 3, 4, 5
- Output: `final.csv`
- Description: Extracts 70+ handcrafted features (motion, geometry, asymmetry, temporal-frequency) for all 5 participants. Data is interpolated and cleaned to ensure a consistent feature space.
Task 2, Step 2:
- Script: `task_2_hybrid_tuned.py`
- Input: `final.csv` from Step 1
- Output: `final_report_saved_models/hybrid_tuned/` containing `best_model_fold_<id>.keras` and `scaler_fold_<id>.joblib` for each fold, `encoder.joblib` (label encoder used for all folds), and `feature_columns.joblib` (features used by the model)
- Description: Trains the Hybrid model with LOSO across 5 participants. Each fold trains on 4 participants and tests on the remaining participant.
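A hedged sketch of such a per-fold loop, reusing `build_hybrid` and `sliding_windows` from the sketches above; `df` (loaded from `final.csv`), `feature_cols`, and the training hyperparameters are assumptions:

```python
import joblib
from sklearn.preprocessing import LabelEncoder, StandardScaler

SUBJECTS = [1, 2, 3, 4, 5]
OUT = "final_report_saved_models/hybrid_tuned"

encoder = LabelEncoder().fit(df["label"])  # shared across folds
joblib.dump(encoder, f"{OUT}/encoder.joblib")
joblib.dump(feature_cols, f"{OUT}/feature_columns.joblib")

for held_out in SUBJECTS:
    train_df = df[df["participant_id"] != held_out]

    # Fit scaler on the 4 training participants only, then window.
    # (Real code would window per recording to avoid crossing boundaries.)
    scaler = StandardScaler().fit(train_df[feature_cols])
    X_train, y_train = sliding_windows(
        scaler.transform(train_df[feature_cols]),
        encoder.transform(train_df["label"]))

    model = build_hybrid(n_features=len(feature_cols),
                         n_classes=len(encoder.classes_))
    model.fit(X_train, y_train, epochs=30, batch_size=64, verbose=0)

    model.save(f"{OUT}/best_model_fold_{held_out}.keras")
    joblib.dump(scaler, f"{OUT}/scaler_fold_{held_out}.joblib")
```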
Task 2, Step 3:
- Script: `task_2_lstm_tuned.py`
- Input: `final.csv` from Step 1
- Output: `final_report_saved_models/lstm_tuned/` containing `best_model_fold_<id>.keras` and `scaler_fold_<id>.joblib` for each fold, `encoder.joblib`, and `feature_columns.joblib`
- Description: Trains the Deep Bi-LSTM model with LOSO across 5 participants. Results are saved per fold for the later ensemble.
Task 2, Step 4:
- Script: `task_2_ensembleloso.py`
- Input: models from Steps 2 and 3
- Output:
  - LOSO per-fold Accuracy and Macro F1-score
  - Mean Accuracy and Macro F1-score across 5 folds
  - `ensemble_confusion_matrix.png` (aggregated confusion matrix over 5 folds)
- Description: Combines predictions from the Hybrid and LSTM models using soft voting (Hybrid 48%, LSTM 52%). Evaluates model generalization to unseen participants.
- Script: `weighted.py`
- Purpose: Finds the optimal weighting between the Hybrid and LSTM models under soft voting, maximizing the weighted F1-score (abnormal classes are prioritized via class weights).
- How it works:
  - Performs a grid search (e.g., Hybrid weights from 0.0 to 1.0)
  - Uses preloaded fold predictions from both models
  - Applies a ×3 weight to abnormal activity classes during `f1_score` computation
  - Prints the scores and the best weight combination
- Note: Run this script before `ensemble_loso.py` to determine the best weight ratio.
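A minimal sketch of this search; `classes` is assumed to map probability columns to label names (e.g., `encoder.classes_`), and the inputs are the stacked per-fold predictions:

```python
import numpy as np
from sklearn.metrics import f1_score

ABNORMAL = {"Attacking", "Biting nails", "Head banging", "Throwing things"}

def best_weight(p_hybrid, p_lstm, y_true, classes):
    """Grid search over the Hybrid weight w; the LSTM gets (1 - w).
    Windows whose true class is abnormal count x3 in the weighted F1."""
    sample_w = np.where(np.isin(y_true, list(ABNORMAL)), 3.0, 1.0)
    best = (None, -1.0)
    for w in np.arange(0.0, 1.01, 0.05):
        pred = classes[(w * p_hybrid + (1 - w) * p_lstm).argmax(axis=1)]
        score = f1_score(y_true, pred, average="weighted",
                         sample_weight=sample_w)
        if score > best[1]:
            best = (w, score)
    return best  # (hybrid weight, weighted F1)
```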