VideoAnnotator


Automated video analysis toolkit for human interaction research - Extract comprehensive behavioral annotations from videos using AI pipelines, with an intuitive web interface for visualization and analysis.

🎯 What is VideoAnnotator?

VideoAnnotator automatically analyzes videos of human interactions and extracts rich behavioral data including:

  • 👥 Person tracking - Multi-person detection and pose estimation with persistent IDs
  • 😊 Facial analysis - Emotions, expressions, gaze direction, and action units
  • 🎬 Scene detection - Environment classification and temporal segmentation
  • 🎤 Audio analysis - Speech recognition, speaker identification, and emotion detection

Perfect for researchers studying parent-child interactions, social behavior, developmental psychology, and human-computer interaction.

๐Ÿ–ฅ๏ธ Complete Solution: Processing + Visualization

VideoAnnotator provides both automated processing and interactive visualization:

📹 VideoAnnotator (this repository)

AI-powered video processing pipeline

  • Processes videos to extract behavioral annotations
  • REST API for integration with research workflows
  • Supports batch processing and custom configurations
  • Outputs standardized JSON data

🎞️ Video Annotation Viewer (companion repository)

Interactive web-based visualization tool

  • Load and visualize VideoAnnotator results
  • Synchronized video playback with annotation overlays
  • Timeline scrubbing with pose, face, and audio data
  • Export tools for further analysis

Complete workflow: Your Videos → [VideoAnnotator Processing] → Annotation Data → [Video Annotation Viewer] → Interactive Analysis

🚀 Get Started in 60 Seconds

1. Quick Setup

# Install modern Python package manager
curl -LsSf https://astral.sh/uv/install.sh | sh  # Linux/Mac
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows

# Clone and install
git clone https://github.com/InfantLab/VideoAnnotator.git
cd VideoAnnotator
uv sync  # Fast dependency installation (30 seconds)

2. Start Processing Videos

# Start the API server
uv run python api_server.py
# Note the API key printed on first startup - you'll need it below

# Process your first video (in another terminal)
curl -X POST "http://localhost:18011/api/v1/jobs/" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "video=@your_video.mp4" \
  -F "selected_pipelines=person,face,scene,audio"

# Check results at http://localhost:18011/docs
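
The same request can be scripted for research workflows. Below is a minimal Python sketch using the requests library, assuming the server is running locally on port 18011 as above. The job-submission endpoint and selected_pipelines field mirror the curl example; the response fields and the GET /api/v1/jobs/{job_id} status endpoint are assumptions, so confirm the actual schema in the interactive docs at /docs.

import time
import requests

API_URL = "http://localhost:18011/api/v1"
API_KEY = "YOUR_API_KEY"  # printed by api_server.py on first startup
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit a video for processing (same endpoint as the curl example above)
with open("your_video.mp4", "rb") as f:
    response = requests.post(
        f"{API_URL}/jobs/",
        headers=HEADERS,
        files={"video": f},
        data={"selected_pipelines": "person,face,scene,audio"},
    )
response.raise_for_status()
job = response.json()
print("Submitted job:", job)

# Poll for completion (GET /api/v1/jobs/{job_id} is assumed - check /docs)
job_id = job.get("id")
while job_id:
    status = requests.get(f"{API_URL}/jobs/{job_id}", headers=HEADERS).json()
    if status.get("status") in ("completed", "failed"):
        print("Job finished with status:", status.get("status"))
        break
    time.sleep(5)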

3. Visualize Results

# Install the companion web viewer
git clone https://github.com/InfantLab/video-annotation-viewer.git
cd video-annotation-viewer
npm install
npm run dev

# Open http://localhost:3000 and load your VideoAnnotator results

Note: Node.js and npm must be installed first. On macOS with Homebrew: brew install node

🎉 That's it! You now have both automated video processing and interactive visualization.

🧠 AI Pipelines & Capabilities

Authoritative pipeline metadata (names, tasks, modalities, capabilities) is generated from the registry:

  • Pipeline specification table: docs/pipelines_spec.md (auto-generated; do not edit by hand)
  • Emotion output format spec: docs/specs/emotion_output_format.md

Additional Specs:

  • Output Naming Conventions: docs/specs/output_naming_conventions.md (stable patterns for downstream tooling)
  • Emotion Validator Utility: src/validation/emotion_validator.py (programmatic validation of .emotion.json files)
  • CLI Validation: videoannotator validate-emotion path/to/file.emotion.json returns a non-zero exit code on failure

Client tools (e.g. the Video Annotation Viewer) should rely on those sources or the /api/v1/pipelines endpoint rather than hard-coding pipeline assumptions; a minimal runtime query of that endpoint is sketched below.
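
For example, a client can discover the available pipelines at runtime instead of assuming them. A minimal sketch, assuming the server from the quick start is running locally; the exact response structure of /api/v1/pipelines is not documented here, so the sketch simply prints the payload rather than assuming specific field names.

import requests

API_URL = "http://localhost:18011/api/v1"
API_KEY = "YOUR_API_KEY"  # printed by the API server on first startup

# Query the registry-backed pipelines endpoint instead of hard-coding names
resp = requests.get(
    f"{API_URL}/pipelines",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()

# The payload should mirror docs/pipelines_spec.md; print it rather than
# assuming specific field names
print(resp.json())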

Person Tracking Pipeline

  • Technology: YOLO11 + ByteTrack multi-object tracking
  • Outputs: Bounding boxes, pose keypoints, persistent person IDs
  • Use cases: Movement analysis, social interaction tracking, activity recognition

Face Analysis Pipeline

  • Technology: OpenFace 3.0, LAION Face, OpenCV backends
  • Outputs: 68-point landmarks, emotions, action units, gaze direction, head pose
  • Use cases: Emotional analysis, attention tracking, facial expression studies

Scene Detection Pipeline

  • Technology: PySceneDetect + CLIP environment classification
  • Outputs: Scene boundaries, environment labels, temporal segmentation
  • Use cases: Context analysis, setting classification, behavioral context

Audio Processing Pipeline

  • Technology: OpenAI Whisper + pyannote speaker diarization
  • Outputs: Speech transcripts, speaker identification, voice emotions
  • Use cases: Conversation analysis, language development, vocal behavior

💡 Why VideoAnnotator?

🎯 Built for Researchers

  • No coding required - Web interface and REST API
  • Standardized outputs - JSON formats compatible with analysis tools
  • Reproducible results - Version-controlled processing pipelines
  • Batch processing - Handle multiple videos efficiently

🔬 Research-Grade Accuracy

  • State-of-the-art models - YOLO11, OpenFace 3.0, Whisper
  • Validated pipelines - Tested on developmental psychology datasets
  • Comprehensive metrics - Confidence scores, validation tools
  • Flexible configuration - Adjust parameters for your research needs

⚡ Production Ready

  • Fast processing - GPU acceleration, optimized pipelines
  • Scalable architecture - Docker containers, API-first design
  • Cross-platform - Windows, macOS, Linux support
  • Enterprise features - Authentication, logging, monitoring

🔒 Privacy & Data Protection

  • 100% Local Processing - All analysis runs on your hardware, no cloud dependencies
  • No Data Transmission - Videos and results never leave your infrastructure
  • GDPR Compliant - Full control over sensitive research data
  • Foundation Model Free - No external API calls to commercial AI services
  • Research Ethics Ready - Designed for studies requiring strict data confidentiality

📊 Example Output

VideoAnnotator generates rich, structured data like this:

{
  "person_tracking": [
    {
      "timestamp": 12.34,
      "person_id": 1,
      "bbox": [0.2, 0.3, 0.4, 0.5],
      "pose_keypoints": [...],
      "confidence": 0.87
    }
  ],
  "face_analysis": [
    {
      "timestamp": 12.34,
      "person_id": 1,
      "emotion": "happy",
      "confidence": 0.91,
      "facial_landmarks": [...],
      "gaze_direction": [0.1, -0.2]
    }
  ],
  "scene_detection": [
    {
      "start_time": 0.0,
      "end_time": 45.6,
      "scene_type": "living_room",
      "confidence": 0.95
    }
  ],
  "audio_analysis": [
    {
      "start_time": 1.2,
      "end_time": 3.8,
      "speaker": "adult",
      "transcript": "Look at this toy!",
      "emotion": "excited"
    }
  ]
}
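
Because the output is plain JSON, it loads directly into standard analysis tools. A minimal pandas sketch, assuming the structure above has been saved to a file named results.json (the file name is illustrative; see docs/specs/output_naming_conventions.md for the actual naming patterns):

import json
import pandas as pd

# Load a VideoAnnotator result file (file name is illustrative)
with open("results.json") as f:
    results = json.load(f)

# Each top-level key in the example above is a list of records,
# which maps directly onto a pandas DataFrame
tracking = pd.DataFrame(results["person_tracking"])
faces = pd.DataFrame(results["face_analysis"])

print("Unique people tracked:", tracking["person_id"].nunique())
print("Emotion counts:")
print(faces["emotion"].value_counts())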

🔗 Integration & Export

Direct Integration

  • Python: Import JSON data into pandas, matplotlib, seaborn
  • R: Load data with jsonlite, analyze with tidyverse
  • MATLAB: Process JSON with built-in functions

Annotation Tools

  • CVAT: Computer Vision Annotation Tool integration
  • LabelStudio: Machine learning annotation platform
  • ELAN: Linguistic annotation software compatibility

Analysis Platforms

  • Video Annotation Viewer: Interactive web-based analysis (recommended)
  • Custom dashboards: Build with our REST API
  • Jupyter notebooks: Examples included in repository

๐Ÿ› ๏ธ Installation & Usage

Method 1: Direct Installation (Recommended)

# Modern Python environment
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/InfantLab/VideoAnnotator.git
cd VideoAnnotator
uv sync

# Start processing
uv run python api_server.py

Method 2: Docker (Production)

# CPU version (lightweight)
docker build -f Dockerfile.cpu -t videoannotator:cpu .
docker run -p 18011:8000 videoannotator:cpu

# GPU version (faster processing)
docker build -f Dockerfile.gpu -t videoannotator:gpu .
docker run -p 18011:8000 --gpus all videoannotator:gpu

# Development version (pre-cached models)
docker build -f Dockerfile.dev -t videoannotator:dev .
docker run -p 18011:8000 --gpus all videoannotator:dev
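
Whichever image you run, the API is exposed on host port 18011. A small readiness check before submitting jobs, assuming only that the interactive docs page at /docs (mentioned in the quick start) is served once the server is up:

import time
import requests

# Poll the mapped port until the containerized API responds,
# then it is safe to start submitting jobs
for _ in range(60):
    try:
        if requests.get("http://localhost:18011/docs", timeout=2).status_code == 200:
            print("VideoAnnotator API is ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)
else:
    raise RuntimeError("VideoAnnotator API did not become ready in time")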

Method 3: Research Platform Integration

# Python API for custom workflows
from videoannotator import VideoAnnotator

annotator = VideoAnnotator()
results = annotator.process("video.mp4", pipelines=["person", "face"])

# Analyze results
import pandas as pd
df = pd.DataFrame(results['person_tracking'])
print(f"Detected {df['person_id'].nunique()} unique people")

📚 Documentation & Resources

  • 📖 Interactive Docs: Complete documentation with examples
  • 🎮 Live API Testing: Interactive API when the server is running
  • 🚀 Getting Started Guide: Step-by-step setup and first video
  • 🔧 Installation Guide: Detailed installation instructions
  • ⚙️ Pipeline Specifications: Technical pipeline documentation
  • 🎯 Demo Commands: Example commands and workflows

👥 Research Applications

Developmental Psychology

  • Parent-child interaction studies with synchronized behavioral coding
  • Social development research with multi-person tracking
  • Language acquisition studies with audio-visual alignment

Clinical Research

  • Autism spectrum behavioral analysis with facial expression tracking
  • Therapy session analysis with emotion and engagement metrics
  • Developmental assessment with standardized behavioral measures

Human-Computer Interaction

  • User experience research with attention and emotion tracking
  • Interface evaluation with gaze direction and facial feedback
  • Accessibility studies with comprehensive behavioral data

๐Ÿ—๏ธ Architecture & Performance

Modern Technology Stack

  • FastAPI - High-performance REST API with automatic documentation
  • YOLO11 - State-of-the-art object detection and pose estimation
  • OpenFace 3.0 - Comprehensive facial behavior analysis
  • Whisper - Robust speech recognition and transcription
  • PyTorch - GPU-accelerated machine learning inference

Performance Characteristics

  • Processing speed: ~2-4x real-time with GPU acceleration
  • Memory usage: 4-8GB RAM for typical videos
  • Storage: ~100MB output per hour of video
  • Accuracy: 90%+ for person detection, 85%+ for emotion recognition

Scalability

  • Batch processing: Handle multiple videos simultaneously (see the API sketch below)
  • Container deployment: Docker support for cloud platforms
  • Distributed processing: API-first design for microservices
  • Resource optimization: CPU and GPU variants available
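
As referenced in the batch-processing point above, multiple videos can be queued through the same job endpoint. A minimal sketch, assuming a local folder of MP4 files and the server from the quick start; the folder path and pipeline selection are illustrative:

import glob
import requests

API_URL = "http://localhost:18011/api/v1"
API_KEY = "YOUR_API_KEY"  # printed by the API server on first startup
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit every video in a folder as a separate job (folder path is illustrative)
for path in glob.glob("videos/*.mp4"):
    with open(path, "rb") as f:
        r = requests.post(
            f"{API_URL}/jobs/",
            headers=HEADERS,
            files={"video": f},
            data={"selected_pipelines": "person,face,scene,audio"},
        )
    r.raise_for_status()
    print(f"Submitted {path}:", r.json())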

๐Ÿค Contributing & Community

Getting Involved

Development

  • Code quality: 83% test coverage, modern Python practices
  • Documentation: Comprehensive guides and API documentation
  • CI/CD: Automated testing and deployment pipelines
  • Standards: Following research software engineering best practices

📄 Citation & License

Citation

If you use VideoAnnotator in your research, please cite:

Addyman, C. (2025). VideoAnnotator: Automated video analysis toolkit for human interaction research.
Zenodo. https://doi.org/10.5281/zenodo.16961751


License

MIT License - Full terms in LICENSE

Funding & Support

  • The Global Parenting Initiative (Funded by The LEGO Foundation)

๐Ÿ™ Acknowledgments

Research Team

Open Source Dependencies

Built with and grateful to:

Development Tools & AI Assistance

Development was greatly helped by:

This project demonstrates how AI-assisted development can accelerate research software creation while maintaining code quality and comprehensive testing.


🎥 Ready to start analyzing videos? Follow the 60-second setup above!
