
Automated workflow-agent factory for constructing Affective Computing datasets (e.g., Multimodal Emotion Recognition and Reasoning, Sentiment Analysis).


👉🏻 MER-Factory 👈🏻

中文  |   English  


Your automated factory for constructing Multimodal Emotion Recognition and Reasoning (MERR) datasets.

📖 Documentation

DOI

🚀 Project Roadmap

MER-Factory is under active development, with new features added regularly. Check out our roadmap; contributions are welcome!


Pipeline Structure

graph TD;
        __start__([<p>__start__</p>]):::first
        setup_paths(setup_paths)
        handle_error(handle_error)
        run_au_extraction(run_au_extraction)
        save_au_results(save_au_results)
        generate_audio_description(generate_audio_description)
        save_audio_results(save_audio_results)
        generate_video_description(generate_video_description)
        save_video_results(save_video_results)
        extract_full_features(extract_full_features)
        filter_by_emotion(filter_by_emotion)
        find_peak_frame(find_peak_frame)
        generate_peak_frame_visual_description(generate_peak_frame_visual_description)
        generate_peak_frame_au_description(generate_peak_frame_au_description)
        synthesize_summary(synthesize_summary)
        save_mer_results(save_mer_results)
        run_image_analysis(run_image_analysis)
        synthesize_image_summary(synthesize_image_summary)
        save_image_results(save_image_results)
        __end__([<p>__end__</p>]):::last
        __start__ --> setup_paths;
        extract_full_features --> filter_by_emotion;
        filter_by_emotion -.-> find_peak_frame;
        filter_by_emotion -.-> handle_error;
        filter_by_emotion -.-> save_au_results;
        find_peak_frame --> generate_audio_description;
        generate_audio_description -.-> generate_video_description;
        generate_audio_description -.-> handle_error;
        generate_audio_description -.-> save_audio_results;
        generate_peak_frame_au_description --> synthesize_summary;
        generate_peak_frame_visual_description --> generate_peak_frame_au_description;
        generate_video_description -.-> generate_peak_frame_visual_description;
        generate_video_description -.-> handle_error;
        generate_video_description -.-> save_video_results;
        run_au_extraction --> filter_by_emotion;
        run_image_analysis --> synthesize_image_summary;
        setup_paths -. &nbsp;full_pipeline&nbsp; .-> extract_full_features;
        setup_paths -. &nbsp;audio_pipeline&nbsp; .-> generate_audio_description;
        setup_paths -. &nbsp;video_pipeline&nbsp; .-> generate_video_description;
        setup_paths -.-> handle_error;
        setup_paths -. &nbsp;au_pipeline&nbsp; .-> run_au_extraction;
        setup_paths -. &nbsp;image_pipeline&nbsp; .-> run_image_analysis;
        synthesize_image_summary --> save_image_results;
        synthesize_summary --> save_mer_results;
        handle_error --> __end__;
        save_au_results --> __end__;
        save_audio_results --> __end__;
        save_image_results --> __end__;
        save_mer_results --> __end__;
        save_video_results --> __end__;
        classDef default fill:#f2f0ff,line-height:1.2
        classDef first fill-opacity:0
        classDef last fill:#bfb6fc

Features

  • Action Unit (AU) Pipeline: Extracts facial Action Units (AUs) and translates them into descriptive natural language.
  • Audio Analysis Pipeline: Extracts audio, transcribes speech, and performs detailed tonal analysis.
  • Video Analysis Pipeline: Generates comprehensive descriptions of video content and context.
  • Image Analysis Pipeline: Provides end-to-end emotion recognition for static images, complete with visual descriptions and emotional synthesis.
  • Full MER Pipeline: An end-to-end multimodal pipeline that identifies peak emotional moments, analyzes all modalities (visual, audio, facial), and synthesizes a holistic emotional reasoning summary.

Check out example outputs here:

Installation

📚 Please visit the project documentation for detailed installation and usage instructions.

Usage

Basic Command Structure

python main.py [INPUT_PATH] [OUTPUT_DIR] [OPTIONS]

Examples

# Show all supported args.
python main.py --help

# Full MER pipeline with Gemini (default)
python main.py path_to_video/ output/ --type MER --silent --threshold 0.8

# Using Sentiment Analysis task instead of MERR
python main.py path_to_video/ output/ --type MER --task "Sentiment Analysis" --silent

# Using ChatGPT models
python main.py path_to_video/ output/ --type MER --chatgpt-model gpt-4o --silent

# Using local Ollama models
python main.py path_to_video/ output/ --type MER --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent

# Using Hugging Face model
python main.py path_to_video/ output/ --type MER --huggingface-model google/gemma-3n-E4B-it --silent

# Process images instead of videos
python main.py ./images ./output --type MER

Note: Run ollama pull llama3.2 (and any other models you plan to use) before selecting an Ollama model. Ollama does not currently support video analysis.

Dashboard for Data Curation and Hyperparameter Tuning

We provide an interactive dashboard webpage to facilitate data curation and hyperparameter tuning. The dashboard allows you to test different prompts, save and run configurations, and rate the generated data.

To launch the dashboard, use the following command:

python dashboard.py

Command Line Options

| Option | Short | Description | Default |
|---|---|---|---|
| --type | -t | Processing type (AU, audio, video, image, MER) | MER |
| --task | -tk | Analysis task type (MERR, Sentiment Analysis) | MERR |
| --label-file | -l | Path to a CSV file with 'name' and 'label' columns (optional, for ground-truth labels) | None |
| --threshold | -th | Emotion detection threshold (0.0-5.0) | 0.8 |
| --peak_dis | -pd | Steps between peak frame detection (min 8) | 15 |
| --silent | -s | Run with minimal output | False |
| --cache | -ca | Reuse existing audio/video/AU results from previous pipeline runs | False |
| --concurrency | -c | Concurrent files for async processing (min 1) | 4 |
| --ollama-vision-model | -ovm | Ollama vision model name | None |
| --ollama-text-model | -otm | Ollama text model name | None |
| --chatgpt-model | -cgm | ChatGPT model name (e.g., gpt-4o) | None |
| --huggingface-model | -hfm | Hugging Face model ID | None |
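
For reference, the --label-file option expects a CSV with 'name' and 'label' columns. A minimal sketch of how such a file could be generated is shown below; the sample names and emotion labels are hypothetical, so adapt them to your own data.

# Minimal sketch of a ground-truth label file for --label-file.
# The 'name'/'label' columns follow the option table above; the sample
# names and labels here are hypothetical.
import csv

rows = [
    {"name": "sample_00001", "label": "happy"},
    {"name": "sample_00002", "label": "sad"},
]

with open("labels.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "label"])
    writer.writeheader()
    writer.writerows(rows)

The resulting file can then be passed in, e.g., python main.py path_to_video/ output/ --type MER --label-file labels.csv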

Processing Types

1. Action Unit (AU) Extraction

Extracts facial Action Units and generates natural language descriptions:

python main.py video.mp4 output/ --type AU
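
To illustrate the AU-to-text idea, here is a minimal sketch based on standard FACS action unit names; it is not MER-Factory's actual mapping or prompt, which produces richer descriptions.

# Illustrative only: turn detected FACS Action Units into a short description.
AU_NAMES = {
    "AU01": "inner brow raiser",
    "AU04": "brow lowerer",
    "AU06": "cheek raiser",
    "AU12": "lip corner puller",
    "AU15": "lip corner depressor",
}

def describe_aus(active_aus):
    """Convert a list of active AU codes into a readable phrase."""
    names = [AU_NAMES.get(au, au) for au in active_aus]
    return "Active facial movements: " + ", ".join(names) + "."

print(describe_aus(["AU06", "AU12"]))  # cheek raiser + lip corner puller, typical of a smile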

2. Audio Analysis

Extracts audio, transcribes speech, and analyzes tone:

python main.py video.mp4 output/ --type audio

3. Video Analysis

Generates comprehensive video content descriptions:

python main.py video.mp4 output/ --type video

4. Image Analysis

Runs the pipeline with image input:

python main.py ./images ./output --type image
# Note: Image files will automatically use image pipeline regardless of --type setting

5. Full MER Pipeline (Default)

Runs the complete multimodal emotion recognition pipeline:

python main.py video.mp4 output/ --type MER
# or simply:
python main.py video.mp4 output/

Task Types

The --task option allows you to choose between different analysis tasks:

1. Emotion Recognition (Default)

Performs detailed emotion analysis with granular emotion categories:

python main.py video.mp4 output/ --task "Emotion Recognition"
# or simply omit the --task option since it's the default
python main.py video.mp4 output/

2. Sentiment Analysis

Performs sentiment-focused analysis (positive, negative, neutral):

python main.py video.mp4 output/ --task "Sentiment Analysis"

Export the Dataset

To export datasets for curation or training, use the following commands:

For Dataset Curation

python export.py --output_folder <output_folder> --file_type <file_type> --export_path <export_path> --export_csv

For Training

python export.py --input_csv path/to/csv_file.csv --export_format sharegpt
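
For context, --export_format sharegpt targets the widely used ShareGPT conversation layout. The sketch below shows that general structure with hypothetical prompt/response text; the exact fields MER-Factory writes may differ.

# Sketch of the generic ShareGPT conversation layout consumed by many
# instruction-tuning toolkits. The prompt/response text is hypothetical.
import json

record = {
    "conversations": [
        {"from": "human", "value": "Describe the emotional state shown in this clip."},
        {"from": "gpt", "value": "The speaker appears happy: smiling, upbeat tone, positive wording."},
    ]
}

with open("train_sharegpt.json", "w", encoding="utf-8") as f:
    json.dump([record], f, indent=2, ensure_ascii=False)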

Model Support

The tool supports four types of models:

  1. Google Gemini (default): Requires GOOGLE_API_KEY in .env
  2. OpenAI ChatGPT: Requires OPENAI_API_KEY in .env, specify with --chatgpt-model
  3. Ollama: Local models, specify with --ollama-vision-model and --ollama-text-model
  4. Hugging Face: Currently supports multimodal models like google/gemma-3n-E4B-it

Note: If using Hugging Face models, concurrency is automatically set to 1 for synchronous processing.
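
As a quick sanity check of the .env setup described above, the following minimal sketch (assuming python-dotenv is installed) prints whether each key is configured:

# Minimal sketch: verify that the expected API keys are present in .env.
# Requires python-dotenv (pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

for key in ("GOOGLE_API_KEY", "OPENAI_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")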

Model Recommendations

When to Use Ollama

Recommended for: Image analysis, Action Unit analysis, text processing, and simple audio transcription tasks.

Benefits:

  • Async support: Ollama supports asynchronous calling, making it ideal for processing large datasets efficiently
  • Local processing: No API costs or rate limits
  • Wide model selection: Visit ollama.com to explore available models
  • Privacy: All processing happens locally

Example usage:

# Process images with Ollama
python main.py ./images ./output --type image --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent

# AU extraction with Ollama
python main.py video.mp4 output/ --type AU --ollama-text-model llama3.2 --silent
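
To make the async benefit concrete, here is a generic bounded-concurrency sketch (illustrative only, not MER-Factory's internal code); the semaphore limit plays the same role as the --concurrency option:

# Illustrative sketch of bounded-concurrency async processing, the pattern
# that makes local Ollama models efficient on large datasets.
import asyncio

async def process_file(path: str) -> str:
    # Placeholder for one model call per file (e.g., an async request to Ollama).
    await asyncio.sleep(0.1)
    return f"processed {path}"

async def process_all(paths, concurrency: int = 4):
    sem = asyncio.Semaphore(concurrency)  # analogous to --concurrency

    async def worker(path):
        async with sem:
            return await process_file(path)

    return await asyncio.gather(*(worker(p) for p in paths))

results = asyncio.run(process_all([f"clip_{i}.mp4" for i in range(10)]))
print(results[:2])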

When to Use ChatGPT/Gemini

Recommended for: Advanced video analysis, complex multimodal reasoning, and high-quality content generation.

Benefits:

  • State-of-the-art performance: Latest GPT-4o and Gemini models offer superior reasoning capabilities
  • Advanced video understanding: Better support for complex video analysis and temporal reasoning
  • High-quality outputs: More nuanced and detailed emotion recognition and reasoning
  • Robust multimodal integration: Excellent performance across text, image, and video modalities

Example usage:

python main.py video.mp4 output/ --type MER --chatgpt-model gpt-4o --silent

python main.py video.mp4 output/ --type MER --silent

Trade-offs: API costs and rate limits apply, but these models typically deliver the highest-quality results for complex emotion reasoning tasks.

When to Use Hugging Face Models

Recommended for: When you need the latest state-of-the-art models or specific features not available in Ollama.

Custom Model Integration: If you want to use the latest HF models or features that Ollama doesn't support:

  1. Option 1 - Implement it yourself: Navigate to mer_factory/models/hf_models/__init__.py to register your own model and implement the required functions, following our existing patterns.

  2. Option 2 - Request support: Open an issue on our repository to let us know which model you'd like us to support, and we'll consider adding it.

Currently supported models: google/gemma-3n-E4B-it and others listed in the HF models directory.
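
As a purely hypothetical illustration of the kind of wrapper you might register, the sketch below loads a multimodal model through the generic transformers pipeline API; the real registration interface is defined in mer_factory/models/hf_models/__init__.py and may differ.

# Hypothetical sketch only -- follow the existing patterns in
# mer_factory/models/hf_models/__init__.py for the real interface.
from transformers import pipeline  # assumes a recent transformers release

class GemmaWrapper:
    """Illustrative wrapper around a Hugging Face multimodal chat model."""

    def __init__(self, model_id: str = "google/gemma-3n-E4B-it"):
        # "image-text-to-text" is the transformers pipeline task for multimodal
        # chat models; very new models need a correspondingly new transformers.
        self.pipe = pipeline("image-text-to-text", model=model_id)

    def describe_image(self, image_source: str, prompt: str) -> str:
        # image_source may be a URL or local path, depending on your setup.
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "url": image_source},
                {"type": "text", "text": prompt},
            ],
        }]
        out = self.pipe(text=messages, max_new_tokens=256, return_full_text=False)
        return out[0]["generated_text"]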

Citation

If you find MER-Factory useful in your research or project, please consider giving us a ⭐! Your support helps us grow and continue improving.

Additionally, if you use MER-Factory in your work, please consider citing us with the following BibTeX entries:

@software{Lin_MER-Factory_2025,
  author = {Lin, Yuxiang and Zheng, Shunchao},
  doi = {10.5281/zenodo.15847351},
  license = {MIT},
  month = {7},
  title = {{MER-Factory}},
  url = {https://github.com/Lum1104/MER-Factory},
  version = {0.1.0},
  year = {2025}
}

@inproceedings{NEURIPS2024_c7f43ada,
  author = {Cheng, Zebang and Cheng, Zhi-Qi and He, Jun-Yan and Wang, Kai and Lin, Yuxiang and Lian, Zheng and Peng, Xiaojiang and Hauptmann, Alexander},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
  pages = {110805--110853},
  publisher = {Curran Associates, Inc.},
  title = {Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning},
  url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/c7f43ada17acc234f568dc66da527418-Paper-Conference.pdf},
  volume = {37},
  year = {2024}
}
