Your automated factory for constructing Multimodal Emotion Recognition and Reasoning (MERR) datasets.
MER-Factory is under active development, with new features added regularly. Check our roadmap; contributions are welcome!
graph TD;
__start__([<p>__start__</p>]):::first
setup_paths(setup_paths)
handle_error(handle_error)
run_au_extraction(run_au_extraction)
save_au_results(save_au_results)
generate_audio_description(generate_audio_description)
save_audio_results(save_audio_results)
generate_video_description(generate_video_description)
save_video_results(save_video_results)
extract_full_features(extract_full_features)
filter_by_emotion(filter_by_emotion)
find_peak_frame(find_peak_frame)
generate_peak_frame_visual_description(generate_peak_frame_visual_description)
generate_peak_frame_au_description(generate_peak_frame_au_description)
synthesize_summary(synthesize_summary)
save_mer_results(save_mer_results)
run_image_analysis(run_image_analysis)
synthesize_image_summary(synthesize_image_summary)
save_image_results(save_image_results)
__end__([<p>__end__</p>]):::last
__start__ --> setup_paths;
extract_full_features --> filter_by_emotion;
filter_by_emotion -.-> find_peak_frame;
filter_by_emotion -.-> handle_error;
filter_by_emotion -.-> save_au_results;
find_peak_frame --> generate_audio_description;
generate_audio_description -.-> generate_video_description;
generate_audio_description -.-> handle_error;
generate_audio_description -.-> save_audio_results;
generate_peak_frame_au_description --> synthesize_summary;
generate_peak_frame_visual_description --> generate_peak_frame_au_description;
generate_video_description -.-> generate_peak_frame_visual_description;
generate_video_description -.-> handle_error;
generate_video_description -.-> save_video_results;
run_au_extraction --> filter_by_emotion;
run_image_analysis --> synthesize_image_summary;
setup_paths -. full_pipeline .-> extract_full_features;
setup_paths -. audio_pipeline .-> generate_audio_description;
setup_paths -. video_pipeline .-> generate_video_description;
setup_paths -.-> handle_error;
setup_paths -. au_pipeline .-> run_au_extraction;
setup_paths -. image_pipeline .-> run_image_analysis;
synthesize_image_summary --> save_image_results;
synthesize_summary --> save_mer_results;
handle_error --> __end__;
save_au_results --> __end__;
save_audio_results --> __end__;
save_image_results --> __end__;
save_mer_results --> __end__;
save_video_results --> __end__;
classDef default fill:#f2f0ff,line-height:1.2
classDef first fill-opacity:0
classDef last fill:#bfb6fc
- Action Unit (AU) Pipeline: Extracts facial Action Units (AUs) and translates them into descriptive natural language.
- Audio Analysis Pipeline: Extracts audio, transcribes speech, and performs detailed tonal analysis.
- Video Analysis Pipeline: Generates comprehensive descriptions of video content and context.
- Image Analysis Pipeline: Provides end-to-end emotion recognition for static images, complete with visual descriptions and emotional synthesis.
- Full MER Pipeline: An end-to-end multimodal pipeline that identifies peak emotional moments, analyzes all modalities (visual, audio, facial), and synthesizes a holistic emotional reasoning summary.
Check out example outputs here:
📚 Please visit project documentation for detailed installation and usage instructions.
python main.py [INPUT_PATH] [OUTPUT_DIR] [OPTIONS]
# Show all supported args.
python main.py --help
# Full MER pipeline with Gemini (default)
python main.py path_to_video/ output/ --type MER --silent --threshold 0.8
# Using Sentiment Analysis task instead of MERR
python main.py path_to_video/ output/ --type MER --task "Sentiment Analysis" --silent
# Using ChatGPT models
python main.py path_to_video/ output/ --type MER --chatgpt-model gpt-4o --silent
# Using local Ollama models
python main.py path_to_video/ output/ --type MER --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent
# Using Hugging Face model
python main.py path_to_video/ output/ --type MER --huggingface-model google/gemma-3n-E4B-it --silent
# Process images instead of videos
python main.py ./images ./output --type MER
Note: If you plan to use Ollama models, pull them locally first, as shown below. Ollama does not currently support video analysis.
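The Ollama examples above reference two model tags; pulling them ahead of time avoids a first-run download during processing (`ollama pull` is the standard Ollama CLI command for this):

# Pull the Ollama models referenced in the examples above
ollama pull llama3.2
ollama pull llava-llama3:latest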
We provide an interactive dashboard webpage to facilitate data curation and hyperparameter tuning. The dashboard allows you to test different prompts, save and run configurations, and rate the generated data.
To launch the dashboard, use the following command:
python dashboard.py
| Option | Short | Description | Default |
|---|---|---|---|
| `--type` | `-t` | Processing type (AU, audio, video, image, MER) | MER |
| `--task` | `-tk` | Analysis task type (MERR, Sentiment Analysis) | MERR |
| `--label-file` | `-l` | Path to a CSV file with 'name' and 'label' columns. Optional, for ground truth labels. | None |
| `--threshold` | `-th` | Emotion detection threshold (0.0-5.0) | 0.8 |
| `--peak_dis` | `-pd` | Steps between peak frame detection (min 8) | 15 |
| `--silent` | `-s` | Run with minimal output | False |
| `--cache` | `-ca` | Reuse existing audio/video/AU results from previous pipeline runs | False |
| `--concurrency` | `-c` | Concurrent files for async processing (min 1) | 4 |
| `--ollama-vision-model` | `-ovm` | Ollama vision model name | None |
| `--ollama-text-model` | `-otm` | Ollama text model name | None |
| `--chatgpt-model` | `-cgm` | ChatGPT model name (e.g., gpt-4o) | None |
| `--huggingface-model` | `-hfm` | Hugging Face model ID | None |
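These options can be combined in a single run. The sketch below supplies ground-truth labels, reuses cached intermediate results, and raises concurrency; the labels.csv filename and the specific values are illustrative, not recommendations:

# Illustrative combined run: ground-truth labels, cached intermediate results,
# and 8 files processed concurrently
python main.py path_to_video/ output/ --type MER --label-file labels.csv --cache --concurrency 8 --silent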
Extracts facial Action Units and generates natural language descriptions:
python main.py video.mp4 output/ --type AU
Extracts audio, transcribes speech, and analyzes tone:
python main.py video.mp4 output/ --type audio
Generates comprehensive video content descriptions:
python main.py video.mp4 output/ --type video
Runs the pipeline with image input:
python main.py ./images ./output --type image
# Note: Image files are automatically routed to the image pipeline regardless of the --type setting
Runs the complete multimodal emotion recognition pipeline:
python main.py video.mp4 output/ --type MER
# or simply:
python main.py video.mp4 output/
The `--task` option allows you to choose between different analysis tasks:
Performs detailed emotion analysis with granular emotion categories:
python main.py video.mp4 output/ --task "Emotion Recognition"
# or simply omit the --task option since it's the default
python main.py video.mp4 output/
Performs sentiment-focused analysis (positive, negative, neutral):
python main.py video.mp4 output/ --task "Sentiment Analysis"
To export datasets for curation or training, use the following commands:
python export.py --output_folder <output_folder> --file_type <file_type> --export_path <export_path> --export_csv
python export.py --input_csv path/to/csv_file.csv --export_format sharegpt
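As a concrete sketch of the two-step flow, assuming the file type matches a lowercased processing type (e.g. mer) and using placeholder paths rather than fixed project conventions:

# Step 1: export results from an output folder to CSV (paths and file type are placeholders)
python export.py --output_folder output/ --file_type mer --export_path exports/ --export_csv
# Step 2: convert the exported CSV to ShareGPT format
python export.py --input_csv exports/mer_data.csv --export_format sharegpt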
The tool supports four types of models:
- Google Gemini (default): Requires `GOOGLE_API_KEY` in `.env`
- OpenAI ChatGPT: Requires `OPENAI_API_KEY` in `.env`, specify with `--chatgpt-model`
- Ollama: Local models, specify with `--ollama-vision-model` and `--ollama-text-model`
- Hugging Face: Currently supports multimodal models like `google/gemma-3n-E4B-it`
Note: If using Hugging Face models, concurrency is automatically set to 1 for synchronous processing.
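For the API-backed providers, the keys are read from a `.env` file (typically in the project root). A minimal sketch, with placeholder values to replace with your own keys:

# .env — placeholder values, replace with your actual keys
GOOGLE_API_KEY=your_gemini_api_key
OPENAI_API_KEY=your_openai_api_key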
Recommended for: Image analysis, Action Unit analysis, text processing, and simple audio transcription tasks.
Benefits:
- ✅ Async support: Ollama supports asynchronous calling, making it ideal for processing large datasets efficiently
- ✅ Local processing: No API costs or rate limits
- ✅ Wide model selection: Visit ollama.com to explore available models
- ✅ Privacy: All processing happens locally
Example usage:
# Process images with Ollama
python main.py ./images ./output --type image --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent
# AU extraction with Ollama
python main.py video.mp4 output/ --type AU --ollama-text-model llama3.2 --silent
Recommended for: Advanced video analysis, complex multimodal reasoning, and high-quality content generation.
Benefits:
- ✅ State-of-the-art performance: Latest GPT-4o and Gemini models offer superior reasoning capabilities
- ✅ Advanced video understanding: Better support for complex video analysis and temporal reasoning
- ✅ High-quality outputs: More nuanced and detailed emotion recognition and reasoning
- ✅ Robust multimodal integration: Excellent performance across text, image, and video modalities
Example usage:
python main.py video.mp4 output/ --type MER --chatgpt-model gpt-4o --silent
python main.py video.mp4 output/ --type MER --silent
Trade-offs: API costs and rate limits, but typically provides the highest quality results for complex emotion reasoning tasks.
Recommended for: When you need the latest state-of-the-art models or specific features not available in Ollama.
Custom Model Integration: If you want to use the latest HF models or features that Ollama doesn't support:
- Option 1 - Implement yourself: Navigate to `mer_factory/models/hf_models/__init__.py` to register your own model and implement the needed functions following our existing patterns.
- Option 2 - Request support: Open an issue on our repository to let us know which model you'd like us to support, and we'll consider adding it.

Currently supported models: `google/gemma-3n-E4B-it` and others listed in the HF models directory.
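Whether you use an already supported model or one you have registered yourself, it is selected by its Hugging Face ID through `--huggingface-model`; the model ID below is purely illustrative:

# Replace the illustrative model ID with a supported or self-registered one
python main.py video.mp4 output/ --type MER --huggingface-model your-org/your-model --silent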
If you find MER-Factory useful in your research or project, please consider giving us a ⭐! Your support helps us grow and continue improving.
Additionally, if you use MER-Factory in your work, please consider citing us using the following BibTeX entries:
@software{Lin_MER-Factory_2025,
author = {Lin, Yuxiang and Zheng, Shunchao},
doi = {10.5281/zenodo.15847351},
license = {MIT},
month = {7},
title = {{MER-Factory}},
url = {https://github.com/Lum1104/MER-Factory},
version = {0.1.0},
year = {2025}
}
@inproceedings{NEURIPS2024_c7f43ada,
author = {Cheng, Zebang and Cheng, Zhi-Qi and He, Jun-Yan and Wang, Kai and Lin, Yuxiang and Lian, Zheng and Peng, Xiaojiang and Hauptmann, Alexander},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {110805--110853},
publisher = {Curran Associates, Inc.},
title = {Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/c7f43ada17acc234f568dc66da527418-Paper-Conference.pdf},
volume = {37},
year = {2024}
}