Your automated factory for constructing Multimodal Emotion Recognition and Reasoning (MERR) datasets.
MER-Factory is under active development, with new features added regularly. Check our roadmap; contributions are welcome!
graph TD;
__start__([<p>__start__</p>]):::first
setup_paths(setup_paths)
handle_error(handle_error)
run_au_extraction(run_au_extraction)
save_au_results(save_au_results)
generate_audio_description(generate_audio_description)
save_audio_results(save_audio_results)
generate_video_description(generate_video_description)
save_video_results(save_video_results)
extract_full_features(extract_full_features)
filter_by_emotion(filter_by_emotion)
find_peak_frame(find_peak_frame)
generate_peak_frame_visual_description(generate_peak_frame_visual_description)
generate_peak_frame_au_description(generate_peak_frame_au_description)
synthesize_summary(synthesize_summary)
save_mer_results(save_mer_results)
run_image_analysis(run_image_analysis)
synthesize_image_summary(synthesize_image_summary)
save_image_results(save_image_results)
__end__([<p>__end__</p>]):::last
__start__ --> setup_paths;
extract_full_features --> filter_by_emotion;
filter_by_emotion -.-> find_peak_frame;
filter_by_emotion -.-> handle_error;
filter_by_emotion -.-> save_au_results;
find_peak_frame --> generate_audio_description;
generate_audio_description -.-> generate_video_description;
generate_audio_description -.-> handle_error;
generate_audio_description -.-> save_audio_results;
generate_peak_frame_au_description --> synthesize_summary;
generate_peak_frame_visual_description --> generate_peak_frame_au_description;
generate_video_description -.-> generate_peak_frame_visual_description;
generate_video_description -.-> handle_error;
generate_video_description -.-> save_video_results;
run_au_extraction --> filter_by_emotion;
run_image_analysis --> synthesize_image_summary;
setup_paths -. full_pipeline .-> extract_full_features;
setup_paths -. audio_pipeline .-> generate_audio_description;
setup_paths -. video_pipeline .-> generate_video_description;
setup_paths -.-> handle_error;
setup_paths -. au_pipeline .-> run_au_extraction;
setup_paths -. image_pipeline .-> run_image_analysis;
synthesize_image_summary --> save_image_results;
synthesize_summary --> save_mer_results;
handle_error --> __end__;
save_au_results --> __end__;
save_audio_results --> __end__;
save_image_results --> __end__;
save_mer_results --> __end__;
save_video_results --> __end__;
classDef default fill:#f2f0ff,line-height:1.2
classDef first fill-opacity:0
classDef last fill:#bfb6fc
- Action Unit (AU) Pipeline: Extracts facial Action Units (AUs) and translates them into descriptive natural language.
- Audio Analysis Pipeline: Extracts audio, transcribes speech, and performs detailed tonal analysis.
- Video Analysis Pipeline: Generates comprehensive descriptions of video content and context.
- Image Analysis Pipeline: Provides end-to-end emotion recognition for static images, complete with visual descriptions and emotional synthesis.
- Full MER Pipeline: An end-to-end multimodal pipeline that identifies peak emotional moments, analyzes all modalities (visual, audio, facial), and synthesizes a holistic emotional reasoning summary.
Check out example outputs here:
📚 Please visit project documentation for detailed installation and usage instructions.
python main.py [INPUT_PATH] [OUTPUT_DIR] [OPTIONS]
# Show all supported args.
python main.py --help
# Full MER pipeline with Gemini (default)
python main.py path_to_video/ output/ --type MER --silent --threshold 0.8
# Using Sentiment Analysis task instead of MERR
python main.py path_to_video/ output/ --type MER --task "Sentiment Analysis" --silent
# Using ChatGPT models
python main.py path_to_video/ output/ --type MER --chatgpt-model gpt-4o --silent
# Using local Ollama models
python main.py path_to_video/ output/ --type MER --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent
# Using Hugging Face model
python main.py path_to_video/ output/ --type MER --huggingface-model google/gemma-3n-E4B-it --silent
# Process images instead of videos
python main.py ./images ./output --type MER
Note: If you plan to use Ollama models, pull them locally first, as shown below. Ollama does not currently support video analysis.
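The Ollama examples above reference two model tags; pulling them ahead of time avoids a first-run download during processing (`ollama pull` is the standard Ollama CLI command for this):

# Pull the Ollama models referenced in the examples above
ollama pull llama3.2
ollama pull llava-llama3:latest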
We provide an interactive dashboard webpage to facilitate data curation and hyperparameter tuning. The dashboard allows you to test different prompts, save and run configurations, and rate the generated data.
To launch the dashboard, use the following command:
python dashboard.py
| Option | Short | Description | Default |
|---|---|---|---|
| `--type` | `-t` | Processing type (AU, audio, video, image, MER) | MER |
| `--task` | `-tk` | Analysis task type (MERR, Sentiment Analysis) | MERR |
| `--label-file` | `-l` | Path to a CSV file with 'name' and 'label' columns. Optional, for ground truth labels. | None |
| `--threshold` | `-th` | Emotion detection threshold (0.0-5.0) | 0.8 |
| `--peak_dis` | `-pd` | Steps between peak frame detection (min 8) | 15 |
| `--silent` | `-s` | Run with minimal output | False |
| `--cache` | `-ca` | Reuse existing audio/video/AU results from previous pipeline runs | False |
| `--concurrency` | `-c` | Concurrent files for async processing (min 1) | 4 |
| `--ollama-vision-model` | `-ovm` | Ollama vision model name | None |
| `--ollama-text-model` | `-otm` | Ollama text model name | None |
| `--chatgpt-model` | `-cgm` | ChatGPT model name (e.g., gpt-4o) | None |
| `--huggingface-model` | `-hfm` | Hugging Face model ID | None |
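These options can be combined in a single run. The sketch below supplies ground-truth labels, reuses cached intermediate results, and raises concurrency; the labels.csv filename and the specific values are illustrative, not recommendations:

# Illustrative combined run: ground-truth labels, cached intermediate results,
# and 8 files processed concurrently
python main.py path_to_video/ output/ --type MER --label-file labels.csv --cache --concurrency 8 --silent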
Extracts facial Action Units and generates natural language descriptions:
python main.py video.mp4 output/ --type AU
Extracts audio, transcribes speech, and analyzes tone:
python main.py video.mp4 output/ --type audio
Generates comprehensive video content descriptions:
python main.py video.mp4 output/ --type video
Runs the pipeline with image input:
python main.py ./images ./output --type image
# Note: Image files are automatically routed to the image pipeline regardless of the --type setting
Runs the complete multimodal emotion recognition pipeline:
python main.py video.mp4 output/ --type MER
# or simply:
python main.py video.mp4 output/
The `--task` option allows you to choose between different analysis tasks:
Performs detailed emotion analysis with granular emotion categories:
python main.py video.mp4 output/ --task "Emotion Recognition"
# or simply omit the --task option since it's the default
python main.py video.mp4 output/
Performs sentiment-focused analysis (positive, negative, neutral):
python main.py video.mp4 output/ --task "Sentiment Analysis"
To export datasets for curation or training, use the following commands:
python export.py --output_folder <output_folder> --file_type <file_type> --export_path <export_path> --export_csv
python export.py --input_csv path/to/csv_file.csv --export_format sharegpt
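As a concrete sketch of the two-step flow, assuming the file type matches a lowercased processing type (e.g. mer) and using placeholder paths rather than fixed project conventions:

# Step 1: export results from an output folder to CSV (paths and file type are placeholders)
python export.py --output_folder output/ --file_type mer --export_path exports/ --export_csv
# Step 2: convert the exported CSV to ShareGPT format
python export.py --input_csv exports/mer_data.csv --export_format sharegpt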
The tool supports four types of models:
- Google Gemini (default): Requires `GOOGLE_API_KEY` in `.env`
- OpenAI ChatGPT: Requires `OPENAI_API_KEY` in `.env`, specify with `--chatgpt-model`
- Ollama: Local models, specify with `--ollama-vision-model` and `--ollama-text-model`
- Hugging Face: Currently supports multimodal models like `google/gemma-3n-E4B-it`
Note: If using Hugging Face models, concurrency is automatically set to 1 for synchronous processing.
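For the API-backed providers, the keys are read from a `.env` file (typically in the project root). A minimal sketch, with placeholder values to replace with your own keys:

# .env — placeholder values, replace with your actual keys
GOOGLE_API_KEY=your_gemini_api_key
OPENAI_API_KEY=your_openai_api_key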
Recommended for: Image analysis, Action Unit analysis, text processing, and simple audio transcription tasks.
Benefits:
- ✅ Async support: Ollama supports asynchronous calling, making it ideal for processing large datasets efficiently
- ✅ Local processing: No API costs or rate limits
- ✅ Wide model selection: Visit ollama.com to explore available models
- ✅ Privacy: All processing happens locally
Example usage:
# Process images with Ollama
python main.py ./images ./output --type image --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent
# AU extraction with Ollama
python main.py video.mp4 output/ --type AU --ollama-text-model llama3.2 --silent
Recommended for: Advanced video analysis, complex multimodal reasoning, and high-quality content generation.
Benefits:
- ✅ State-of-the-art performance: Latest GPT-4o and Gemini models offer superior reasoning capabilities
- ✅ Advanced video understanding: Better support for complex video analysis and temporal reasoning
- ✅ High-quality outputs: More nuanced and detailed emotion recognition and reasoning
- ✅ Robust multimodal integration: Excellent performance across text, image, and video modalities
Example usage:
python main.py video.mp4 output/ --type MER --chatgpt-model gpt-4o --silent
python main.py video.mp4 output/ --type MER --silent
Trade-offs: API costs and rate limits, but typically provides the highest quality results for complex emotion reasoning tasks.
Recommended for: When you need the latest state-of-the-art models or specific features not available in Ollama.
Custom Model Integration: If you want to use the latest HF models or features that Ollama doesn't support:
- Option 1 - Implement yourself: Navigate to `mer_factory/models/hf_models/__init__.py` to register your own model and implement the needed functions following our existing patterns.
- Option 2 - Request support: Open an issue on our repository to let us know which model you'd like us to support, and we'll consider adding it.

Currently supported models: `google/gemma-3n-E4B-it` and others listed in the HF models directory.
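Whether you use an already supported model or one you have registered yourself, it is selected by its Hugging Face ID through `--huggingface-model`; the model ID below is purely illustrative:

# Replace the illustrative model ID with a supported or self-registered one
python main.py video.mp4 output/ --type MER --huggingface-model your-org/your-model --silent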
If you find MER-Factory useful in your research or project, please consider giving us a ⭐! Your support helps us grow and continue improving.
Additionally, if you use MER-Factory in your work, please consider citing us using the following BibTeX entries:
@software{Lin_MER-Factory_2025,
author = {Lin, Yuxiang and Zheng, Shunchao},
doi = {10.5281/zenodo.15847351},
license = {MIT},
month = {7},
title = {{MER-Factory}},
url = {https://github.com/Lum1104/MER-Factory},
version = {0.1.0},
year = {2025}
}
@inproceedings{NEURIPS2024_c7f43ada,
author = {Cheng, Zebang and Cheng, Zhi-Qi and He, Jun-Yan and Wang, Kai and Lin, Yuxiang and Lian, Zheng and Peng, Xiaojiang and Hauptmann, Alexander},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {110805--110853},
publisher = {Curran Associates, Inc.},
title = {Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/c7f43ada17acc234f568dc66da527418-Paper-Conference.pdf},
volume = {37},
year = {2024}
}