Demo video: RynnEC_demo.mp4
- [2025.08.08] 🔥🔥 Released our RynnEC-2B model, RynnEC-Bench, and training code.
RynnEC is a video multi-modal large language model (MLLM) specifically designed for embodied cognition tasks.
Basic Dependencies:
- Python >= 3.10
- PyTorch >= 2.4.0
- CUDA Version >= 11.8
- transformers >= 4.46.3
Install required packages:
git clone https://github.com/alibaba-damo-academy/RynnEC
cd RynnEC
pip install -e .
pip install flash-attn --no-build-isolation
| Model | Base Model | HF Link |
| --- | --- | --- |
| RynnEC-2B | Qwen2.5-1.5B | Alibaba-DAMO-Academy/RynnEC-2B |
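To fetch the checkpoint ahead of time (for offline use, for example), a minimal download sketch is shown below; it assumes `huggingface_hub` is installed and the `local_dir` value is only an example.

```python
# Minimal sketch: download the RynnEC-2B checkpoint from the Hugging Face Hub.
# Assumes huggingface_hub is installed; the local_dir value is only an example.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/RynnEC-2B",
    local_dir="./checkpoints/RynnEC-2B",  # adjust to your preferred location
)
print(f"Checkpoint files are in: {local_path}")
```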
Check out the inference notebooks that demonstrate how to use RynnEC for various applications, such as basic object understanding, spatial understanding, and video object segmentation in egocentric scenarios.
| Notebooks | Description |
| --- | --- |
| Object Understanding | Demonstrates how to use RynnEC for general object recognition and understanding |
| Spatial Understanding | Demonstrates how to use RynnEC for spatial understanding with 3D awareness |
| Video Object Segmentation | Demonstrates how to use RynnEC for video object segmentation with text-based instructions |
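For quick programmatic experiments outside the notebooks, a minimal loading sketch is shown below. It assumes the Hub checkpoint ships its custom modeling code (hence `trust_remote_code=True`); the notebooks above remain the reference for building video, region, and segmentation prompts.

```python
# Minimal loading sketch (assumes the checkpoint provides custom code on the Hub).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "Alibaba-DAMO-Academy/RynnEC-2B"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# See the notebooks above for how to build video/object prompts and run generation.
```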
It is highly recommended to try our online demo first.
Otherwise, you can launch a Gradio app locally:
python inference/gradio_demo.py --model-path Alibaba-DAMO-Academy/RynnEC-2B
options:
  --model-path MODEL_PATH, --model_path MODEL_PATH
        Path to the model checkpoint or a Hugging Face repo ID (e.g. Alibaba-DAMO-Academy/RynnEC-2B).
  --port SERVER_PORT, --server_port SERVER_PORT
        Optional. Port of the model server.
RynnEC-Bench evaluates models in two key areas, object cognition and spatial cognition, covering a total of 22 embodied cognitive abilities.
For more details, please refer to RynnEC-Bench.
To use our training code, please organize the annotation files in the following format:
[
// image QA
{
"image": ["images/xxx.jpg"],
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
...
]
},
// Video QA
{
"video": ["videos/xxx.mp4"],
"conversations": [
{
"from": "human",
"value": "<video>\nWhat are the main activities that take place in the video?"
},
{
"from": "gpt",
"value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
},
...
]
},
// Video-object QA (mp4 file)
{
"video": ["videos/xxx.mp4"],
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is the color of <region>?"
},
{
"from": "gpt",
"value": "The color is red."
}
],
"masks": [
{
"frame id": {"size": [1080, 1920], "counts": "mask rle"},
"frame id": {"size": [1080, 1920], "counts": "mask rle"}
}
],
},
// Video-object QA (image files)
{
"video": ["videos/xxx/0.png", "videos/xxx/1.png", "videos/xxx/2.png", ...],
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is the color of <region>?"
},
{
"from": "gpt",
"value": "The color is red."
}
],
"masks": [
{
"frame id": {"size": [1080, 1920], "counts": "mask rle"},
"frame id": {"size": [1080, 1920], "counts": "mask rle"}
}
],
"mask_ids": ["the frame index of each mask in the video list"],
"timestamps": ["timestamp of video frames"],
},
// Image-object QA
{
"video": ["images/xxx.jpg"],
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is the relationshipw between object1<region> and object2<region>?"
},
{
"from": "gpt",
"value": "They are side by side."
}
],
"masks": [
{"size": [1080, 1920], "counts": "mask rle"},
{"size": [1080, 1920], "counts": "mask rle"}
],
},
]
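The "masks" entries use COCO-style run-length encoding (RLE). A minimal sketch for producing one such entry from a binary mask with pycocotools is shown below; the array shape, toy region, and frame id "0" are illustrative only.

```python
# Sketch: encode a binary segmentation mask as COCO RLE for the "masks" field.
# Assumes pycocotools and numpy are installed; mask contents and frame id are illustrative.
import numpy as np
from pycocotools import mask as mask_utils

binary_mask = np.zeros((1080, 1920), dtype=np.uint8)  # 1 inside the object, 0 elsewhere
binary_mask[100:300, 200:500] = 1                     # toy rectangular region

rle = mask_utils.encode(np.asfortranarray(binary_mask))
rle["counts"] = rle["counts"].decode("utf-8")         # bytes -> str so it is JSON-serializable

# One element of the "masks" list for a video-object QA sample, keyed by frame id:
mask_entry = {"0": {"size": rle["size"], "counts": rle["counts"]}}
```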
We provide template scripts in scripts/train for all stages. You can modify the variables in them to fit your data and model settings. For example:
--data_folder ./datasets \
--data_path stage4.json \
--model_path Alibaba-DAMO-Academy/RynnEC-2B \
--vision_encoder DAMO-NLP-SG/SigLIP-NaViT \
Now you can start training with your training scripts:
# stage1
bash scripts/train/stage1.sh
# stage2
bash scripts/train/stage2.sh
...
If you use LoRA during training, use the following command to merge the LoRA weights after training:
python tools/merge_lora_weights.py --model_path checkpoints/stage4/checkpoint-xxx --save_path checkpoints/stage4_merge
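Conceptually, the merge folds the low-rank adapter deltas back into the base weights. A rough peft-based sketch of that idea is shown below; tools/merge_lora_weights.py remains the supported path, and the base-model path used here is an illustrative assumption.

```python
# Rough peft-based sketch of LoRA merging; tools/merge_lora_weights.py is the supported path.
# The base-model path and the use of AutoModelForCausalLM here are illustrative assumptions.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("checkpoints/base_model", trust_remote_code=True)
lora = PeftModel.from_pretrained(base, "checkpoints/stage4/checkpoint-xxx")
merged = lora.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("checkpoints/stage4_merge")
```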
Please prepare the datasets and question files used for evaluation here.
# for object property cognition
bash scripts/eval/eval_object_property.sh
# for object segmentation
bash scripts/eval/eval_seg.sh
# for spatial cognition
bash scripts/eval/eval_spatial.sh
Note:
Fill in the API_KEY and URL in metrics.py first.
If you find RynnEC useful for your research and applications, please cite using this BibTeX:
Our RynnEC is built on top of VideoLLaMA3. We also learned a lot from the implementations of VideoRefer, Sa2VA, and Qwen2VL. If your work is used in RynnEC but not mentioned in either this repo or the technical report, feel free to let us know ❤️.
This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model Licenses of Qwen, Terms of Use of the data generated by OpenAI and Gemini, and Privacy Practices of ShareGPT. Please get in touch with us if you find any potential violations.