RynnEC: Bringing MLLMs into Embodied World

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏

[Badges: License · Hugging Face checkpoints · Hugging Face demo · YouTube video · ModelScope checkpoints · arXiv]

[Demo video: RynnEC_demo.mp4]

📰 News

  • [2025.08.08] 🔥🔥 Released our RynnEC-2B model, RynnEC-Bench, and training code.

🌟 Introduction

RynnEC is a video multi-modal large language model (MLLM) specifically designed for embodied cognition tasks.

πŸ› οΈ Requirements and Installation

Basic Dependencies:

  • Python >= 3.10
  • PyTorch >= 2.4.0
  • CUDA Version >= 11.8
  • transformers >= 4.46.3

Install required packages:

git clone https://github.com/alibaba-damo-academy/RynnEC
cd RynnEC
pip install -e .
pip install flash-attn --no-build-isolation
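
After installing, a quick sanity check can confirm that the core dependencies listed above are in place. This is a minimal sketch, not a script shipped with the repository; the flash_attn import corresponds to the flash-attn package installed above:

# sanity check for the basic dependencies (hypothetical helper, not part of RynnEC)
import torch
import transformers

print("PyTorch:", torch.__version__)                 # expect >= 2.4.0
print("CUDA available:", torch.cuda.is_available())  # requires CUDA >= 11.8 drivers
print("transformers:", transformers.__version__)     # expect >= 4.46.3

try:
    import flash_attn                                # installed via `pip install flash-attn`
    print("flash-attn is importable")
except ImportError:
    print("flash-attn not installed")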

🌎 Model Zoo

Model     | Base Model   | HF Link
RynnEC-2B | Qwen2.5-1.5B | Alibaba-DAMO-Academy/RynnEC-2B
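
To fetch the checkpoint locally before inference, one option is the huggingface_hub client (a minimal sketch; the local directory name is arbitrary):

# download the RynnEC-2B checkpoint from the Hugging Face Hub (sketch)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Alibaba-DAMO-Academy/RynnEC-2B",
    local_dir="checkpoints/RynnEC-2B",   # arbitrary local path
)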

CookBook

Check out the inference notebooks that demonstrate how to use RynnEC for applications such as basic object understanding, spatial understanding, and video object segmentation in an egocentric world.

Notebook                  | Description
Object Understanding      | How to use RynnEC for general object recognition and understanding
Spatial Understanding     | Using RynnEC for spatial understanding with 3D awareness
Video Object Segmentation | Using RynnEC for video object segmentation with text-based instructions

🤗 Demo

It is highly recommended to try our online demo first.

Otherwise, you can launch a Gradio app locally:

python inference/gradio_demo.py --model-path Alibaba-DAMO-Academy/RynnEC-2B

options:
  --model-path MODEL_PATH, --model_path MODEL_PATH
  	Path to the RynnEC checkpoint or its Hugging Face repo ID.
  --port SERVER_PORT, --server_port SERVER_PORT
  	Optional. Port of the model server.

πŸ•ΉοΈ RynnEC-Bench

RynnEC-Bench evaluates models in two key areas, object cognition and spatial cognition, covering a total of 22 embodied cognitive abilities.

For more details, please refer to RynnEC-Bench.

🚀 Training

Step 1: Prepare training data

To use our training code, please organize the annotation files in the following format:

[
    // image QA
    {
        "image": ["images/xxx.jpg"],
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            },
            ...
        ]
    },
    // Video QA
    {
        "video": ["videos/xxx.mp4"],
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat are the main activities that take place in the video?"
            },
            {
                "from": "gpt",
                "value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
            },
            ...
        ]
    },
    // Video-object QA (mp4 file)
    {
        "video": ["videos/xxx.mp4"],
        "conversations": [
            {
                "from": "human", 
                "value": "<video>\nWhat is the color of <region>?"
            }, 
            {
                "from": "gpt", 
                "value": "The color is red."
            }
        ],
        "masks": [
            {
                "frame id": {"size": [1080, 1920], "counts": "mask rle"},
                "frame id": {"size": [1080, 1920], "counts": "mask rle"}
            }
        ],
    },
    // Video-object QA (image files)
    {
        "video": ["videos/xxx/0.png", "videos/xxx/1.png", "videos/xxx/2.png", ...],
        "conversations": [
            {
                "from": "human", 
                "value": "<video>\nWhat is the color of <region>?"
            }, 
            {
                "from": "gpt", 
                "value": "The color is red."
            }
        ],
        "masks": [
            {
                "frame id": {"size": [1080, 1920], "counts": "mask rle"},
                "frame id": {"size": [1080, 1920], "counts": "mask rle"}
            }
        ],
        "mask_ids": ["the frame index of each mask in the video list"],
        "timestamps": ["timestamp of video frames"],
    },
    // Image-object QA
    {
        "video": ["images/xxx.jpg"],
        "conversations": [
            {
                "from": "human", 
                "value": "<video>\nWhat is the relationshipw between object1<region> and object2<region>?"
            }, 
            {
                "from": "gpt", 
                "value": "They are side by side."
            }
        ],
        "masks": [
            {"size": [1080, 1920], "counts": "mask rle"},
            {"size": [1080, 1920], "counts": "mask rle"}
        ],
    },
]
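
The "counts" field stores a run-length-encoded mask and "size" gives the frame height and width. Assuming the standard COCO RLE format, a minimal sketch of producing one such entry with pycocotools (the mask below is synthetic, for illustration only):

# build one "masks" entry from a binary mask (sketch; mask contents are synthetic)
import numpy as np
from pycocotools import mask as mask_utils

binary_mask = np.zeros((1080, 1920), dtype=np.uint8)     # H x W mask for one frame
binary_mask[100:300, 200:500] = 1                        # example object region

rle = mask_utils.encode(np.asfortranarray(binary_mask))  # {"size": [H, W], "counts": bytes}
rle["counts"] = rle["counts"].decode("utf-8")            # bytes -> str so it can be stored in JSON

entry = {"size": rle["size"], "counts": rle["counts"]}   # matches the "masks" format above
print(entry)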

Step 2: Prepare training script

We provide templates in scripts/train for all stages. You can modify the variables in them to fit your data and model settings. For example:

  --data_folder ./datasets \
  --data_path stage4.json \
  --model_path Alibaba-DAMO-Academy/RynnEC-2B \
  --vision_encoder DAMO-NLP-SG/SigLIP-NaViT \

Step 3: Start training

Now you can start training with your training scripts:

# stage1
bash scripts/train/stage1.sh
# stage2
bash scripts/train/stage2.sh
...

Step 4: Merge LoRA weights

If you used LoRA during training, merge the LoRA weights afterwards with the following command:

python tools/merge_lora_weights.py --model_path checkpoints/stage4/checkpoint-xxx --save_path checkpoints/stage4_merge

✅ Evaluation

1. RynnEC-Bench

Please prepare the datasets and question files used for evaluation here.

# for object property cognition
bash scripts/eval/eval_object_property.sh

# for object segmentation
bash scripts/eval/eval_seg.sh

# for spatial cognition
bash scripts/eval/eval_spatial.sh

Note:

Fill in API_KEY and URL in metrics.py first.

📑 Citation

If you find RynnEC useful for your research and applications, please cite using this BibTeX:

πŸ‘ Acknowledgement

Our RynnEC is built on top of VideoLLaMA3. We also learned a lot from the implementation of VideoRefer, Sa2VA, and Qwen2VL. If your work is used in RynnEC but not mentioned in either this repo or the technical report, feel free to let us know ❀️.

🔒 License

This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model Licenses of Qwen, Terms of Use of the data generated by OpenAI and Gemini, and Privacy Practices of ShareGPT. Please get in touch with us if you find any potential violations.
