RoboImagine: An Image-Text Conditioned, Generalized Robotic Video Generation Model Across Embodiments and Tasks
RoboImagine is an image-text conditioned diffusion model that generates long-term robotic manipulation videos, supporting multiple robot embodiments (robotic arms, mobile robots, etc.). The model achieves continuous and smooth long video generation through dynamic consistency enhancement techniques and integrates a VLM as a task completion verifier.
- Input: Text instruction + Robot type specification + 3 condition images
- Output: Long-term robotic manipulation videos (16 frames, extendable via autoregression)
- Technology: U-Net diffusion architecture + Dynamic consistency enhancement + VLM task verifier
- Image-Text Conditioned Generation: Generate robotic videos conditioned on both visual inputs and text instructions
- Cross-Embodiment Generalization: Works across different robot embodiments and manipulation tasks
- Long-Term Video Generation: Autoregressive pipeline for generating extended manipulation sequences
- VLM-Enhanced Quality: Vision-Language Model as task-completion evaluator
- Dynamic Consistency: Advanced augmentation techniques for smooth and continuous motions
RoboImagine is based on U-Net diffusion model architecture:
- Encoder: OpenCLIP text encoder + image encoder
- Diffusion Model: 3D U-Net for spatiotemporal processing
- Decoder: VAE for video frame reconstruction
- Conditioning Mechanism: Hybrid conditioning (text+image) + Image concatenation conditioning
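To make the hybrid conditioning and the autoregressive extension concrete, the loop below sketches how a long video can be rolled out clip by clip: each pass generates 16 frames from the text instruction plus three condition images, and the last generated frames become the condition images for the next pass. The `generate_long_video` helper and the `model.generate(...)` signature are illustrative assumptions that mirror the simplified Python API shown later, not the project's exact interface.

# Minimal sketch of the autoregressive long-video rollout (illustrative only)
def generate_long_video(model, text_instruction, condition_images, num_clips=4):
    all_frames = []
    for _ in range(num_clips):
        clip = model.generate(
            text_instruction=text_instruction,
            condition_images=condition_images,
            embodiment="bridge",           # bridge, rt1, or ur5
        )                                  # one 16-frame clip
        all_frames.extend(clip)
        condition_images = clip[-3:]       # last 3 frames condition the next clip
    return all_frames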
- 150% Average Success Rate Improvement: Compared to methods without augmentation
- Effective Generalization: Excellent performance on unseen tasks and environments
- Dual Dataset Validation: Successfully evaluated on both RT-1 and Bridge datasets
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.0+
# 1. Create conda environment
conda create -n roboimagine python=3.8
conda activate roboimagine
# 2. Clone repository (originmodel branch)
git clone https://github.com/Egbert-Lannister/Robo-Imagine.git
cd Robo-Imagine
# 3. Install dependencies
pip install -r requirements.txt
# Core dependencies include:
# - torch==2.0.0
# - torchvision
# - transformers==4.25.1
# - open_clip_torch==2.22.0
# - pytorch_lightning==1.9.3
# - opencv_python
# - decord==0.6.0
# - xformers
# - einops==0.3.0
Our model is based on the DynamiCrafter pre-trained model with fine-tuning and architectural modifications for robotic video generation.
# Download DynamiCrafter pre-trained model
# Create checkpoints directory
mkdir -p checkpoints
# Download from Hugging Face
# Visit: https://huggingface.co/Doubiiu/DynamiCrafter_512/tree/main
# Download the model files to checkpoints/ directory
# Or use git lfs (if you have git-lfs installed)
cd checkpoints
git lfs clone https://huggingface.co/Doubiiu/DynamiCrafter_512
# The model should be placed at:
# checkpoints/DynamiCrafter_512/model.ckpt
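A quick sanity check that the download is complete and readable is to load the checkpoint on CPU and inspect it; the path below assumes the layout shown above.

# Sanity-check the downloaded checkpoint (path assumes the layout above)
import torch

ckpt = torch.load("checkpoints/DynamiCrafter_512/model.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)   # Lightning checkpoints wrap weights in "state_dict"
print(f"Loaded {len(state_dict)} parameter tensors")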
Model Details:
- Base Model: DynamiCrafter (512×320 resolution)
- Our Modifications: Fine-tuned on robotic datasets (RT-1, Bridge, UR5) with architectural enhancements
- Key Changes: Dynamic consistency augmentation, VLM integration, cross-embodiment conditioning
# Inference example (Python API)
import sys
sys.path.insert(0, '.')

from omegaconf import OmegaConf
from scripts.evaluation.inference import *
from utils.utils import instantiate_from_config  # general helper in utils/utils.py

# Load model: build it from the inference config, then load the checkpoint
config = OmegaConf.load("configs/inference_512_v1.0.yaml")
model = instantiate_from_config(config.model)
model = load_model_checkpoint(model, "path/to/checkpoint.ckpt")
model.eval()

# Prepare input
# Note: Input image size must be 512x320, otherwise output will have black borders
text_instruction = "open the microwave"
condition_images = [img1, img2, img3]  # 3 condition images

# Generate video
video = model.generate(
    text_instruction=text_instruction,
    condition_images=condition_images,
    embodiment="bridge"  # Supported: bridge, rt1, ur5
)
# Modify the three key paths in scripts/run.sh:
# 1. ckpt: Point to the downloaded checkpoint file
# 2. prompt_dir: Input images and prompt text directory
# 3. res_dir: Output video save directory
# Run inference (512 resolution)
sh scripts/run.sh 512
# Input data format
# prompt_dir/
# ├── image1.jpg   # Condition images
# ├── image2.jpg
# ├── image3.jpg
# └── prompts.txt  # One text instruction per line
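If it helps, the expected prompt_dir layout can be assembled with a few lines of Python; the source file names below are placeholders for your own condition frames.

# Assemble a prompt_dir in the layout expected by scripts/run.sh (placeholder file names)
import os, shutil

prompt_dir = "prompts/my_task"
os.makedirs(prompt_dir, exist_ok=True)
for i, src in enumerate(["cond_a.jpg", "cond_b.jpg", "cond_c.jpg"], start=1):
    shutil.copy(src, os.path.join(prompt_dir, f"image{i}.jpg"))   # 3 condition images
with open(os.path.join(prompt_dir, "prompts.txt"), "w") as f:
    f.write("open the microwave\n")                               # one instruction per line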
- ✅ Pick and place operations
- ✅ Drawer manipulation (open/close)
- ✅ Object movement and positioning
- ✅ Container manipulation
- ✅ Kitchen appliance operation (fridge, microwave)
- ✅ Tool manipulation (knife, cutting board)
- ✅ Complex multi-step tasks
- ✅ Object arrangement and organization
- RT-1: 599 robotic manipulation videos
- Bridge: 513 robotic manipulation videos
- berkeley_autolab_ur5: UR5 robotic arm data
# RT1-UR5-Bridge_train.csv example
video_path,text_instruction,embodiment
/data/bridge/fridge_open.mp4,"open the fridge","bridge"
/data/rt1/drawer_close.mp4,"close the drawer","rt1"
/data/ur5/pick_apple.mp4,"pick up the apple","ur5"
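Before launching training, it is worth verifying that every row points at an existing video and uses one of the three embodiment tags; a minimal check using only the standard library could look like this.

# Minimal validation pass over the training CSV (standard library only)
import csv, os

valid_embodiments = {"bridge", "rt1", "ur5"}
with open("RT1-UR5-Bridge_train.csv") as f:
    for row in csv.DictReader(f):
        assert row["embodiment"] in valid_embodiments, f"unknown embodiment: {row}"
        if not os.path.exists(row["video_path"]):
            print("missing video:", row["video_path"])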
- Important Notice: Input images must be resized to 512×320 resolution (see the resize sketch below)
- Otherwise, the output videos will have black borders
- The data loader automatically performs center cropping and normalization
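The resize itself can be done with Pillow; this is a generic preprocessing sketch, not the project's own data loader, and note that PIL takes sizes as (width, height).

# Resize a condition image to 512x320 before inference (generic Pillow sketch)
from PIL import Image

img = Image.open("frame.jpg").convert("RGB")
img = img.resize((512, 320), Image.BICUBIC)   # PIL expects (width, height)
img.save("frame_512x320.jpg")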
# Modify training configuration
# configs/training_512_v1.0/config.yaml key parameters:
# - frame_stride: 2 # Core modification in originmodel branch
# - resolution: [320, 512] # Height×Width
# - video_length: 16
# - meta_path: /path/to/RT1-UR5-Bridge_train.csv
# Start training
sh configs/training_512_v1.0/run.sh
# Training parameter explanation:
# - batch_size: 2
# - learning_rate: 1e-5
# - max_steps: 100000
# - checkpoint_every: 5000 steps
# - Pre-trained model: Based on DynamiCrafter
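Because the training config is a standard OmegaConf YAML, it can also be inspected or tweaked programmatically before launching run.sh; the key path used in the override comment below (data.params.meta_path) is an assumption and may differ from the actual config layout.

# Inspect (and optionally override) the training config; key paths are assumptions
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/training_512_v1.0/config.yaml")
print(OmegaConf.to_yaml(cfg))   # review frame_stride, resolution, video_length, meta_path, etc.
# cfg.data.params.meta_path = "/path/to/RT1-UR5-Bridge_train.csv"    # assumed key path
# OmegaConf.save(cfg, "configs/training_512_v1.0/config_local.yaml")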
# Training checkpoint save path
checkpoints/
├── dynamicrafter_512_v1/model.ckpt       # Pre-trained model
└── training_512_v1.0/
    ├── epoch_174-step_18000.ckpt         # Training checkpoint
    └── trainstep_checkpoints/
RoboImagine/
├── configs/                          # Configuration files
│   ├── inference_512_v1.0.yaml       # Inference configuration
│   └── training_512_v1.0/            # Training configuration
│       ├── config.yaml               # Main configuration file
│       └── run.sh                    # Training script
├── lvdm/                             # Core model code
│   ├── models/
│   │   ├── ddpm3d.py                 # Diffusion model backbone
│   │   ├── autoencoder.py            # VAE encoder
│   │   └── samplers/                 # Samplers
│   ├── modules/
│   │   ├── encoders/
│   │   │   ├── condition.py          # Condition encoder (image+text)
│   │   │   └── resampler.py          # Image resampler
│   │   └── networks/
│   │       └── openaimodel3d.py      # 3D U-Net
│   └── data/
│       └── webvid.py                 # Dataset loader (supports robot types)
├── scripts/                          # Run scripts
│   ├── run.sh                        # Inference script
│   └── evaluation/
│       ├── inference.py              # Inference entry point
│       └── funcs.py                  # Utility functions
├── main/                             # Training related
│   ├── trainer.py                    # Training entry point
│   ├── utils_data.py                 # Data processing utilities
│   └── callbacks.py                  # Training callbacks
├── utils/                            # Utility functions
│   ├── utils.py                      # General utilities
│   └── save_video.py                 # Video saving utilities
└── prompts/                          # Example prompts
    ├── 512/                          # 512 resolution examples
    └── 1024/                         # 1024 resolution examples
- lvdm/models/ddpm3d.py: Diffusion model backbone, handles the spatiotemporal diffusion process
- lvdm/modules/encoders/condition.py: Condition encoder, processes text and image conditions
- lvdm/data/webvid.py: Dataset loader, supports robot type images and video data
- scripts/evaluation/inference.py: Inference entry point, supports batch video generation
- configs/training_512_v1.0/config.yaml: Training configuration file, contains all model parameters
- RT-1 Dataset: 599 robotic manipulation videos with different embodiments and tasks
- Bridge Dataset: 513 robotic manipulation videos with various manipulation scenarios
- Success Rate: 150% improvement compared to methods without augmentation
- Temporal Consistency: Significantly improved through dynamic consistency enhancement
- Generalization Capability: Excellent performance on unseen tasks and environments
@article{robo_imagine_2024,
  title={Robo-Imagine: An Image-Text Conditioned, Generalized Robotic Video Generation Model Across Embodiments and Tasks},
  author={[Authors will be added]},
  journal={[Journal/Conference will be added]},
  year={2024}
}
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Email: egbertlannister@gmail.com
- Issues: Please use GitHub Issues for technical questions
- Project Page: https://egbert-lannister.github.io/Robo-Imagine/
- Thanks to the DynamiCrafter project for providing the base model framework
- Thanks to RT-1 and Bridge dataset contributors
- Thanks to open-source projects like OpenCLIP and PyTorch Lightning
⭐ If this project is helpful to you, please give us a star!
Watch this repository to be notified about the latest updates and feature releases.