Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Overview

   

This repo is the official implementation of Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation.

News

TODO

  • Release inference & training code
  • Release model weights
  • Support more backbone models

Getting started

Setup

git clone https://github.com/AgibotTech/Genie-Envisioner.git
conda create -n genie_envisioner python=3.10.4
conda activate genie_envisioner
pip install -r requirements.txt

Training

GE-Act Post-Training

  1. Download the pretrained GE-Base weights and the tokenizer and VAE weights used in LTX-Video from HuggingFace, then modify the model weight config in configs/ltx_model/video_model.yaml:

    pretrained_model_name_or_path: PATH/TO/PRETRAINED_WEIGHTS_OF_VAE_AND_TOKENIZER
    diffusion_model:
        model_path: PATH/TO/GE_base_{version}.safetensors
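    If you prefer scripting the download, a minimal sketch using huggingface_hub is shown below; the repository IDs are placeholders, so substitute the actual GE-Base and LTX-Video repositories linked above:

    # Minimal download sketch (assumes `pip install huggingface_hub`).
    # The repo IDs below are placeholders; use the GE-Base and LTX-Video
    # repositories referenced in this README.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id="ORG/GE-Base", local_dir="weights/GE-Base")      # GE-Base checkpoints
    snapshot_download(repo_id="ORG/LTX-Video", local_dir="weights/LTX-Video")  # VAE + tokenizer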
    
  2. Build your own LeRobot dataset following the instructions in LeRobot and the AgiBotWorld conversion script.

    File Structure Example:

    ROOT_PATH_TO_YOUR_DATASETS/
    ├── DATASETNAME/
    │   ├── data/
    │   │   ├── episode_000000.parquet
    │   │   ├── episode_000001.parquet
    │   │   ├── ...
    │   │   └── episode_{:06d}.parquet
    │   ├── meta/
    │   │   ├── episodes_stats.jsonl
    │   │   ├── episodes.jsonl
    │   │   ├── tasks.json
    │   │   └── info.json
    │   └── videos/
    │       ├── chunk-000/
    │       │   ├── observation.images.top_head/
    │       │   │   ├── episode_000000.mp4
    │       │   │   ├── episode_000001.mp4
    │       │   │   ├── ...
    │       │   │   └── episode_{:06d}.mp4
    │       │   ├── observation.images.hand_left/
    │       │   │   ├── episode_000000.mp4
    │       │   │   └── ...
    │       │   └── observation.images.hand_right/
    │       │       ├── episode_000000.mp4
    │       │       └── ...
    │       └── ...
    └── ...
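    A quick sanity check of the layout is to count episodes per camera. The sketch below is a minimal example using only the Python standard library; adjust the root path and camera directory names to your dataset:

    # Minimal layout check (Python standard library only).
    # ROOT_PATH_TO_YOUR_DATASETS / DATASETNAME are placeholders, as above.
    from pathlib import Path

    root = Path("ROOT_PATH_TO_YOUR_DATASETS") / "DATASETNAME"
    n_parquet = len(list((root / "data").glob("episode_*.parquet")))
    print(f"parquet episodes: {n_parquet}")
    for cam_dir in sorted((root / "videos").glob("chunk-*/observation.images.*")):
        n_videos = len(list(cam_dir.glob("episode_*.mp4")))
        print(f"{cam_dir.parent.name}/{cam_dir.name}: {n_videos} videos")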
    
  3. Calculate the action statistics and add them to data/utils/statistics.py.

    {
        "DATASETNAME_joint": {
            "mean": [
                0,
                ...
            ],
            "std":[
                1,
                ...
            ]
        },
        "DATASETNAME_delta_joint": {
            "mean": [
                0,
                ...
            ],
            "std":[
                1,
                ...
            ]
        },
        "DATASETNAME_state_joint": {
            "mean": [
                0,
                ...
            ],
            "std":[
                1,
                ...
            ]
        }
    }
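    The statistics can be computed directly from the episode parquet files. The sketch below is a minimal example assuming pandas and numpy are installed and that the action/state columns are named "action" and "observation.state" (as in the policy config later in this README); the frame-to-frame "delta" convention here is illustrative and may differ from the exact convention used by the repo:

    # Minimal sketch for computing the statistics dict above.
    # Assumptions (not from the repo): pandas + pyarrow installed; parquet
    # columns named "action" and "observation.state"; "delta" means the
    # frame-to-frame difference within each episode (illustrative only).
    import glob, json
    import numpy as np
    import pandas as pd

    def load_episodes(files, key):
        return [np.stack(pd.read_parquet(f, columns=[key])[key].to_numpy()) for f in files]

    files = sorted(glob.glob("ROOT_PATH_TO_YOUR_DATASETS/DATASETNAME/data/*.parquet"))
    action_eps = load_episodes(files, "action")
    actions = np.concatenate(action_eps, axis=0)
    deltas = np.concatenate([np.diff(ep, axis=0) for ep in action_eps], axis=0)
    states = np.concatenate(load_episodes(files, "observation.state"), axis=0)

    stats = {
        "DATASETNAME_joint":       {"mean": actions.mean(0).tolist(), "std": actions.std(0).tolist()},
        "DATASETNAME_delta_joint": {"mean": deltas.mean(0).tolist(),  "std": deltas.std(0).tolist()},
        "DATASETNAME_state_joint": {"mean": states.mean(0).tolist(),  "std": states.std(0).tolist()},
    }
    print(json.dumps(stats, indent=4))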
    
  4. Task-specific video adaptation

    As mentioned in our paper, although GE-Base has zero-shot capability, we recommend performing this video adaptation step for unseen robots or new customized tasks to achieve better performance.

    1. Modify the config in configs/ltx_model/video_model_lerobot.yaml. More dataset details can be found in data/utils/*_dataset.py:
    data:
        train / val:
            data_roots:   [ROOT_PATH_TO_YOUR_DATASETS, ]
            domains:      [DATASETNAME, ]
            # rewrite to the camera names used in your dataset
            valid_cam:    ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
            ...
    
    2. Disable the action model as below in configs/ltx_model/video_model_lerobot.yaml:
    return_action: False
    return_video: True
    train_mode: 'video_only'
    diffusion_model:
        config:
            action_expert: False
    
    3. Run
    bash scripts/train.sh main.py configs/ltx_model/video_model_lerobot.yaml
    
  5. Action Post-Training

    1. Modify the config in configs/ltx_model/policy_model_lerobot.yaml:
    diffusion_model:
        model_path: PATH_TO_VIDEO_POST_TRAINING_CHECKPOINT_SAFETENSOR
    data:
        train / val:
            data_roots:   [ROOT_PATH_TO_YOUR_DATASETS, ]
            domains:      [DATASETNAME, ]
            # rewrite to the camera names used in your dataset
            valid_cam:    ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
            # rewrite to the keys used in your dataset
            action_key:   "action"
            state_key:    "observation.state" 
            action_type:  "absolute"  # "absolute", "delta" or "relative"
            action_space: "joint"
            ...
    

    More dataset details can be found in data/utils/*_dataset.py.

    2. Enable the action model as below in configs/ltx_model/policy_model_lerobot.yaml:
    return_action: True
    return_video: False
    train_mode: 'action_full'
    diffusion_model:
        config:
            action_expert: True
    
    3. Run
    bash scripts/train.sh main.py configs/ltx_model/policy_model_lerobot.yaml
    

GE-base Pre-Training

You can also pre-train GE-Base on your own dataset. Here, we take training on AgiBotWorld as an example:

  1. Download 🤗AgiBotWorld

  2. Modify dataset config in configs/ltx_model/video_model.yaml:

    data:
        train / val:
            data_roots: ["path/to/agibot-world/AgiBotWorld-Beta", ]
            task_info_root: ["path/to/agibot-world/AgiBotWorld-Beta/task_info", ]
            domains: ["agibotworld", ]
            ...
            dataset_info_cache_path: "path/to/save/dataset_meta_info_cache"
    
  3. Download the tokenizer and VAE weights used in LTX-Video from HuggingFace and the pretrained GE-Base weights, then modify the model weight config in configs/ltx_model/video_model.yaml:

    pretrained_model_name_or_path: PATH/TO/PRETRAINED_WEIGHTS_OF_VAE_AND_TOKENIZER
    diffusion_model:
        model_path: PATH/TO/GE_base_{version}.safetensors
    
  4. Pre-train Video-Model

    bash scripts/train.sh main.py configs/ltx_model/video_model.yaml
    

Validation

Predict actions and draw an open-loop verification diagram

bash scripts/infer.sh main.py \
    configs/ltx_model/policy_model_lerobot.yaml \
    path/to/trained/checkpoint.safetensors \
    path/to/save/outputs \
    DATASETNAME
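
The saved diagram typically overlays the model's open-loop action predictions on the ground-truth trajectories from the dataset. If you want to render such a comparison yourself, the sketch below is illustrative only (matplotlib, placeholder arrays) and is not the plotting code used by scripts/infer.sh:

# Illustrative only: overlay predicted vs. ground-truth trajectories for one joint.
# The arrays are placeholders; scripts/infer.sh draws its own diagram.
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(100)
gt = np.sin(t / 10.0)                       # placeholder ground-truth joint trajectory
pred = gt + 0.05 * np.random.randn(t.size)  # placeholder open-loop prediction

plt.plot(t, gt, label="ground truth")
plt.plot(t, pred, label="prediction")
plt.xlabel("timestep")
plt.ylabel("joint value")
plt.legend()
plt.savefig("open_loop_check.png")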

GE-Act Deployment

We provide a simple example of deploying a GE-Act server based on openpi:

# GE-Act server
# set $IP_ADDRESS_OF_SERVER to your IP address and $DOMAIN_NAME to DATASETNAME
bash web_infer_scripts/run_server.sh

# A simple client that sends random observations
bash web_infer_scripts/run_simple_client.sh
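
If you need a custom client, the sketch below assumes the openpi-style websocket client interface bundled in web_infer_utils/openpi_client; the observation keys, state dimension, and port are hypothetical, so mirror the exact format used by web_infer_scripts/run_simple_client.sh:

# Hypothetical client sketch assuming the openpi-style websocket client in
# web_infer_utils/openpi_client. Observation keys, state size, and port are
# placeholders; follow web_infer_scripts/run_simple_client.sh for the real format.
import numpy as np
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="IP_ADDRESS_OF_SERVER", port=8000)

obs = {
    "observation.images.top_head": np.zeros((224, 224, 3), dtype=np.uint8),
    "observation.images.hand_left": np.zeros((224, 224, 3), dtype=np.uint8),
    "observation.images.hand_right": np.zeros((224, 224, 3), dtype=np.uint8),
    "observation.state": np.zeros(14, dtype=np.float32),
    "prompt": "pick up the object",
}
result = client.infer(obs)  # dict containing the predicted action chunk
print(result)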

Video Generation

You can generate videos as below:

bash scripts/infer.sh main.py \
    configs/ltx_model/video_model_infer_slow.yaml \
    path/to/trained/checkpoint.safetensors \
    path/to/save/outputs \
    DATASETNAME

We also provide two examples in video_gen_examples and a simple script to generate videos. As described in our paper, the video generation model takes sparse memory frames as input. Therefore, each sample in video_gen_examples includes four multi-view images sampled from history frames.

python examples/infer.py \
    --config_file configs/ltx_model/video_model_infer_slow.yaml \
    --image_root video_gen_examples/sample_0 \
    --prompt_txt_file video_gen_examples/sample_0/prompt.txt \
    --output_path path/to/save/results

As detailed in our paper, we provide two pre-trained video generation models:

  • GE-Base-slow (mid-range frequency video generation, synchronized with action dynamics)
  • GE-Base-fast (low-frequency video generation, optimized for low-latency applications)

When using these models, please select the appropriate configuration file and ensure that the diffusion_model.model_path parameter points to your chosen model weights.

Citation

@article{liao2025genie,
  title={Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation},
  author={Liao, Yue and Zhou, Pengfei and Huang, Siyuan and Yang, Donglin and Chen, Shengcong and Jiang, Yuxin and Hu, Yue and Cai, Jingbin and Liu, Si and Luo, Jianlan and Chen, Liliang and Yan, Shuicheng and Yao, Maoqing and Ren, Guanghui},
  journal={arXiv preprint arXiv:2508.05635},
  year={2025}
}

Acknowledgment

License

Code in the directories models/ltx_models, models/pipeline, and web_infer_utils/openpi_client is modified from Diffusers, LTX-Video, and openpi, and is therefore released under the Apache License 2.0.

Other data and code within this repo are under CC BY-NC-SA 4.0.
