Integrating Natural Language Instructions into the Action Chunking Transformer for Multi-Task Robotic Manipulation

Note: This repository is a fork of the original ACT repository by Tony Z. Zhao. The original repository contains the implementation of the Action Chunking Transformer (ACT) model for robotic manipulation. This fork extends the ACT model to support multi-task robotic manipulation based on natural language instructions. The code and scripts in this repository are used to generate instruction embeddings, record multi-task episodes, train the ACT model, and evaluate the model on unseen instructions.

Training Results


Abstract

We address the challenge of enabling robots to perform multiple manipulation tasks based on natural language instructions by integrating text embeddings into the Action Chunking Transformer (ACT) model developed by Zhao et al. [2023]. Specifically, we modify the ACT architecture to accept embeddings of task instructions generated using the all-mpnet-base-v2 [Song et al., 2020] model, and integrate them into the transformer [Vaswani et al., 2023] encoder’s input. To introduce generalization across task instruction phrasings, we generate a diverse set of paraphrased instructions for each task using GPT-4o [OpenAI et al., 2024] and use random instruction sampling during training to prevent overfitting trajectories to specific instructions. Our experiments, conducted in a simulated Mujoco [Todorov et al., 2012] environment with a bimanual ViperX robot, focus on three tasks: object grasping, stacking, and transfer. We collect two datasets composed of 50 episodes per task with randomized object placements, one with noise applied to trajectories during data collection [Tangkaratt et al., 2021] and one without noise. For each task we use 25 task instruction phrasings during training and hold out 10 for evaluation to assess generalization. The modified ACT model achieves an overall success rate of 89.3% across the three tasks, demonstrating the ability to map unseen task instructions to robotic control sequences.


Results

For details on the experiments, results, and analysis, please refer to the paper.

Dataset                    Training Iteration    Overall (%)    Grasping (%)    Stacking (%)    Transfer (%)
With Noise Injection       22500                 88.7           100.0           70.0            96.0
Without Noise Injection    40000                 89.3           100.0           80.0            88.0

Training Results


Repository Overview

This repository contains code for extending the Action Chunking Transformer (ACT) to enable multi-task robotic manipulation based on natural language instructions. We integrate text embeddings into the ACT architecture, allowing robots to execute tasks in response to natural language instructions.
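
As a rough illustration of what this integration involves (the class and argument names below are ours for explanation, not the repository's actual code), the instruction embedding can be projected to the transformer's hidden width and prepended as one extra token to the ACT encoder input. The sketch assumes a 768-dimensional all-mpnet-base-v2 embedding and the default hidden size of 512:

import torch
import torch.nn as nn

class InstructionConditioning(nn.Module):
    """Illustrative sketch: project an instruction embedding to the model width
    and prepend it to the ACT encoder input sequence. Names and shapes are
    assumptions for explanation, not the repository's exact code."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden_dim)

    def forward(self, encoder_tokens: torch.Tensor, instr_embedding: torch.Tensor) -> torch.Tensor:
        # encoder_tokens: (seq_len, batch, hidden_dim) image/state tokens built by ACT
        # instr_embedding: (batch, embed_dim) all-mpnet-base-v2 sentence embedding
        instr_token = self.proj(instr_embedding).unsqueeze(0)   # (1, batch, hidden_dim)
        return torch.cat([instr_token, encoder_tokens], dim=0)  # (seq_len + 1, batch, hidden_dim)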

The main scripts are as follows:

  • generate_instruction_embeddings.py: Converts raw text instructions into sentence embeddings and saves them to train/val CSV files for use in training.
  • record_multi_task_episodes.py: Records episodes for each task type (grasp, stack, transfer) and saves them to an HDF5 dataset.
  • train.py: Trains the ACT model on the specified multi-task episodes and instruction embeddings.
  • eval_checkpoint.py: Evaluates a trained checkpoint on a given instruction or CSV file with instruction embeddings.

If a Weights & Biases API key is supplied, all results will be logged and uploaded there. Note that the easiest way to use this repository is with the pre-built Docker container on DockerHub.


Docker Container

This project provides a Docker container, hosted on DockerHub, that includes the simulation environment, training scripts, datasets, and all other dependencies needed to reproduce the results. It automatically downloads the pre-generated datasets stored on HuggingFace for training and is especially convenient for running on GPU-based cloud services such as RunPod.io. Note that the image is pre-built for the amd64 architecture and will not run natively on Apple Silicon; in any case, a GPU environment is strongly recommended for training.

Using the Pre-Built Docker Image

  1. Pull the Docker Image:

    docker pull krohling/nl-act
  2. Run the Docker Container: The container includes all necessary dependencies (MuJoCo, dm_control, PyTorch, etc.). You can run it with the default configurations or set environment variables for more control. Here’s an example run command that also mounts a local output directory and sets a few environment variables:

    docker run -it \
        --gpus all \
        -v $(pwd)/output:/opt/ml/output \
        -e WANDB_PROJECT="my_wandb_project" \
        -e WANDB_ENTITY="my_wandb_entity" \
        -e WANDB_API_KEY="my_wandb_api_key" \
        krohling/nl-act

    Replace my_wandb_project, my_wandb_entity, and my_wandb_api_key with your actual Weights & Biases credentials if you wish to log and visualize your training metrics.
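
For reference, logging from Python to Weights & Biases generally looks like the snippet below; the metric names are placeholders rather than the exact keys this repository logs, and credentials are picked up from the WANDB_* environment variables set above:

import os
import wandb

# Assumes WANDB_PROJECT, WANDB_ENTITY, and WANDB_API_KEY are set in the environment.
run = wandb.init(
    project=os.environ["WANDB_PROJECT"],
    entity=os.environ["WANDB_ENTITY"],
)
wandb.log({"train/loss": 0.123, "epoch": 1})  # placeholder metric names
run.finish()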

Building the Docker Image Locally

If you prefer to build the Docker image locally (for example, to tweak dependencies), you can use the provided Dockerfile:

  1. Build the Image:

    docker build -t nl-act .
  2. Run the Image:

    docker run -it \
        --gpus all \
        -v $(pwd)/output:/opt/ml/output \
        -e WANDB_PROJECT="my_wandb_project" \
        -e WANDB_ENTITY="my_wandb_entity" \
        -e WANDB_API_KEY="my_wandb_api_key" \
        nl-act

Environment Variables

Below is a list of environment variables recognized by the Docker container (and used by the train.py script). Each has a default, so override only the ones you need. Note: while the container will run without customizing any of these variables, it is highly recommended to adjust BATCH_SIZE to fit your hardware.

  • LR (default: 1e-5): Learning rate.
  • BATCH_SIZE (default: 1): Batch size used for both training and validation.
  • NUM_EPOCHS (default: 30000): Number of epochs (training iterations) to run.
  • DATASET_DIR (default: ./dataset): Path to the directory containing the .hdf5 training dataset.
  • TRAIN_INSTR_PATH (default: ./data/instruction_embeddings.train.csv): CSV with training instruction embeddings.
  • VAL_INSTR_PATH (default: ./data/instruction_embeddings.val.csv): CSV with validation instruction embeddings.
  • CKPT_DIR (default: ./output/checkpoints): Directory to save model checkpoints.
  • CKPT_FREQUENCY (default: 2500): Save model checkpoints every N epochs.
  • LOAD_CKPT_PATH (default: None): If specified, continue training from this checkpoint.
  • SEED (default: 0): Random seed for reproducibility.
  • EVAL (default: False): Whether to evaluate the model during training.
  • EVAL_INSTR_PATH (default: ./data/instruction_embeddings.val.csv): CSV with evaluation instruction embeddings.
  • EVAL_FREQUENCY (default: 2500): Run evaluation every N epochs (only if EVAL is true).
  • EVAL_WAIT (default: 0): How many epochs to wait before starting evaluations.
  • NUM_ROLLOUTS (default: 10): Number of rollouts to perform for each evaluation block.
  • VIDEOS_DIR (default: ./output/videos): Path to store rendered rollout videos (if any).
  • ONSCREEN_RENDER (default: False): Whether to render the environment onscreen during evaluation.
  • TEMPORAL_AGG (default: False): Enable or disable temporal aggregation at inference.
  • CHUNK_SIZE (default: 100): Number of queries used in the action chunking process.
  • KL_WEIGHT (default: 10): KL divergence weight factor in training.
  • HIDDEN_DIM (default: 512): Hidden dimension size in the transformer.
  • DIM_FEEDFORWARD (default: 3200): Dimensionality of the feedforward layers in the transformer.
  • STATE_DIM (default: 14): Dimensionality of the robot state representation.
  • LR_BACKBONE (default: 1e-5): Learning rate for the vision backbone.
  • BACKBONE (default: resnet18): Vision backbone to use (e.g., resnet18).
  • ENC_LAYERS (default: 4): Number of transformer encoder layers.
  • DEC_LAYERS (default: 7): Number of transformer decoder layers.
  • NHEADS (default: 8): Number of attention heads in the transformer.

Adjusting these variables allows you to customize data paths, training hyperparameters, and evaluation settings without modifying the scripts.
If you have Weights & Biases credentials, set WANDB_PROJECT, WANDB_ENTITY, and WANDB_API_KEY to log and track training metrics.
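
As a rough sketch of how such variables typically reach training code (the helper below is illustrative, not the repository's actual parsing logic), each one is read from the environment with the documented default as a fallback:

import os

def env(name, default, cast=str):
    """Read an environment variable, falling back to the documented default."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

# Illustrative examples using defaults from the list above.
lr = env("LR", 1e-5, float)
batch_size = env("BATCH_SIZE", 1, int)
dataset_dir = env("DATASET_DIR", "./dataset")
run_eval = env("EVAL", False, lambda v: v.lower() == "true")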


Tip: When running on a headless server (e.g., HPC or a cloud GPU instance), make sure to omit --onscreen_render, since it requires a graphical interface.


Local Environment

To set up a local environment:

  1. Clone this repository:

    git clone https://github.com/krohling/nl-act.git
    cd nl-act
  2. Create a Conda Environment and install dependencies:

    conda create -n nl-act python=3.10
    conda activate nl-act
    pip install -r requirements.txt

Note: You may need system dependencies for MuJoCo (see dm_control for instructions) and GPU drivers for PyTorch.
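
A quick, generic sanity check (not a script shipped with this repository) to confirm that PyTorch sees a GPU and that the dm_control/MuJoCo bindings load correctly:

import torch
from dm_control import suite

print("CUDA available:", torch.cuda.is_available())

# Loading any built-in dm_control task exercises the MuJoCo bindings.
env = suite.load(domain_name="cartpole", task_name="swingup")
timestep = env.reset()
print("dm_control OK, observation keys:", list(timestep.observation.keys()))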

  3. Evaluate an example checkpoint:

    git clone https://huggingface.co/kevin510/nl-act checkpoint
    python eval_checkpoint.py \
      --ckpt_path checkpoint/nl-act.ckpt \
      --instruction "Right arm, grasp the red block." \
      --num_rollouts 3 \
      --videos_dir ./output/eval_videos \
      --onscreen_render

Scripts and Usage

  1. generate_instruction_embeddings.py

Use generate_instruction_embeddings.py to convert raw text instructions into sentence embeddings. Each input file should contain one instruction per line. By default, 25 instructions per task are used for training and 10 for validation.

python generate_instruction_embeddings.py  \
  --task_ids 0 1 2  \
  --input_files ./data/grasp-instructions.txt \
                ./data/stack-instructions.txt \
                ./data/transfer-instructions.txt \
  --train_output_file ./data/instruction_embeddings.train.csv \
  --val_output_file   ./data/instruction_embeddings.val.csv

Arguments:

  • --input_files: Paths to text files (one instruction per line).
  • --task_ids: Numeric IDs corresponding to each task file.
  • --train_output_file / --val_output_file: CSV outputs containing (task_id, instruction, embedding) columns.
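
For context, the sketch below shows how such embeddings can be produced with the sentence-transformers implementation of all-mpnet-base-v2. The CSV layout (embedding serialized as a JSON list) and the sample phrasings are assumptions for illustration, not necessarily the exact format the script writes:

import csv
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# task_id 0/1/2 correspond to grasp/stack/transfer, matching the command above.
instructions = [
    (0, "Right arm, grasp the red block."),
    (1, "Stack the blocks."),  # example phrasing
]

with open("instruction_embeddings.example.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["task_id", "instruction", "embedding"])
    for task_id, text in instructions:
        embedding = model.encode(text)  # 768-dimensional sentence embedding
        writer.writerow([task_id, text, json.dumps(embedding.tolist())])
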
  2. record_multi_task_episodes.py

Use record_multi_task_episodes.py to record episodes for each task type (grasp, stack, transfer) and save them as HDF5 files. The output dataset can be used directly for training the ACT model.

python record_multi_task_episodes.py \
  --output_dir ./dataset/act-grasp-stack-transfer.hdf5 \
  --num_episodes 150 \
  --inject_noise \
  --onscreen_render
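
To inspect what was recorded, you can walk the resulting HDF5 data with h5py. The snippet below makes no assumptions about the exact group layout and simply prints every dataset path with its shape and dtype; adjust the path to wherever your episodes were written:

import h5py

path = "./dataset/act-grasp-stack-transfer.hdf5"  # path used in the command above

def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File(path, "r") as f:
    f.visititems(show)
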
  3. train.py

Use train.py to train the ACT model on multi-task episodes and instruction embeddings. The script supports evaluation during training and can save rollout videos. If Weights & Biases credentials are configured in the environment, training metrics and evaluation videos will be logged to the specified project.

python train.py \
  --dataset_dir ./dataset \
  --num_epochs 30000 \
  --batch_size 1 \
  --train_instr_path ./data/instruction_embeddings.train.csv \
  --val_instr_path ./data/instruction_embeddings.val.csv \
  --ckpt_dir ./output/checkpoints \
  --ckpt_frequency 2500 \
  --eval --num_rollouts 1 \
  --eval_instr_path ./data/instruction_embeddings.val.csv \
  --eval_frequency 2500 \
  --videos_dir ./output/videos
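
As described in the abstract, each training sample is paired with a randomly sampled paraphrase of its task instruction so that trajectories do not overfit to a single phrasing. A minimal sketch of that idea, using hypothetical data structures rather than the repository's actual dataloader:

import random

def sample_instruction_embedding(task_id, embeddings_by_task, rng=random):
    """Pick one of the training paraphrase embeddings for the episode's task."""
    return rng.choice(embeddings_by_task[task_id])

# Hypothetical usage: embeddings_by_task would be built from
# instruction_embeddings.train.csv, mapping task_id -> list of 768-d vectors.
embeddings_by_task = {0: [[0.1] * 768, [0.2] * 768]}
instr_embedding = sample_instruction_embedding(0, embeddings_by_task)
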
  4. eval_checkpoint.py

Use eval_checkpoint.py to run offline evaluation of a checkpoint. You can specify either a CSV file with embeddings or a single text instruction:

python eval_checkpoint.py \
  --ckpt_path ./output/checkpoints/policy_checkpoint_10000.ckpt \
  --eval_instr_path ./data/instruction_embeddings.val.csv \
  --num_rollouts 10 \
  --videos_dir ./output/eval_videos \
  --onscreen_render

or provide a single instruction:

python eval_checkpoint.py \
  --ckpt_path ./output/checkpoints/policy_checkpoint_10000.ckpt \
  --instruction "Right arm, grasp the red block." \
  --num_rollouts 3 \
  --videos_dir ./output/eval_videos \
  --onscreen_render

Datasets

We provide two pre-generated datasets for training, hosted on HuggingFace: one recorded with noise injection and one without.

Each dataset is ~75 GB in size and stored in .hdf5 format. The datasets include everything needed for training, including the recorded multi-task episodes as well as the instruction embeddings.


Runtime

These experiments were run on an NVIDIA A40 (48 GB VRAM) GPU on RunPod. With the batch size set to 150, training on one dataset completed in approximately 48 hours. At least 150 GB of disk space is recommended to avoid the process being terminated due to insufficient storage.


References

Citation

If you use this project in your work, please cite the following:

@misc{rohling2024NLACT,
  title={Integrating Natural Language Instructions into the Action Chunking Transformer for Multi-Task Robotic Manipulation},
  author={Rohling, Kevin},
  year={2024},
  howpublished={\url{https://github.com/krohling/nl-act}},
}

Enjoy exploring natural language task specification in ACT-based robotic manipulation! If you have any questions or issues, please file an issue or contact me directly.
