RT-Dino #33

@sebbyjp

Training script for a DinoV2-with-registers backbone, a transformer decoder, and FiLM layers from classifier-free-guidance.

☘️ Shoot an email to sebastian@mbodi.ai if you'd like to tackle this issue, and I'll help as often as I can. I can provide A100 access once the script is ready.

Starter Code
Example Doing Identical task but with MaxViT

Resources

Highly-Recommended Guide to Follow
Transformer Head Code
DinoV2 Source Code
Text Guidance with Film
RT1: Robotics Transformers paper

Tokenize Actions (x, y, z, roll, pitch, yaw, grasp)

Transform pattern: (b f a) -> (b f a bins), where f = frames, a = action dimensions, and bins=255

This is just simple classification, not sequence-to-sequence modeling.

  1. Apply MinMax Scaler

  2. Apply kbins
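The two steps above can be sketched as follows. This is a minimal NumPy illustration, not the starter code: plain NumPy stands in for whatever scaler and k-bins implementation the script ends up using, the binning is assumed to be uniform-width, and the per-dimension `low`/`high` ranges and the helper names `tokenize_actions` and `one_hot` are hypothetical.

```python
import numpy as np

NUM_BINS = 255  # bin count taken from the issue text


def tokenize_actions(actions, low, high, num_bins=NUM_BINS):
    """Map continuous actions (b, f, a) to integer bin ids (b, f, a)."""
    actions = np.asarray(actions, dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    # Step 1: min-max scale each action dimension to [0, 1].
    scaled = (actions - low) / (high - low)
    scaled = np.clip(scaled, 0.0, 1.0)
    # Step 2: uniform-width binning; the top edge falls into the last bin.
    ids = np.floor(scaled * num_bins).astype(np.int64)
    return np.minimum(ids, num_bins - 1)


def one_hot(ids, num_bins=NUM_BINS):
    """Expand bin ids (b, f, a) to one-hot targets (b, f, a, bins)."""
    return np.eye(num_bins, dtype=np.float32)[ids]


# Hypothetical per-dimension ranges for (x, y, z, roll, pitch, yaw, grasp).
low = np.array([-1.0, -1.0, -1.0, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([1.0, 1.0, 1.0, np.pi, np.pi, np.pi, 1.0])

rng = np.random.default_rng(0)
actions = rng.uniform(low, high, size=(2, 6, 7))  # (b, f, a)
token_ids = tokenize_actions(actions, low, high)  # (b, f, a)
targets = one_hot(token_ids)                      # (b, f, a, bins)
```

Each of the seven action dimensions becomes its own 255-way classification target, which is why no sequence-to-sequence decoding is needed. In practice the ranges would presumably be fitted on the training set rather than hard-coded.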

Apply FiLM layers from classifier-free-guidance to condition the image features on the text instruction
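For readers new to FiLM, here is a rough sketch of the general idea (feature-wise scale and shift predicted from a conditioning vector). It is written in NumPy for illustration only and is not the classifier-free-guidance package's API; the function name `film`, the weight names, and all tensor sizes are assumptions.

```python
import numpy as np


def film(features, text_embedding, w_gamma, w_beta):
    """Feature-wise linear modulation of visual tokens by a text embedding.

    features:        (b, n, d) visual tokens
    text_embedding:  (b, t)    one embedding per instruction
    w_gamma, w_beta: (t, d)    projection weights (learned in practice)
    """
    gamma = text_embedding @ w_gamma  # (b, d) per-channel scale offset
    beta = text_embedding @ w_beta    # (b, d) per-channel shift
    # Broadcast over the token axis; (1 + gamma) makes zero weights an identity.
    return features * (1.0 + gamma[:, None, :]) + beta[:, None, :]


rng = np.random.default_rng(0)
b, n, d, t = 2, 5, 8, 4  # hypothetical sizes
features = rng.normal(size=(b, n, d))
text_embedding = rng.normal(size=(b, t))

# With zero projections the layer passes features through unchanged.
identity_out = film(features, text_embedding, np.zeros((t, d)), np.zeros((t, d)))
modulated = film(features, text_embedding,
                 rng.normal(size=(t, d)), rng.normal(size=(t, d)))
```

Initializing the projections at zero, so the layer starts as an identity, is one common choice when adding conditioning to a pretrained backbone; whether it suits this model is an open question for the ablations.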

Inference pattern: (b f c h w), str -> (b f a bins)
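At inference time the (b f a bins) output still has to be turned back into continuous actions. One possible decoding, assuming uniform-width bins over hypothetical per-dimension `low`/`high` ranges, takes the argmax bin and maps its center back through the inverse of the min-max scaling. The random logits below are only a stand-in for model output, and `decode_actions` is an illustrative name.

```python
import numpy as np


def decode_actions(logits, low, high):
    """Turn logits (b, f, a, bins) into continuous actions (b, f, a)."""
    num_bins = logits.shape[-1]
    ids = logits.argmax(axis=-1)         # most likely bin per action dimension
    centers = (ids + 0.5) / num_bins     # bin center in [0, 1]
    return low + centers * (high - low)  # undo the min-max scaling


# Hypothetical ranges for (x, y, z, roll, pitch, yaw, grasp).
low = np.array([-1.0, -1.0, -1.0, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([1.0, 1.0, 1.0, np.pi, np.pi, np.pi, 1.0])

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 6, 7, 255))  # stand-in for model output
decoded = decode_actions(logits, low, high)  # (b, f, a)
```

Decoding to bin centers bounds the quantization error at half a bin width per dimension.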

Details

  • Use PyTorch Lightning, transformers, or fastai (transformers preferred, but fastai is likely easiest)
  • Use a pretrained DinoV2 ViT with registers, small (ViT-S/14) or large (ViT-L/14)
  • Start with a basic encoder-decoder pattern (see the starter code script)

Use the following losses:

Follow-On Work

  • Ablations with early, middle, late fusion
  • Ablations with DinoV2 frozen, DinoV2 without registers, smaller or larger DinoV2
  • Whiten image inputs with PCA
  • AutoAugment with timm
