DinoV2-with-registers backbone, transformer decoder, classifier-free guidance FiLM layers: training script
☘️ Shoot an email to sebastian@mbodi.ai if you'd like to tackle this issue and I'll help as often as I can. I can provide A100 access once the script is ready.
Starter Code
Example Doing Identical task but with MaxViT
Resources
Highly-Recommended Guide to Follow
Transformer Head Code
DinoV2 Source Code
Text Guidance with Film
RT1: Robotics Transformers paper
Tokenize actions (x, y, z, roll, pitch, yaw, grasp):
- Transform pattern: (b frames action) -> (b f a bins), bins=255
- This is simple classification, not sequence-to-sequence modeling
- Apply a MinMax scaler
- Apply KBins discretization
- Apply FiLM layers from classifier-free guidance
- Inference pattern: (b f c h w), str -> (b f a bins)
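The tokenization steps above can be sketched as follows, assuming sklearn is available and actions arrive as a (batch, frames, 7) float array of (x, y, z, roll, pitch, yaw, grasp). All names here are illustrative, not part of the issue's spec.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer

N_BINS = 255  # per the (b f a bins) pattern above

def tokenize_actions(actions: np.ndarray, n_bins: int = N_BINS) -> np.ndarray:
    """(b, frames, 7) continuous actions -> (b, f, 7) integer bin ids in [0, n_bins)."""
    b, f, a = actions.shape
    flat = actions.reshape(-1, a)                # (b*f, a)
    scaled = MinMaxScaler().fit_transform(flat)  # scale each action dim to [0, 1]
    kbins = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform")
    tokens = kbins.fit_transform(scaled)         # (b*f, a) ordinal bin indices
    return tokens.reshape(b, f, a).astype(np.int64)

tokens = tokenize_actions(np.random.uniform(-1, 1, size=(2, 4, 7)))
# tokens.shape == (2, 4, 7), values in [0, 255); expanding to logits over
# 255 bins per action dim gives the (b, f, a, bins) classification target.
```

In a real training script the scaler and discretizer should be fit once on the training set and reused at inference, rather than refit per batch as in this sketch.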
Details
- Use pytorch lightning, transformers, or fastai (transformers preferred but fastai likely easiest)
- Use a pretrained DinoV2 ViT with registers (small or large; ViT-g/14 if resources allow)
- Start with basic encoder-decoder pattern (see the starter code script)
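A minimal PyTorch sketch of that encoder-decoder pattern: frozen DinoV2 patch tokens cross-attended by a small transformer decoder whose learned queries correspond to the action dimensions. The hub entrypoint name and embedding dimension (384 for ViT-S/14) are assumptions; adjust per checkpoint.

```python
import torch
import torch.nn as nn

class ActionDecoderHead(nn.Module):
    def __init__(self, embed_dim=384, n_actions=7, n_bins=255, n_layers=2, n_heads=6):
        super().__init__()
        # one learned query per action dimension (x, y, z, roll, pitch, yaw, grasp)
        self.queries = nn.Parameter(torch.randn(n_actions, embed_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(embed_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(embed_dim, n_bins)  # logits over 255 bins

    def forward(self, patch_tokens):  # (b, n_patches, embed_dim) from the backbone
        b = patch_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (b, n_actions, d)
        x = self.decoder(q, patch_tokens)  # queries cross-attend to image tokens
        return self.classifier(x)          # (b, n_actions, n_bins)

# backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg")  # needs network
head = ActionDecoderHead()
logits = head(torch.randn(2, 256, 384))  # random tensors stand in for DinoV2 patch tokens
# logits.shape == (2, 7, 255); train with cross-entropy against the bin ids
```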
Use the following losses and augmentations:
- Asymmetric loss: https://timm.fast.ai/asymmetric_loss
- CutMix / MixUp: https://timm.fast.ai/random_resized_crop
- Standard image augmentations: https://timm.fast.ai/random_resized_crop
- Repeat with FiLM layers added to the ViT blocks and no caption input, using classifier-free guidance (again, refer to the MaxViT example above).
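One way the FiLM-with-caption-dropout step could look, sketched under the assumption that the text encoder produces a fixed-size embedding per caption: with some probability the caption embedding is zeroed during training, so the model also learns the unconditional mapping that classifier-free guidance requires. All names are illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation of ViT tokens by a caption embedding."""
    def __init__(self, text_dim=512, feat_dim=384):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim, 2 * feat_dim)

    def forward(self, x, text_emb, drop_prob=0.1):
        # x: (b, tokens, feat_dim); text_emb: (b, text_dim)
        if self.training:
            # classifier-free guidance: randomly drop the caption per sample
            keep = (torch.rand(text_emb.shape[0], 1, device=text_emb.device) > drop_prob).float()
            text_emb = text_emb * keep
        gamma, beta = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        return x * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

film = FiLM().eval()
out = film(torch.randn(2, 16, 384), torch.randn(2, 512))
# out.shape == (2, 16, 384); one such layer would be inserted per ViT block
```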
Follow-On Work
- Ablations with early, middle, late fusion
- Ablations with DinoV2 frozen, DinoV2 without registers, smaller or larger DinoV2
- Whiten image inputs with PCA
- AutoAugment with timm
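For the PCA-whitening ablation above, a minimal sketch with sklearn: fit `PCA(whiten=True)` on flattened training images, then feed the decorrelated, unit-variance components downstream. Sizes here are toy values; a real pipeline would fit on the training set once and reuse the transform.

```python
import numpy as np
from sklearn.decomposition import PCA

images = np.random.rand(32, 3 * 16 * 16)  # 32 tiny flattened RGB images (toy data)
pca = PCA(n_components=16, whiten=True).fit(images)
whitened = pca.transform(images)  # decorrelated components with ~unit variance
# whitened.shape == (32, 16)
```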