# Self-supervised Vision Transformer Pretraining on Anim400K + AnitaDataset
- 🚀 Overview
- 🧠 Architecture
- 📦 Datasets
- ⚙️ Training Pipeline
- 🧪 Inference
- 📁 Code Structure
- ☁️ Deployment
- 📊 Results
- 🔬 Applications
- 📚 Citation
- 👤 Author
- 🚧 Roadmap
## 🚀 Overview

This project implements a foundation model for anime vision using:
- Masked Autoencoders (MAE) + Vision Transformer (ViT)
- Large-scale self-supervised learning
- Modular training pipeline
- Based on: Anim400K & AnitaDataset
## 🧠 Architecture

<details>
<summary>Click to view MAE + ViT pipeline</summary>

```text
Input Image (224x224)
  → Patch Embedding (16x16 patches)
  → ViT Encoder (masked tokens)
  → Lightweight MLP Decoder
  → Reconstructed Image (MSE Loss)
```

</details>
- ✅ Encoder: ViT-B/16 (timm pretrained)
- ✅ Decoder: Shallow MLP
- ✅ Loss: Mean squared error (masked pixels only)
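Below is a minimal sketch of how these pieces could fit together. The `MAEWrapper` name comes from `models/` in the code structure, but the constructor arguments, the zero-out masking shortcut, and the decoder widths are illustrative assumptions rather than the exact implementation:

```python
import torch
import torch.nn as nn
import timm


class MAEWrapper(nn.Module):
    """Minimal MAE: timm ViT-B/16 encoder + shallow MLP decoder (illustrative sketch)."""

    def __init__(self, mask_ratio=0.75, patch_size=16, img_size=224):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2              # 196 patches for 224/16
        self.encoder = timm.create_model("vit_base_patch16_224", pretrained=True)
        # Shallow MLP decoder: token embedding -> pixels of one 16x16x3 patch
        self.decoder = nn.Sequential(
            nn.Linear(self.encoder.embed_dim, 512),
            nn.GELU(),
            nn.Linear(512, patch_size * patch_size * 3),
        )

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, num_patches, patch_size**2 * 3)
        p = self.patch_size
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(B, self.num_patches, p * p * C)

    def forward(self, imgs):
        tokens = self.encoder.patch_embed(imgs)                       # (B, N, D)
        tokens = tokens + self.encoder.pos_embed[:, 1:, :]            # skip the cls position
        # Random per-sample mask: True = masked; loss is computed only on these patches.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        # Simplification: zero out masked tokens instead of dropping them
        # (the full MAE recipe drops them and reinserts learned mask tokens in the decoder).
        x = self.encoder.norm(self.encoder.blocks(tokens.masked_fill(mask.unsqueeze(-1), 0.0)))
        pred = self.decoder(x)                                        # (B, N, p*p*3)
        target = self.patchify(imgs)
        loss = ((pred - target) ** 2)[mask].mean()                    # MSE on masked patches only
        return loss, pred, mask
```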
## 📦 Datasets

### 🎥 Anim400K (pretraining)

```text
datasets/anim400k/
├── video_clips/      # MP4 clips in folders
├── frames/           # Extracted images {video_id}/frame_XXXX.jpg
├── audio_clips/      # Optional .wav files
├── character_pics/   # Reference character images
└── splits.json       # Annotations
```
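A `FrameDataset` (listed under `datasets/` in the code structure) might load these extracted frames roughly as follows; the transform choices here are assumptions, not the exact preprocessing used for pretraining:

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class FrameDataset(Dataset):
    """Loads extracted frames from frames/{video_id}/frame_XXXX.jpg."""

    def __init__(self, root, img_size=224):
        self.paths = sorted(Path(root).rglob("frame_*.jpg"))
        self.transform = transforms.Compose([
            transforms.Resize(img_size),
            transforms.CenterCrop(img_size),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img)
```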
### 🎴 AnitaDataset (fine-tuning)

```text
datasets/anitadataset/
├── images/
├── annotations.json
└── metadata.json
```
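For fine-tuning, the images are paired with entries from `annotations.json`. The sketch below assumes a simple list-of-records schema (`file_name`/`label`), which is only a placeholder; adapt it to the actual annotation format:

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class AnitaDataset(Dataset):
    """Pairs images/ with entries from annotations.json (assumed schema, see note above)."""

    def __init__(self, img_root, annotation_file, transform=None):
        self.img_root = Path(img_root)
        with open(annotation_file) as f:
            # Placeholder schema: a list of {"file_name": ..., "label": ...} records.
            self.records = json.load(f)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        img = Image.open(self.img_root / rec["file_name"]).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, rec["label"]
```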
## ⚙️ Training Pipeline

### MAE Pretraining (Anim400K)

```bash
python train_mae.py \
  --data_root datasets/anim400k/frames \
  --epochs 100 \
  --batch_size 64 \
  --lr 1e-4 \
  --ckpt_dir checkpoints/
```
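Conceptually, the loop behind this command looks like the sketch below, reusing the `MAEWrapper` and `FrameDataset` sketches from the sections above; `train_mae.py` may differ in details such as LR scheduling, augmentation, and logging:

```python
import torch
from torch.utils.data import DataLoader

# MAEWrapper / FrameDataset are the sketch classes defined earlier in this README.
model = MAEWrapper().cuda()
dataset = FrameDataset("datasets/anim400k/frames")
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8, drop_last=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(100):
    for imgs in loader:
        loss, _, _ = model(imgs.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    torch.save(model.state_dict(), f"checkpoints/mae_epoch_{epoch + 1}.pt")
```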
### Finetuning (AnitaDataset)

```bash
python train_anita.py \
  --data_root datasets/anitadataset/images \
  --annotations datasets/anitadataset/annotations.json \
  --pretrained_ckpt checkpoints/mae_epoch_100.pt \
  --epochs 30 \
  --lr 5e-5
```
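Fine-tuning only needs the pretrained encoder weights; the MAE decoder can be discarded. A hedged sketch of how the checkpoint could be loaded (the actual logic lives in `train_anita.py`, and the key prefixes assume the `MAEWrapper` sketch above):

```python
import torch

ckpt = torch.load("checkpoints/mae_epoch_100.pt", map_location="cpu")
# Keep only encoder weights; the MAE decoder is not needed downstream.
encoder_state = {k.replace("encoder.", "", 1): v
                 for k, v in ckpt.items() if k.startswith("encoder.")}

model = MAEWrapper()  # or a task-specific head on top of the same ViT-B/16 backbone
missing, unexpected = model.encoder.load_state_dict(encoder_state, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```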
## 🧪 Inference

Use `scripts/infer_reconstruction.py`:

```bash
python scripts/infer_reconstruction.py \
  --model_ckpt checkpoints/mae_epoch_100.pt \
  --image_path datasets/anim400k/frames/1234/frame_0001.jpg
```

**Output:** Original + Masked + Reconstructed grid
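A rough sketch of how such a grid can be assembled with the `MAEWrapper` sketch above; the real logic lives in `scripts/infer_reconstruction.py`, and loading the checkpoint into the sketch class assumes the key layout matches:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

model = MAEWrapper()  # sketch class from the Architecture section
model.load_state_dict(torch.load("checkpoints/mae_epoch_100.pt", map_location="cpu"))
model.eval()

tfm = transforms.Compose([transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor()])
img = tfm(Image.open("datasets/anim400k/frames/1234/frame_0001.jpg").convert("RGB")).unsqueeze(0)


def unpatchify(patches, p=16, grid=14):
    # (B, N, p*p*3) -> (B, 3, H, W); inverse of MAEWrapper.patchify
    B = patches.shape[0]
    x = patches.reshape(B, grid, grid, p, p, 3)
    return x.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, grid * p, grid * p)


with torch.no_grad():
    _, pred, mask = model(img)

masked = unpatchify(model.patchify(img).masked_fill(mask.unsqueeze(-1), 0.0))
recon = unpatchify(pred)
save_image(torch.cat([img, masked, recon]), "reconstruction_grid.png", nrow=3)
```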
## 📁 Code Structure

```text
.
├── train_mae.py     # MAE Pretraining
├── train_anita.py   # Finetuning
├── models/          # MAEWrapper, ViT, Decoder
├── config/          # YAML configs
├── datasets/        # FrameDataset, Annotations
├── scripts/         # Inference, frame extractor
└── README.md
```
## ☁️ Deployment

### Cloud Upload (GCP Bucket)

```bash
pip install gcsfs google-cloud-storage

# Upload datasets
gsutil cp -r datasets/anim400k gs://anim-foundation-avinash/
gsutil cp -r datasets/anitadataset gs://anim-foundation-avinash/
```
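Once uploaded, the data can be read straight from the bucket via `gcsfs`, e.g. from a cloud VM. The snippet below assumes the bucket mirrors the local `datasets/` layout shown above and that default GCP credentials are configured:

```python
import gcsfs
from PIL import Image

fs = gcsfs.GCSFileSystem()  # picks up default GCP credentials
paths = fs.glob("anim-foundation-avinash/anim400k/frames/*/frame_*.jpg")
print(f"{len(paths)} frames visible in the bucket")

with fs.open(paths[0], "rb") as f:
    img = Image.open(f).convert("RGB")
print(img.size)
```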
### GitHub Project Integration

```bash
git init
git remote add origin https://github.com/avinash064/Avinash.git
git add .
git commit -m "Initial commit: MAE Foundation Model"
git push origin main
```
## 📊 Results

| Dataset | MAE Loss ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| Anim400K | 0.029 | 23.4 | 0.78 |
| AnitaDataset | 0.025 | 24.9 | 0.82 |
## 🔬 Applications

- Anime Character Reconstruction
- Facial Expression Synthesis
- Pose Transfer
- Lip Syncing
- Video Super-Resolution
## 📚 Citation

```bibtex
@misc{Avinash2025MAEFoundation,
  author       = {Avinash Kashyap},
  title        = {MAE Pretraining on Anim400K and AnitaDataset},
  year         = {2025},
  howpublished = {\url{https://github.com/avinash064/Avinash}},
}
```
## 👤 Author

**Avinash Kashyap**
🎓 AI & Medical Imaging | 🔬 Deep Learning | 🚀 Foundation Models
🔗 GitHub · LinkedIn
## 🚧 Roadmap

- [x] MAE Pretraining (Anim400K)
- [x] Finetuning (AnitaDataset)
- [x] Cloud Upload (GCP Bucket)
- [x] GitHub Project Integration
- [ ] Add Audio + Character Fusion
- [ ] Add HuggingFace Model Card
- [ ] Publish Paper + Demo Site
❤️ Star this repo and tag @avinash064 if you use this project!