
🧠 MAE-ViT Foundation Model for Anime Understanding

Self-supervised Vision Transformer Pretraining on Anim400K + AnitaDataset


📑 Table of Contents

  • 🚀 Overview
  • 🧠 Architecture
  • 📦 Datasets
  • ⚙️ Training Pipeline
  • 🧪 Inference
  • 📁 Code Structure
  • ☁️ Deployment
  • 📊 Results
  • 🔬 Applications
  • 📚 Citation
  • 👤 Author
  • 🚧 Roadmap

🚀 Overview

This project implements a foundation model for anime vision using:

  • Masked Autoencoders (MAE) + Vision Transformer (ViT)
  • Large-scale self-supervised learning
  • Modular training pipeline
  • Datasets: Anim400K (pretraining) & AnitaDataset (fine-tuning)

🧠 Architecture

MAE + ViT pipeline:
Input Image (224x224)
→ Patch Embedding (16x16 patches)
→ ViT Encoder (masked tokens)
→ Lightweight MLP Decoder
→ Reconstructed Image (MSE Loss)
  • ✅ Encoder: ViT-B/16 (timm pretrained)
  • ✅ Decoder: Shallow MLP
  • ✅ Loss: Mean squared error (masked pixels only)
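Below is a minimal sketch of how this pipeline can be wired up with timm. The MAEWrapper name comes from models/, but its internals here are illustrative: this sketch replaces masked patch embeddings with a learned mask token and runs the full encoder, a simplification of canonical MAE (whose encoder sees only the visible patches). The patchify and mae_loss helpers are assumptions, not the actual code.

import torch
import torch.nn as nn
import timm


class MAEWrapper(nn.Module):
    """Sketch: ViT-B/16 encoder over masked patch tokens + lightweight MLP pixel decoder."""

    def __init__(self, mask_ratio: float = 0.75, patch_size: int = 16):
        super().__init__()
        self.mask_ratio = mask_ratio
        # timm-pretrained ViT-B/16 backbone; num_classes=0 drops the classification head
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        dim = self.vit.embed_dim                                  # 768 for ViT-B
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))    # stands in for hidden patches
        # Shallow MLP decoder: predicts the raw 16x16x3 pixels of every patch
        self.decoder = nn.Sequential(
            nn.Linear(dim, 512), nn.GELU(),
            nn.Linear(512, patch_size * patch_size * 3),
        )

    def forward(self, imgs):
        tokens = self.vit.patch_embed(imgs)                       # (B, 196, 768) for 224x224 input
        B, N, _ = tokens.shape
        mask = (torch.rand(B, N, device=imgs.device) < self.mask_ratio).float()  # 1 = masked
        tokens = tokens * (1 - mask).unsqueeze(-1) + self.mask_token * mask.unsqueeze(-1)
        tokens = tokens + self.vit.pos_embed[:, 1:, :]            # skip the [CLS] position
        encoded = self.vit.norm(self.vit.blocks(tokens))
        return self.decoder(encoded), mask                        # per-patch pixel predictions


def patchify(imgs, p: int = 16):
    """(B, 3, H, W) -> (B, N, p*p*3) ground-truth patches for the loss."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)


def mae_loss(pred, target, mask):
    """MSE computed on masked patches only."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)               # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)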

📦 Datasets

🎥 Anim400K (pretraining)
datasets/anim400k/
├── video_clips/             # MP4 clips in folders
├── frames/                  # Extracted images {video_id}/frame_XXXX.jpg
├── audio_clips/             # Optional .wav files
├── character_pics/          # Reference character images
└── splits.json              # Annotations
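The frames/ layout above maps naturally onto a flat image dataset. A minimal sketch of what the FrameDataset in datasets/ could look like (the glob pattern and transform handling are assumptions):

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset


class FrameDataset(Dataset):
    """Yields single frames from datasets/anim400k/frames/{video_id}/frame_XXXX.jpg."""

    def __init__(self, root, transform=None):
        self.paths = sorted(Path(root).glob("*/frame_*.jpg"))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img
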
🎴 AnitaDataset (fine-tuning)
datasets/anitadataset/
├── images/
├── annotations.json
└── metadata.json

⚙️ Training Pipeline

🧠 MAE Pretraining (on Anim400K)

python train_mae.py \
  --data_root datasets/anim400k/frames \
  --epochs 100 \
  --batch_size 64 \
  --lr 1e-4 \
  --ckpt_dir checkpoints/
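The inner loop behind these flags might look roughly like the following. This is a sketch that reuses the FrameDataset, MAEWrapper, patchify, and mae_loss sketches above, not the actual contents of train_mae.py.

import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Illustrative wiring of the CLI flags above onto the earlier sketches.
tfm = transforms.Compose([transforms.Resize(256),
                          transforms.RandomCrop(224),
                          transforms.ToTensor()])
loader = DataLoader(FrameDataset("datasets/anim400k/frames", tfm),
                    batch_size=64, shuffle=True, num_workers=8, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MAEWrapper().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

for epoch in range(100):
    for imgs in loader:
        imgs = imgs.to(device, non_blocking=True)
        pred, mask = model(imgs)
        loss = mae_loss(pred, patchify(imgs), mask)   # MSE on masked patches only
        opt.zero_grad()
        loss.backward()
        opt.step()
    torch.save(model.state_dict(), f"checkpoints/mae_epoch_{epoch + 1}.pt")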

🎯 Fine-tuning (on AnitaDataset)

python train_anita.py \
  --data_root datasets/anitadataset/images \
  --annotations datasets/anitadataset/annotations.json \
  --pretrained_ckpt checkpoints/mae_epoch_100.pt \
  --epochs 30 \
  --lr 5e-5
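The key step here is loading the pretrained encoder weights from the MAE checkpoint before attaching a task head. A sketch, assuming the checkpoint was saved from the MAEWrapper sketch above (so encoder keys carry a "vit." prefix); NUM_CLASSES is a placeholder, since the real head depends on annotations.json:

import timm
import torch

NUM_CLASSES = 10  # placeholder: the actual head depends on annotations.json

# Keep only the ViT encoder weights from the MAE checkpoint ("vit." prefix in the sketch above).
ckpt = torch.load("checkpoints/mae_epoch_100.pt", map_location="cpu")
encoder_state = {k.removeprefix("vit."): v for k, v in ckpt.items()
                 if k.startswith("vit.") and not k.startswith("vit.head")}

model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=NUM_CLASSES)
missing, unexpected = model.load_state_dict(encoder_state, strict=False)
print(f"missing: {len(missing)} keys, unexpected: {len(unexpected)} keys")  # only the new head should be missing

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # matches --lr 5e-5 above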

🧪 Inference

Use scripts/infer_reconstruction.py:

python scripts/infer_reconstruction.py \
  --model_ckpt checkpoints/mae_epoch_100.pt \
  --image_path datasets/anim400k/frames/1234/frame_0001.jpg

Output: a side-by-side grid of the original, masked, and reconstructed images.
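A rough idea of what that script could do, again building on the MAEWrapper and patchify sketches above (the real script may differ):

import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

# Illustrative reconstruction demo built on the earlier sketches.
model = MAEWrapper()
model.load_state_dict(torch.load("checkpoints/mae_epoch_100.pt", map_location="cpu"))
model.eval()

img = Image.open("datasets/anim400k/frames/1234/frame_0001.jpg").convert("RGB")
x = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])(img).unsqueeze(0)

with torch.no_grad():
    pred, mask = model(x)

def unpatchify(patches, p=16, hw=14):
    """(B, N, p*p*3) -> (B, 3, 224, 224); inverse of patchify."""
    B = patches.shape[0]
    t = patches.reshape(B, hw, hw, p, p, 3).permute(0, 5, 1, 3, 2, 4)
    return t.reshape(B, 3, hw * p, hw * p)

masked = unpatchify(patchify(x) * (1 - mask).unsqueeze(-1))   # masked patches blacked out
recon = unpatchify(pred.clamp(0, 1))
save_image(torch.cat([x, masked, recon]), "reconstruction_grid.png", nrow=3)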


📁 Code Structure

.
├── train_mae.py            # MAE Pretraining
├── train_anita.py          # Fine-tuning
├── models/                 # MAEWrapper, ViT, Decoder
├── config/                 # YAML configs
├── datasets/               # FrameDataset, Annotations
├── scripts/                # Inference, frame extractor
└── README.md

☁️ Deployment

☁️ Upload to Google Cloud

pip install gcsfs google-cloud-storage

# Upload datasets
gsutil cp -r datasets/anim400k gs://anim-foundation-avinash/
gsutil cp -r datasets/anitadataset gs://anim-foundation-avinash/
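With gcsfs installed, training code can also stream frames directly from the bucket instead of a local copy. A small illustrative read, assuming the bucket layout produced by the gsutil commands above:

import gcsfs
from PIL import Image

# Read one frame straight from the bucket populated by the gsutil commands above.
fs = gcsfs.GCSFileSystem()  # picks up default GCP credentials
with fs.open("gs://anim-foundation-avinash/anim400k/frames/1234/frame_0001.jpg", "rb") as f:
    frame = Image.open(f).convert("RGB")
print(frame.size)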

🌐 Push to GitHub

git init
git remote add origin https://github.com/avinash064/Avinash.git
git add .
git commit -m "Initial commit: MAE Foundation Model"
git push origin main

📊 Results

Dataset        MAE Loss ↓   PSNR ↑   SSIM ↑
Anim400K       0.029        23.4     0.78
AnitaDataset   0.025        24.9     0.82
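For reference, PSNR and SSIM values like those above can be computed with torchmetrics. A minimal sketch, assuming reconstructions and originals are (B, 3, H, W) tensors scaled to [0, 1]:

import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure

def reconstruction_metrics(recon: torch.Tensor, original: torch.Tensor):
    """PSNR / SSIM for image batches scaled to [0, 1]."""
    psnr = PeakSignalNoiseRatio(data_range=1.0)
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
    return psnr(recon, original).item(), ssim(recon, original).item()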

🔬 Applications

  • Anime Character Reconstruction
  • Facial Expression Synthesis
  • Pose Transfer
  • Lip Syncing
  • Video Super-Resolution

📚 Citation

@misc{Avinash2025MAEFoundation,
  author = {Avinash Kashyap},
  title = {MAE Pretraining on Anim400K and AnitaDataset},
  year = 2025,
  howpublished = {\url{https://github.com/avinash064/Avinash}},
}

👤 Author

Avinash Kashyap
🎓 AI & Medical Imaging | 🔬 Deep Learning | 🚀 Foundation Models
🔗 GitHub · LinkedIn


🚧 Roadmap

  • MAE Pretraining (Anim400K)
  • Fine-tuning (AnitaDataset)
  • Cloud Upload (GCP Bucket)
  • GitHub Project Integration
  • Add Audio + Character Fusion
  • Add HuggingFace Model Card
  • Publish Paper + Demo Site

❤️ Star this repo and tag @avinash064 if you use this project!
