Training-free, mask-guided attention for sharper foreground details in video generation.
- Temporal Attention: Improves consistency and coherence across video frames by leveraging FreeLong's temporal attention (a simplified sketch follows this list).
- Mask Attention: Enables fine-grained control over object-level details through mask-driven attention refinement.
- Text Prompt Refinement: Users can specify objects or regions of interest in the text prompt to guide the video regeneration process.
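Below is a minimal, hedged sketch in the spirit of FreeLong's frequency-domain blending of temporal attention: low temporal frequencies are taken from attention computed over all frames (global consistency), high frequencies from attention computed inside a local window (per-frame detail). The function name, the window/cutoff values, and the temporal-only FFT are simplifying assumptions for illustration, not this repository's exact implementation.

```python
import torch

def spectral_blend(global_feat: torch.Tensor, local_feat: torch.Tensor,
                   cutoff: float = 0.25) -> torch.Tensor:
    """global_feat, local_feat: (frames, tokens, dim) outputs of two temporal-attention passes.
    Blend them in the temporal frequency domain (cutoff is an illustrative value)."""
    f_global = torch.fft.rfft(global_feat, dim=0)
    f_local = torch.fft.rfft(local_feat, dim=0)
    freqs = torch.fft.rfftfreq(global_feat.shape[0])              # normalized temporal frequencies
    low = (freqs <= cutoff).view(-1, 1, 1).to(global_feat.dtype)  # 1.0 on low-frequency bins
    blended = f_global * low + f_local * (1.0 - low)              # low freq: global, high freq: local
    return torch.fft.irfft(blended, n=global_feat.shape[0], dim=0)
```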
- Initial Video Generation: Generate a rough video sequence from the user's text prompt using the base model.
- Object Localization: Use GroundingDINO with the user-specified target objects to locate bounding boxes in the initial video frames (see the localization-and-segmentation sketch after this list).
- Mask Extraction: Feed the detected bounding boxes into the Segment Anything Model (SAM) to produce foreground masks for the target objects.
- Attention Masking: Incorporate the foreground masks into the attention mechanism by updating the attention mask (see the attention-bias sketch after this list).
- Video Regeneration: Regenerate the video from the original text prompt, guided by the refined attention masks, to produce a more detailed result.
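The following sketch shows one way the Object Localization and Mask Extraction steps can be wired together with the public GroundingDINO and SAM APIs. The checkpoint paths, thresholds, and the helper function name are assumptions for illustration, not the repository's actual code.

```python
import numpy as np
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint and config paths below are placeholders; adjust to your setup.
dino = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "pretrained_models/groundingdino_swint_ogc.pth",
)
sam = sam_model_registry["vit_h"](checkpoint="pretrained_models/sam_vit_h_4b8939.pth").to("cuda")
sam_predictor = SamPredictor(sam)

def frame_foreground_mask(frame_path: str, grounding_prompt: str) -> np.ndarray:
    """Return a binary (H, W) foreground mask for one frame of the initial video."""
    image_source, image = load_image(frame_path)  # RGB array + normalized tensor
    boxes, logits, phrases = predict(
        model=dino, image=image, caption=grounding_prompt,
        box_threshold=0.35, text_threshold=0.25,  # thresholds are illustrative
    )
    # GroundingDINO returns normalized cxcywh boxes; SAM expects absolute xyxy pixels.
    h, w, _ = image_source.shape
    boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]),
                             in_fmt="cxcywh", out_fmt="xyxy").to("cuda")

    # Prompt SAM with the detected boxes (assumes at least one detection).
    sam_predictor.set_image(image_source)
    boxes_t = sam_predictor.transform.apply_boxes_torch(boxes_xyxy, (h, w))
    masks, _, _ = sam_predictor.predict_torch(
        point_coords=None, point_labels=None, boxes=boxes_t, multimask_output=False,
    )  # (num_boxes, 1, H, W) boolean masks
    # Union over all detected instances of the target object.
    return masks.any(dim=0).squeeze(0).cpu().numpy()
```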
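For the Attention Masking step, one simple way to fold a foreground mask into spatial attention is to downsample the mask to the token grid and add a positive bias to attention logits for keys that fall on the foreground. The additive-bias formulation, the `boost` value, and the function names are assumptions shown purely as an illustration of the idea, not this project's exact scheme.

```python
import torch
import torch.nn.functional as F

def build_attention_bias(mask: torch.Tensor, latent_h: int, latent_w: int,
                         boost: float = 2.0) -> torch.Tensor:
    """mask: (H, W) binary foreground mask for one frame.
    Returns a per-token additive bias of shape (latent_h * latent_w,)."""
    small = F.interpolate(mask[None, None].float(), size=(latent_h, latent_w), mode="nearest")
    return boost * small.flatten()               # nonzero only on foreground tokens

def masked_attention(q, k, v, bias):
    """q, k, v: (batch, heads, tokens, dim); bias is added along the key axis."""
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bhid,bhjd->bhij", q, k) * scale
    logits = logits + bias.view(1, 1, 1, -1)     # keys on the foreground get extra attention
    attn = logits.softmax(dim=-1)
    return torch.einsum("bhij,bhjd->bhid", attn, v)
```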
Install the required packages using the provided environment.yaml file:
conda env create -f environment.yaml
conda activate MRVG
Download the pre-trained LaVie models, Stable Diffusion 1.4, and stable-diffusion-x4-upscaler to ./pretrained_models. You should then see the following:
├── pretrained_models
│   ├── lavie_base.pt
│   ├── lavie_interpolation.pt
│   ├── lavie_vsr.pt
│   ├── stable-diffusion-v1-4
│   │   ├── ...
│   └── stable-diffusion-x4-upscaler
│       ├── ...
Follow the official instructions for each model to integrate it into the project (clone SAM and GroundingDINO).
Below is the complete project structure:
├── MRVG
├── GroundingDINO
├── pretrained_models
├── segment-anything
├── environment.yaml
└── requirements.txt
Specify your text prompt and target objects in configs/sample_mrvg.yaml:
text_prompt: [
"Your descriptions here"
]
grounding_prompt: "Your grounding object"
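For example, a config that targets a single foreground object could look like this (the prompt and object name are placeholders for illustration only):

```yaml
text_prompt: [
  "A corgi running on the beach at sunset"
]
grounding_prompt: "corgi"
```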
Then run the entire pipeline with:
bash start.sh