Description
Video models are arriving one after another (most recently Wan2.1), but they are becoming increasingly heavy and impossible to run on a local machine with an ordinary decent GPU. I understand that this reflects research priorities and that academic labs often have substantial training resources, but ultimately, who is the production of short videos of a few seconds intended for?
I have used the LTX Video model quite a bit; despite all its shortcomings, it is the only model that is pleasant to use. On my small 16 GB GPU it can produce a 10-second video at 512x896 (a quarter of 4K UHD per dimension), respecting the requirement that both dimensions be divisible by 32, and it does so in one minute. Moreover, it is not so much the sampling that consumes VRAM as the VAE decoding. Hugging Face now offers a free online decoding service.
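To illustrate the divisible-by-32 constraint mentioned above, here is a minimal helper (the multiple-of-32 rule comes from LTX Video's usage notes; the function itself is my own sketch, not part of any library):

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round a pixel dimension down to the nearest multiple.

    LTX Video expects width and height divisible by 32, so a target
    like 900 px must be reduced to 896 px before generation.
    """
    return (value // multiple) * multiple


# 512x896 already satisfies the constraint; 900 px does not.
print(snap_to_multiple(512), snap_to_multiple(900))  # 512 896
```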
There are some issues with the LTX Video model:
- Training was done on poorly curated data. Generated videos sometimes carry the "BBC" logo, and the output often includes overlaid advertising inserts. It seems that the training data comes, in whole or in part, from cable television.
- Adherence to the prompt is very hit-or-miss. Despite following the prompting guidelines, the best results come from the example prompts provided on the official page.
- The official paper clearly states that the bulk of the training videos are under 4 seconds long, with counts decreasing exponentially out to 30 seconds. In other words, the amount of roughly 10-second data is small, which means generated videos degrade as the duration increases.
- In terms of human anatomy, the best results are clearly achieved on relatively static portraits. Do not expect to make a 10-second video with coherent full-body motion; it is a waste of time.
That is why I intend to try fine-tuning the model with finetrainers.
My goal is to finalize my video corpus, whose overall characteristics are:
- Resolution: 512x896 (portrait) or 896x512 (landscape).
- Duration: 10 seconds for all videos.
- Source: 4K stock footage downscaled with precision and quality.
- Type of video: mainly humans in all possible situations: walking, dancing, yoga, sports, running, jumping, and so on; some close-up videos for details of the eyes and skin; indoor and outdoor footage, including street scenes.
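The "4K stock footage downscaled with precision and quality" point above amounts to a scale-then-center-crop computation. A minimal sketch, assuming 4K UHD sources (3840x2160) and the landscape target from the list (the helper is illustrative, not from any library):

```python
def cover_scale_crop(src_w: int, src_h: int, dst_w: int, dst_h: int):
    """Compute scale-to-cover dimensions and a centred crop window.

    Scaling by the larger of the two ratios preserves the aspect ratio
    and guarantees both target dimensions are covered; the excess is
    then cropped away symmetrically.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = round(src_w * scale)
    scaled_h = round(src_h * scale)
    x_off = (scaled_w - dst_w) // 2
    y_off = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, x_off, y_off


# 4K UHD (3840x2160) down to the landscape target 896x512:
print(cover_scale_crop(3840, 2160, 896, 512))  # (910, 512, 7, 0)
```

With ffmpeg these numbers would map to a filter chain such as `scale=910:512,crop=896:512:7:0`, with a high-quality scaler (e.g. `lanczos`) selected for the downscale.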
My question is rather simple. I have access to hundreds of hours of UHD video, and I am sorting it and handling the prompting with the best current multimodal models.
In your opinion, what is the ideal quantity for this type of fine-tuning? Assuming there is an ideal quantity, of course!
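For scale, a back-of-envelope count of how many 10-second clips such a corpus would yield (the 100-hour figure is a placeholder assumption standing in for "hundreds of hours", not a number from the question):

```python
hours = 100        # placeholder: "hundreds of hours" of source footage
clip_seconds = 10  # fixed duration of every clip in the corpus

clips = hours * 3600 // clip_seconds
print(clips)  # 36000 clips per 100 hours of footage
```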