ControlNet
We can extract a control signal from a video, such as edges, depth, or pose, and pass it to Stable Diffusion to guide the video generation.
As a fun test, I passed the depth video of a fox to Stable Diffusion and asked it to generate a video with the text prompt "oil painting of a deer, a high-quality, detailed, and professional photo". The result is a rather foxy deer, which looks like this:
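For reference, here is a minimal sketch of this kind of depth-conditioned generation using the Hugging Face diffusers library, applied to a single frame of the depth video. The checkpoint names and the per-frame approach are illustrative assumptions, not necessarily the exact setup used for the clip above.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Depth-conditioned ControlNet checkpoint (assumed; any SD-1.5-compatible
# depth ControlNet would work the same way).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# One frame of the fox depth video (hypothetical file name).
depth_frame = load_image("fox_depth_frame.png")

# The depth map constrains the layout while the prompt controls the style.
image = pipe(
    "oil painting of a deer, a high-quality, detailed, and professional photo",
    image=depth_frame,
    num_inference_steps=20,
).images[0]
image.save("foxy_deer_frame.png")
```

Running this per frame, with a fixed random seed to reduce flicker between frames, is one simple way to turn a depth video into a stylized video.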
