Durable video generation (golem:video-generation) #51


Draft · wants to merge 44 commits into main

Conversation

Nanashi-lab
Contributor

@Nanashi-lab Nanashi-lab commented Jun 27, 2025

/closes #44
/claim #44

Runway Test Video

@Nanashi-lab
Contributor Author

Nanashi-lab commented Jun 27, 2025

@jdegoes Hi, please help me with some clarifications and WIT changes. I have also proposed a revised WIT in the next comment.

Current State of PR (Completed parts)

  • Image-to-Video: All providers support this natively.
  • Text-to-Video: Runway and Stability lack native support, so we do text-to-image generation followed by image-to-video generation.
  • Durability and test component

WIT Changes

Config

enum image-role {
general,
style,
character,
composition,
}

This enum does not align with any of the provider APIs.
Suggested replacement -

  1. [First, Last]: Runway, Veo, and Kling all support specifying whether the image is the first frame or the last frame.

record character-consistency {
reference-images: list<input-image>,
strength: option<f32>,
}

record style-consistency {
reference-images: list<input-image>,
strength: option<f32>,
}

This config is from Runway's text-to-image. Since I am doing text-to-image as part of text-to-video, I can fit this in, but it feels out of place and would sit better in golem:image. Character consistency and style consistency are maintained by default by all providers.
Suggested replacements -

  1. LastFrame (Kling can accept both a first and a last frame)
  2. Multi-image to video (Kling only; this is a separate endpoint, moved to the bottom)
  3. Advanced Kling camera and mask controls (moved to the bottom)

Minor changes -

  • Added model to the config.
  • Added an optional prompt to images; all providers (except Stability) accept a prompt as part of image-to-video.
  • Audio input and video input are not supported for video generation.
  • All generating functions now output result<string, video-error>. This passes the error along much better than storing it internally and using a UUID to pass values.

Avatar

record avatar {
id: string,
name: string,
preview: option<string>,
}

This matches Kling's lip-sync; maybe they supported avatars in the past, but now Kling can do lip-sync on any input video. (Polling returns a failed(face-detection) error if no face is found.)

text: string,
voice-id: option<string>,
background: option
) -> string;

voice-id matches how Kling supports audio: in the speak function it is a choice of either [voice-id, text, speed] or [input audio file], with no background audio for either.


Effects

  1. extend-video - supported by both Veo and Kling

Both style-guide and background removal are image-to-image operations (supported by Runway and Stability).

Suggested replacements -

  1. Separate extend-video into its own function.
  2. Add video upscaling (Runway supports it).

Others

  • Kling supports "video-effects": it takes one or two images plus an effect enum and outputs a video, e.g. two images of people and a "hug" effect to create a video of them hugging.
  • Kling supports multi-image to video (up to 4 images and a prompt). This is different from image-to-video, both by endpoint and by what it does: it uses the 4 images to make a composite and uses that as the starting frame, e.g. an image of a boy, a pegasus, and a castle with the prompt "a boy riding a pegasus in front of a castle".
  • Kling supports advanced camera configs (which cannot be neatly fit into provider options) and also supports masks to decide which parts not to animate (dynamic and static); a rough sketch follows below.
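
For reference, a rough WIT sketch of how those camera and mask controls might be modeled if they are ever added; every field name here is an illustrative guess loosely based on Kling's docs, not part of this proposal:

  // Illustrative only: Kling's camera control exposes several movement axes
  record camera-config {
    horizontal: option<f32>,
    vertical: option<f32>,
    pan: option<f32>,
    tilt: option<f32>,
    roll: option<f32>,
    zoom: option<f32>,
  }

  // Illustrative only: a motion-brush style mask marking a region
  // to animate or to keep static
  record motion-mask {
    mask: input-image,
    animate: bool, // false = keep this region static
  }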

Template

I did not understand this at all, and I could not find any API references. Am I meant to pre-create a template with an already existing prompt/image so it can be used as a test?


I am fairly confident in my proposed changes, as I am familiar with the APIs now that I have implemented text-to-video and image-to-video.

Official documentation:

Kling
Veo
Runway
Stability

@Nanashi-lab Nanashi-lab marked this pull request as draft June 27, 2025 11:25
@Nanashi-lab
Contributor Author

Nanashi-lab commented Jun 27, 2025

This is my proposed WIT; it mirrors the features available from the providers while remaining consistent with the original WIT. It doesn't include Kling's advanced camera and mask options.

package golem:video-generation;

interface types {
  variant video-error {
    invalid-input(string),
    unsupported-feature(string),
    quota-exceeded,
    generation-failed(string),
    cancelled,
    internal-error(string),
  }

  variant media-input {
    text(string),
    image(reference),
  }

// Added prompt
  record reference {
    data: input-image,
    prompt: option<string>,
    role: option<image-role>,
  }

// Changed to first and last
  enum image-role {
    first,
    last,
  }

  record input-image {
    data: media-data,
  }

  record base-video {
    data: media-data,
  }

  record narration {
    data: media-data,
  }

  variant media-data {
    url(string),
    bytes(list<u8>),
  }

  record generation-config {
    negative-prompt: option<string>,
    seed: option<u64>,
    scheduler: option<string>,
    guidance-scale: option<f32>,
    aspect-ratio: option<aspect-ratio>,
    duration-seconds: option<f32>,
    resolution: option<resolution>,
    enable-audio: option<bool>,
    enhance-prompt: option<bool>,
    provider-options: list<kv>,
    /// Added model and lastframe (Kling only)
    model: option<string>,
    lastframe: option<input-image>,
  }

  enum aspect-ratio {
    square,
    portrait,
    landscape,
    cinema,
  }

  enum resolution {
    sd,
    hd,
    fhd,
    uhd,
  }

  record kv {
    key: string,
    value: string,
  }

  record video {
    uri: option<string>,
    base64-bytes: option<list<u8>>,
    mime-type: string,
    width: option<u32>,
    height: option<u32>,
    fps: option<f32>,
    duration-seconds: option<f32>,
  }

  variant job-status {
    pending,
    running,
    succeeded,
    failed(string),
  }

  record video-result {
    status: job-status,
    videos: option<list<video>>,
    metadata: option<list<kv>>,
  }
}

interface video-generation {
  use types.{media-input, generation-config, video-result, video-error};
  
  // Changed output from string to result<string, video-error>
  // for all generate funcs; easier to surface invalid-input
  // and generation errors
  generate: func(input: media-input, config: generation-config) -> result<string, video-error>;
  poll: func(job-id: string) -> result<video-result, video-error>;
  cancel: func(job-id: string) -> result<string, video-error>;
}

interface lip-sync {
  use types.{video-error, media-data, base-video};

// The two possible audio sources: text with a voice-id, or an input audio file.
// WIT variant cases carry a single payload type, so the text case is a record.
  record text-to-speech {
    text: string,
    voice-id: option<string>,
    speed: u32,
  }

  variant audio-source {
    from-text(text-to-speech),
    from-audio(media-data),
  }

  generate: func(
    input: base-video,
    audio: audio-source,
  ) -> result<string, video-error>;

  record voice-info {
    voice-id: string,
    name: string,
    language: string,
    gender: option<string>,
    preview-url: option<string>,
  }

  list-voices: func(language: option<string>) -> result<list<voice-info>, video-error>;
}

interface advanced {
    use types.{video-error, base-video, input-image, generation-config};

    // Supported by Kling and Veo
    extend-video: func(
        input: base-video,
        prompt: option<string>,
        duration: option<f32>,
    ) -> result<string, video-error>;

    // Supported by Runway
    upscale-video: func(
        input: base-video,
    ) -> result<string, video-error>;

    // Supported by Kling only
    video-effects: func(
        input: input-image,
        second-image: option<input-image>,
        effect: string,
    ) -> result<string, video-error>;
    
    // Multi-image generation, Kling only
    multi-image-generation: func(
        input: input-image,
        other-images: list<input-image>, // Up to 3 more (4 images max)
        config: generation-config,
    ) -> result<string, video-error>;
}

// I have left this as-is; I would like clarification on it
// I also don't get why there is no introspection
interface templates {
  use types.{video-error, kv};
  generate-from-template: func(
    template-id: string,
    variables: list<kv>
  ) -> string;
}

world video-generation {
  import types;
  import video-generation;
  import lip-sync;
  import advanced;
  import templates;

  export api: video-generation;
  export lip-sync;
  export template-videos: templates;
  export video-effects: advanced;
}

@jdegoes
Contributor

jdegoes commented Jun 27, 2025

@Nanashi-lab

I did not spend much time on this WIT so I am glad you took a closer look.

I like your proposed revisions and would suggest a few more:

  • Delete more strings, e.g. voice-info.gender, video-effect(effect: string). You can use enum or variant to encode the information much more precisely, in a way that is not "stringly-typed".
  • Instead of having job-id (a pattern I used earlier), use a resource for the job so the user doesn't have to pass stringly-typed information around (see the sketch after this list).
  • Delete templates; it seems useless to me, and the same thing can be done in user-land.
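
A minimal sketch of the resource-based job, plus an enum in place of the effect string; the names video-job and effect-kind (and the effect cases) are illustrative, not final:

  // Illustrative sketch: the job as a resource, so callers hold a handle
  // instead of passing a stringly-typed job-id around
  resource video-job {
    poll: func() -> result<video-result, video-error>;
    cancel: func() -> result<_, video-error>;
  }

  generate: func(input: media-input, config: generation-config) -> result<video-job, video-error>;

  // Illustrative sketch: effects as an enum instead of a raw string;
  // the exact case list would come from Kling's effect scenes
  enum effect-kind {
    hug,
    kiss,
    squish,
    expansion,
  }

A resource also gives the durability bookkeeping a natural home: the handle can carry the provider's job id internally, so it never leaks into user code as a string.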

Successfully merging this pull request may close these issues.

Implement Durable Video Generation for Multiple Providers (golem:video-generation)