Implement Durable Video Generation for Multiple Providers (golem:video-generation) #44

Open
@jdegoes

Description

Attached to this issue is a WIT file defining a unified interface for GenAI video generation under the package golem:video-generation. This interface abstracts over real-world provider APIs and is designed to support both current and emerging capabilities—while staying lean, realistic, and portable.

The interface provides a consistent async-first API for multimodal video generation tasks. It supports text-to-video, image-conditioned video generation, and base video continuation workflows, as well as advanced features like prompt enhancement, style transfer, and character consistency (e.g., Kling-style multi-image reference conditioning).

The goal of this ticket is to implement the WIT interface across the following providers:

  • Stable Diffusion (via APIs like Stability AI or Replicate-backed pipelines)
  • Runway (Gen-3 Turbo APIs)
  • Google Veo (via Veo’s async generation and polling model)
  • Kling (Kuaishou’s video generation API with advanced consistency controls)

Each provider implementation must be written in Rust, compiled to a WASM Component (WASI 0.2 only), and integrated with the Golem execution environment, providing the same durability guarantees as implemented in golem-llm.


Deliverables

For each provider, submit the following:

  • A WASM Component named as follows:
    • video-stable-diffusion.wasm
    • video-runway.wasm
    • video-veo.wasm
    • video-kling.wasm
  • A full implementation of the WIT interface, including:
    • generate, poll, and cancel
    • Support for all input variants (text, image, video, audio)
    • Respect for generation-config, including optional fields
    • A video-result return value with consistent metadata population
  • A full test suite using cargo test (see component examples in the Golem repo)
  • Custom durability implemented via the Golem host durability API
  • API credentials configured via environment variables (until wasi-runtime-config is fully supported by Golem)
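
As an illustration of the credential convention above, here is a minimal Rust sketch. The variable names (e.g. KLING_API_KEY) are a suggested convention, not something mandated by this ticket:

```rust
use std::env;

/// Read a provider API key from the environment, failing with a descriptive
/// error when it is missing. Component code would call this during setup
/// and surface the error through the WIT `video-error` variant.
fn load_api_key(var: &str) -> Result<String, String> {
    env::var(var).map_err(|_| format!("missing required environment variable: {var}"))
}

fn main() {
    // In a real component the host sets this; here we just demonstrate both paths.
    match load_api_key("KLING_API_KEY") {
        Ok(key) => println!("credential loaded ({} bytes)", key.len()),
        Err(e) => println!("{e}"),
    }
    assert!(load_api_key("DEFINITELY_MISSING_VAR").is_err());
}
```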

Implementation Notes

  • Use the cargo component toolchain.
  • You may emulate features that are missing in a provider (e.g., treat prompt enhancement as a no-op if not supported).
  • If a provider cannot support a field, return a runtime error using unsupported-feature(...).
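
The two notes above can be sketched together in plain Rust. The types below are hand-written mirrors of a slice of the WIT spec, and the provider capabilities are hypothetical; real implementations would use the bindings generated by cargo component:

```rust
/// Hand-written mirror of the WIT `video-error` variant (illustrative only).
#[derive(Debug, PartialEq)]
enum VideoError {
    UnsupportedFeature(String),
}

/// A slice of `generation-config`, mirroring two optional fields.
struct GenerationConfig {
    enable_audio: Option<bool>,
    enhance_prompt: Option<bool>,
}

/// Reject fields this hypothetical provider cannot honor, and emulate
/// `enhance-prompt` as a no-op, as the notes above allow.
fn validate(config: &GenerationConfig) -> Result<(), VideoError> {
    if config.enable_audio == Some(true) {
        // No audio-track support in this hypothetical provider: hard error.
        return Err(VideoError::UnsupportedFeature(
            "enable-audio is not supported by this provider".to_string(),
        ));
    }
    // enhance-prompt: silently treated as a no-op rather than an error.
    let _ = config.enhance_prompt;
    Ok(())
}

fn main() {
    let bad = GenerationConfig { enable_audio: Some(true), enhance_prompt: None };
    assert!(matches!(validate(&bad), Err(VideoError::UnsupportedFeature(_))));
    let ok = GenerationConfig { enable_audio: None, enhance_prompt: Some(true) };
    assert!(validate(&ok).is_ok());
}
```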

Deviation Policy

If you find that a deviation from the WIT spec is necessary or more ergonomic for a specific provider, you may propose changes. However, deviations must be:

  • Fully justified
  • Reviewed and approved by a core contributor

This API forms the foundation of portable GenAI video agents within the Golem Cloud ecosystem. Your work here will enable agent developers to create high-quality, cross-platform video workflows using a consistent, powerful abstraction.

package golem:video-generation;

/// Core types shared across video generation
interface types {
  /// Errors that may occur during video generation
  variant video-error {
    invalid-input(string),
    unsupported-feature(string),
    quota-exceeded,
    generation-failed(string),
    cancelled,
    internal-error(string),
  }

  /// Input modalities supported
  variant media-input {
    text(string),
    image(reference-image),
    video(base-video),
    audio(narration),
  }

  record reference-image {
    data: media-data,
    role: image-role,
  }

  enum image-role {
    general,
    style,
    character,
    composition,
  }

  record base-video {
    data: media-data,
  }

  record narration {
    data: media-data,
  }

  variant media-data {
    url(string),
    bytes(list<u8>),
  }

  /// Generation configuration
  record generation-config {
    negative-prompt: option<string>,
    seed: option<u64>,
    scheduler: option<string>,
    guidance-scale: option<f32>,
    aspect-ratio: option<aspect-ratio>,
    duration-seconds: option<f32>,
    resolution: option<resolution>,
    enable-audio: option<bool>,
    enhance-prompt: option<bool>,
    character-consistency: option<character-consistency>,
    style-consistency: option<style-consistency>,
    provider-options: list<kv>,
  }

  enum aspect-ratio {
    square,
    portrait,
    landscape,
    cinema,
  }

  enum resolution {
    sd,
    hd,
    fhd,
    uhd,
  }

  record character-consistency {
    reference-images: list<media-data>,
    strength: option<f32>,
  }

  record style-consistency {
    reference-images: list<media-data>,
    strength: option<f32>,
  }

  record kv {
    key: string,
    value: string,
  }

  /// Generated video with metadata
  record video {
    uri: option<string>,
    base64-bytes: option<list<u8>>,
    mime-type: string,
    width: option<u32>,
    height: option<u32>,
    fps: option<f32>,
    duration-seconds: option<f32>,
  }

  /// Job status
  variant job-status {
    pending,
    running,
    succeeded,
    failed(string),
  }

  /// Generation result
  record video-result {
    status: job-status,
    videos: option<list<video>>,
    metadata: option<list<kv>>,
  }
}

/// Core unified interface for sync and async providers
interface video-generation {
  use types.{media-input, generation-config, video-result, video-error};

  /// Submit a generation task; returns an opaque job-id for use with poll and cancel
  generate: func(input: media-input, config: generation-config) -> string;

  /// Poll status and get result if ready
  poll: func(job-id: string) -> result<video-result, video-error>;

  /// Cancel a job if it's running
  cancel: func(job-id: string) -> result<(), video-error>;
}

/// Optional avatar interface
interface avatars {
  use types.{video-error, media-data};

  record avatar {
    id: string,
    name: string,
    preview: option<media-data>,
  }

  /// Generate talking avatar video
  speak: func(
    avatar-id: string,
    text: string,
    voice-id: option<string>,
    background: option<media-data>
  ) -> string;

  list-avatars: func() -> result<list<avatar>, video-error>;

  record voice-info {
    voice-id: string,
    name: string,
    language: string,
    gender: option<string>,
    preview-url: option<string>,
  }

  list-voices: func(language: option<string>) -> result<list<voice-info>, video-error>;
}

/// Optional template interface (no introspection)
interface templates {
  use types.{video-error, kv};

  generate-template: func(
    template-id: string,
    variables: list<kv>
  ) -> string;
}

/// Optional effects interface (limited to real capabilities)
interface effects {
  use types.{video-error, media-data, kv};

  enum effect-type {
    style-transfer,
    extend-video,
    background-replace,
  }

  apply-effect: func(
    input-video: media-data,
    effect: effect-type,
    parameters: list<kv>
  ) -> string;
}

world video-generation-world {
  export video-generation;
  export avatars;
  export templates;
  export effects;
}
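
To illustrate the intended generate/poll lifecycle of the core interface, here is a minimal plain-Rust sketch against a stubbed provider. The stub and all of its names are hypothetical; a real component would call the provider's HTTP API and checkpoint progress through Golem's durability host API:

```rust
/// Hand-written mirror of the WIT `job-status` variant (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed(String),
}

/// Hypothetical provider backend that completes after a fixed number of polls.
struct StubProvider {
    polls_until_done: u32,
    polls: u32,
}

impl StubProvider {
    /// Submit a job; returns an opaque job-id, mirroring `generate`.
    fn generate(&mut self) -> String {
        "job-1".to_string()
    }

    /// Report progress, mirroring `poll`; a real implementation would
    /// query the provider's API and persist state durably between calls.
    fn poll(&mut self, _job_id: &str) -> JobStatus {
        self.polls += 1;
        if self.polls >= self.polls_until_done {
            JobStatus::Succeeded
        } else {
            JobStatus::Running
        }
    }
}

fn main() {
    let mut provider = StubProvider { polls_until_done: 3, polls: 0 };
    let job_id = provider.generate();
    // Caller-side loop: keep polling while the job is pending or running.
    let mut status = provider.poll(&job_id);
    while matches!(status, JobStatus::Pending | JobStatus::Running) {
        status = provider.poll(&job_id);
    }
    assert_eq!(status, JobStatus::Succeeded);
}
```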
