Implement Durable Video Generation for Multiple Providers (golem:video-generation) #44

Open
@jdegoes

Description

Attached to this issue is a WIT file defining a unified interface for GenAI video generation under the package golem:video-generation. This interface abstracts over real-world provider APIs and is designed to support both current and emerging capabilities—while staying lean, realistic, and portable.

The interface provides a consistent async-first API for multimodal video generation tasks. It supports text-to-video, image-conditioned video generation, and base video continuation workflows, as well as advanced features like prompt enhancement, style transfer, and character consistency (e.g., Kling-style multi-image reference conditioning).

The goal of this ticket is to implement the WIT interface across the following providers:

  • Stable Diffusion (via APIs like Stability AI or Replicate-backed pipelines)
  • Runway (Gen-3 Turbo APIs)
  • Google Veo (via Veo’s async generation and polling model)
  • Kling (Kuaishou’s video generation API with advanced consistency controls)

Each provider implementation must be written in Rust, compiled to a WASM Component (WASI 0.2 only), and integrated with the Golem execution environment, providing the same durability guarantees as implemented in golem-llm.


Deliverables

For each provider, submit the following:

  • A WASM Component named as follows:
    • video-stable-diffusion.wasm
    • video-runway.wasm
    • video-veo.wasm
    • video-kling.wasm
  • A full implementation of the WIT interface, including:
    • generate, poll, and cancel
    • Support for all input variants (text, image, video, audio)
    • Respect for generation-config, including optional fields
    • A video-result return value with consistent metadata population
  • A full test suite using cargo test (see component examples in the Golem repo)
  • Custom durability implemented via the Golem host durability API
  • API credentials configured via environment variables (until wasi-runtime-config is fully supported by Golem)
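
As an illustration of the credential convention above, here is a minimal Rust sketch. The variable names (e.g. KLING_API_KEY) are a suggested convention, not something mandated by this ticket:

```rust
use std::env;

/// Read a provider API key from the environment, failing with a descriptive
/// error when it is missing. Component code would call this during setup
/// and surface the error through the WIT `video-error` variant.
fn load_api_key(var: &str) -> Result<String, String> {
    env::var(var).map_err(|_| format!("missing required environment variable: {var}"))
}

fn main() {
    // In a real component the host sets this; here we just demonstrate both paths.
    match load_api_key("KLING_API_KEY") {
        Ok(key) => println!("credential loaded ({} bytes)", key.len()),
        Err(e) => println!("{e}"),
    }
    assert!(load_api_key("DEFINITELY_MISSING_VAR").is_err());
}
```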

Implementation Notes

  • Use the cargo component toolchain.
  • You may emulate features that are missing in a provider (e.g., treat prompt enhancement as a no-op if not supported).
  • If a provider cannot support a field, return a runtime error using unsupported-feature(...).
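
The two notes above can be sketched together in plain Rust. The types below are hand-written mirrors of a slice of the WIT spec, and the provider capabilities are hypothetical; real implementations would use the bindings generated by cargo component:

```rust
/// Hand-written mirror of the WIT `video-error` variant (illustrative only).
#[derive(Debug, PartialEq)]
enum VideoError {
    UnsupportedFeature(String),
}

/// A slice of `generation-config`, mirroring two optional fields.
struct GenerationConfig {
    enable_audio: Option<bool>,
    enhance_prompt: Option<bool>,
}

/// Reject fields this hypothetical provider cannot honor, and emulate
/// `enhance-prompt` as a no-op, as the notes above allow.
fn validate(config: &GenerationConfig) -> Result<(), VideoError> {
    if config.enable_audio == Some(true) {
        // No audio-track support in this hypothetical provider: hard error.
        return Err(VideoError::UnsupportedFeature(
            "enable-audio is not supported by this provider".to_string(),
        ));
    }
    // enhance-prompt: silently treated as a no-op rather than an error.
    let _ = config.enhance_prompt;
    Ok(())
}

fn main() {
    let bad = GenerationConfig { enable_audio: Some(true), enhance_prompt: None };
    assert!(matches!(validate(&bad), Err(VideoError::UnsupportedFeature(_))));
    let ok = GenerationConfig { enable_audio: None, enhance_prompt: Some(true) };
    assert!(validate(&ok).is_ok());
}
```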

Deviation Policy

If you find that a deviation from the WIT spec is necessary or more ergonomic for a specific provider, you may propose changes. However, deviations must be:

  • Fully justified
  • Reviewed and approved by a core contributor

This API forms the foundation of portable GenAI video agents within the Golem Cloud ecosystem. Your work here will enable agent developers to create high-quality, cross-platform video workflows using a consistent, powerful abstraction.

package golem:video-generation;

/// Core types shared across video generation
interface types {
  /// Errors that may occur during video generation
  variant video-error {
    invalid-input(string),
    unsupported-feature(string),
    quota-exceeded,
    generation-failed(string),
    cancelled,
    internal-error(string),
  }

  /// Input modalities supported
  variant media-input {
    text(string),
    image(reference-image),
    video(base-video),
    audio(narration),
  }

  record reference-image {
    data: media-data,
    role: image-role,
  }

  enum image-role {
    general,
    style,
    character,
    composition,
  }

  record base-video {
    data: media-data,
  }

  record narration {
    data: media-data,
  }

  variant media-data {
    url(string),
    bytes(list<u8>),
  }

  /// Generation configuration
  record generation-config {
    negative-prompt: option<string>,
    seed: option<u64>,
    scheduler: option<string>,
    guidance-scale: option<f32>,
    aspect-ratio: option<aspect-ratio>,
    duration-seconds: option<f32>,
    resolution: option<resolution>,
    enable-audio: option<bool>,
    enhance-prompt: option<bool>,
    character-consistency: option<character-consistency>,
    style-consistency: option<style-consistency>,
    provider-options: list<kv>,
  }

  enum aspect-ratio {
    square,
    portrait,
    landscape,
    cinema,
  }

  enum resolution {
    sd,
    hd,
    fhd,
    uhd,
  }

  record character-consistency {
    reference-images: list<media-data>,
    strength: option<f32>,
  }

  record style-consistency {
    reference-images: list<media-data>,
    strength: option<f32>,
  }

  record kv {
    key: string,
    value: string,
  }

  /// Generated video with metadata
  record video {
    uri: option<string>,
    base64-bytes: option<list<u8>>,
    mime-type: string,
    width: option<u32>,
    height: option<u32>,
    fps: option<f32>,
    duration-seconds: option<f32>,
  }

  /// Job status
  variant job-status {
    pending,
    running,
    succeeded,
    failed(string),
  }

  /// Generation result
  record video-result {
    status: job-status,
    videos: option<list<video>>,
    metadata: option<list<kv>>,
  }
}

/// Core unified interface for sync and async providers
interface video-generation {
  use types.{media-input, generation-config, video-result, video-error};

  /// Submit a generation task; returns an opaque job-id for use with poll and cancel
  generate: func(input: media-input, config: generation-config) -> string;

  /// Poll status and get result if ready
  poll: func(job-id: string) -> result<video-result, video-error>;

  /// Cancel a job if it's running
  cancel: func(job-id: string) -> result<(), video-error>;
}

/// Optional avatar interface
interface avatars {
  use types.{video-error, media-data};

  record avatar {
    id: string,
    name: string,
    preview: option<media-data>,
  }

  /// Generate talking avatar video
  speak: func(
    avatar-id: string,
    text: string,
    voice-id: option<string>,
    background: option<media-data>
  ) -> string;

  list-avatars: func() -> result<list<avatar>, video-error>;

  record voice-info {
    voice-id: string,
    name: string,
    language: string,
    gender: option<string>,
    preview-url: option<string>,
  }

  list-voices: func(language: option<string>) -> result<list<voice-info>, video-error>;
}

/// Optional template interface (no introspection)
interface templates {
  use types.{video-error, kv};

  generate-template: func(
    template-id: string,
    variables: list<kv>
  ) -> string;
}

/// Optional effects interface (limited to real capabilities)
interface effects {
  use types.{video-error, media-data, kv};

  enum effect-type {
    style-transfer,
    extend-video,
    background-replace,
  }

  apply-effect: func(
    input-video: media-data,
    effect: effect-type,
    parameters: list<kv>
  ) -> string;
}

world video-generation-world {
  export video-generation;
  export avatars;
  export templates;
  export effects;
}
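
To illustrate the intended generate/poll lifecycle of the core interface, here is a minimal plain-Rust sketch against a stubbed provider. The stub and all of its names are hypothetical; a real component would call the provider's HTTP API and checkpoint progress through Golem's durability host API:

```rust
/// Hand-written mirror of the WIT `job-status` variant (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed(String),
}

/// Hypothetical provider backend that completes after a fixed number of polls.
struct StubProvider {
    polls_until_done: u32,
    polls: u32,
}

impl StubProvider {
    /// Submit a job; returns an opaque job-id, mirroring `generate`.
    fn generate(&mut self) -> String {
        "job-1".to_string()
    }

    /// Report progress, mirroring `poll`; a real implementation would
    /// query the provider's API and persist state durably between calls.
    fn poll(&mut self, _job_id: &str) -> JobStatus {
        self.polls += 1;
        if self.polls >= self.polls_until_done {
            JobStatus::Succeeded
        } else {
            JobStatus::Running
        }
    }
}

fn main() {
    let mut provider = StubProvider { polls_until_done: 3, polls: 0 };
    let job_id = provider.generate();
    // Caller-side loop: keep polling while the job is pending or running.
    let mut status = provider.poll(&job_id);
    while matches!(status, JobStatus::Pending | JobStatus::Running) {
        status = provider.poll(&job_id);
    }
    assert_eq!(status, JobStatus::Succeeded);
}
```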
