Description
Attached to this issue is a WIT file defining a unified interface for GenAI video generation under the package `golem:video-generation`. This interface abstracts over real-world provider APIs and is designed to support both current and emerging capabilities while staying lean, realistic, and portable.
The interface provides a consistent async-first API for multimodal video generation tasks. It supports text-to-video, image-conditioned video generation, and base video continuation workflows, as well as advanced features like prompt enhancement, style transfer, and character consistency (e.g., Kling-style multi-image reference conditioning).
The goal of this ticket is to implement the WIT interface across the following providers:
- Stable Diffusion (via APIs like Stability AI or Replicate-backed pipelines)
- Runway (Gen-3 Turbo APIs)
- Google Veo (via Veo’s async generation and polling model)
- Kling (Kuaishou’s video generation API with advanced consistency controls)
Each provider implementation must be written in Rust, compiled to a WASM Component (WASI 0.2 only), and integrate with the Golem execution environment, providing durability as implemented in `golem-llm`.
Deliverables
For each provider, submit the following:
- A WASM Component named as follows:
  - `video-stable-diffusion.wasm`
  - `video-runway.wasm`
  - `video-veo.wasm`
  - `video-kling.wasm`
- Implement the full WIT interface, including `generate`, `poll`, and `cancel`
- Support the input variants (`text`, `image`, `video`, `audio`)
- Respect `generation-config`, including optional fields
- Return `video-result` with consistent metadata population
- Include a full test suite using `cargo test` (see component examples in the Golem repo)
- Implement custom durability via the Golem host durability API
- Configure API credentials using environment variables (until `wasi-runtime-config` is fully supported by Golem)
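Credential loading from environment variables can stay very small. The following is a minimal sketch; the variable name `STABILITY_API_KEY` is purely illustrative, and each provider component would document its own variable:

```rust
use std::env;

/// Reads a provider credential from the environment, with a clear error
/// when it is absent. The variable name passed in is provider-specific;
/// `STABILITY_API_KEY` below is only an example.
fn load_api_key(var: &str) -> Result<String, String> {
    env::var(var).map_err(|_| format!("missing required environment variable: {var}"))
}

fn main() {
    match load_api_key("STABILITY_API_KEY") {
        Ok(key) => println!("loaded credential ({} bytes)", key.len()),
        Err(e) => eprintln!("{e}"),
    }
}
```

Failing fast with an explicit variable name in the error message makes misconfigured deployments easy to diagnose before any provider request is made.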
Implementation Notes
- Use the `cargo component` toolchain.
- You may emulate features that are missing in a provider (e.g., treat prompt enhancement as a no-op if not supported).
- If a provider cannot support a field, return a runtime error using `unsupported-feature(...)`.
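To make the two notes above concrete, here is a hand-written sketch of config validation. The types are local stand-ins for the WIT-generated bindings (a subset of `video-error` and `generation-config`), and the specific provider limitations are hypothetical:

```rust
/// Local mirror (subset) of the WIT `video-error` variant; a real
/// component would use the type from the generated bindings.
#[derive(Debug, PartialEq)]
enum VideoError {
    UnsupportedFeature(String),
}

/// Hypothetical subset of the WIT `generation-config` record.
struct GenerationConfig {
    enable_audio: Option<bool>,
    enhance_prompt: Option<bool>,
}

/// Per the implementation notes: reject what the provider cannot do at
/// all (`enable-audio` here, as an assumed limitation) and emulate what
/// it can safely ignore (`enhance-prompt` becomes a no-op).
fn validate_config(config: &GenerationConfig) -> Result<(), VideoError> {
    if config.enable_audio == Some(true) {
        return Err(VideoError::UnsupportedFeature(
            "enable-audio is not supported by this provider".to_string(),
        ));
    }
    // enhance-prompt is emulated as a no-op, so any value is accepted.
    let _ = config.enhance_prompt;
    Ok(())
}

fn main() {
    let rejected = validate_config(&GenerationConfig {
        enable_audio: Some(true),
        enhance_prompt: None,
    });
    println!("{rejected:?}");
}
```

Validating the whole config up front, before contacting the provider, keeps `unsupported-feature` errors deterministic rather than dependent on provider-side failures.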
Deviation Policy
If you find that a deviation from the WIT spec is necessary or more ergonomic for a specific provider, you may propose changes. However, deviations must be:
- Fully justified
- Reviewed and approved by a core contributor
This API forms the foundation of portable GenAI video agents within the Golem Cloud ecosystem. Your work here will enable agent developers to create high-quality, cross-platform video workflows using a consistent, powerful abstraction.
package golem:video-generation;
/// Core types shared across video generation
interface types {
  /// Errors that may occur during video generation
  variant video-error {
    invalid-input(string),
    unsupported-feature(string),
    quota-exceeded,
    generation-failed(string),
    cancelled,
    internal-error(string),
  }

  /// Input modalities supported
  variant media-input {
    text(string),
    image(reference-image),
    video(base-video),
    audio(narration),
  }

  record reference-image {
    data: media-data,
    role: image-role,
  }

  enum image-role {
    general,
    style,
    character,
    composition,
  }

  record base-video {
    data: media-data,
  }

  record narration {
    data: media-data,
  }

  variant media-data {
    url(string),
    bytes(list<u8>),
  }

  /// Generation configuration
  record generation-config {
    negative-prompt: option<string>,
    seed: option<u64>,
    scheduler: option<string>,
    guidance-scale: option<f32>,
    aspect-ratio: option<aspect-ratio>,
    duration-seconds: option<f32>,
    resolution: option<resolution>,
    enable-audio: option<bool>,
    enhance-prompt: option<bool>,
    character-consistency: option<character-consistency>,
    style-consistency: option<style-consistency>,
    provider-options: list<kv>,
  }

  enum aspect-ratio {
    square,
    portrait,
    landscape,
    cinema,
  }

  enum resolution {
    sd,
    hd,
    fhd,
    uhd,
  }

  record character-consistency {
    reference-images: list<media-data>,
    strength: option<f32>,
  }

  record style-consistency {
    reference-images: list<media-data>,
    strength: option<f32>,
  }

  record kv {
    key: string,
    value: string,
  }

  /// Generated video with metadata
  record video {
    uri: option<string>,
    base64-bytes: option<list<u8>>,
    mime-type: string,
    width: option<u32>,
    height: option<u32>,
    fps: option<f32>,
    duration-seconds: option<f32>,
  }

  /// Job status
  variant job-status {
    pending,
    running,
    succeeded,
    failed(string),
  }

  /// Generation result
  record video-result {
    status: job-status,
    videos: option<list<video>>,
    metadata: option<list<kv>>,
  }
}
/// Core unified interface for sync and async providers
interface video-generation {
  use types.{media-input, generation-config, video-result, video-error};

  /// Submit a generation task; returns the job id
  generate: func(input: media-input, config: generation-config) -> string;

  /// Poll status and get the result if ready
  poll: func(job-id: string) -> result<video-result, video-error>;

  /// Cancel a job if it is running
  cancel: func(job-id: string) -> result<_, video-error>;
}
/// Optional avatar interface
interface avatars {
  use types.{video-error, media-data};

  record avatar {
    id: string,
    name: string,
    preview: option<media-data>,
  }

  /// Generate a talking-avatar video; returns the job id
  speak: func(
    avatar-id: string,
    text: string,
    voice-id: option<string>,
    background: option<media-data>
  ) -> string;

  list-avatars: func() -> result<list<avatar>, video-error>;

  record voice-info {
    voice-id: string,
    name: string,
    language: string,
    gender: option<string>,
    preview-url: option<string>,
  }

  list-voices: func(language: option<string>) -> result<list<voice-info>, video-error>;
}
/// Optional template interface (no introspection)
interface templates {
  use types.{video-error, kv};

  generate-template: func(
    template-id: string,
    variables: list<kv>
  ) -> string;
}
/// Optional effects interface (limited to real capabilities)
interface effects {
  use types.{video-error, media-data, kv};

  enum effect-type {
    style-transfer,
    extend-video,
    background-replace,
  }

  apply-effect: func(
    input-video: media-data,
    effect: effect-type,
    parameters: list<kv>
  ) -> string;
}
/// Provider world exporting the unified API
world video-generation-provider {
  export video-generation;
  export avatars;
  export templates;
  export effects;
}
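To make the `generate`/`poll`/`cancel` lifecycle concrete, here is a minimal, self-contained Rust sketch of the job state machine. The types below are hand-written stand-ins for the WIT-generated bindings, and the in-memory `HashMap` stands in for the state a real component would persist via the Golem host durability API:

```rust
use std::collections::HashMap;

// Local mirrors of the WIT types, for illustration only; a real
// component would use the `cargo component` generated bindings.
#[derive(Clone, Debug, PartialEq)]
#[allow(dead_code)]
enum JobStatus { Pending, Running, Succeeded, Failed(String) }

#[derive(Clone, Debug)]
struct VideoResult { status: JobStatus }

#[derive(Debug)]
enum VideoError { GenerationFailed(String) }

/// In-memory job store illustrating the submit/poll/cancel lifecycle.
struct JobStore {
    next_id: u64,
    jobs: HashMap<String, VideoResult>,
}

impl JobStore {
    fn new() -> Self { Self { next_id: 0, jobs: HashMap::new() } }

    /// `generate`: register the task and hand back its job id;
    /// the actual provider request would be kicked off here.
    fn generate(&mut self, _prompt: &str) -> String {
        self.next_id += 1;
        let id = format!("job-{}", self.next_id);
        self.jobs.insert(id.clone(), VideoResult { status: JobStatus::Pending });
        id
    }

    /// `poll`: report the current status for a job id.
    fn poll(&self, job_id: &str) -> Result<VideoResult, VideoError> {
        self.jobs
            .get(job_id)
            .cloned()
            .ok_or_else(|| VideoError::GenerationFailed(format!("unknown job: {job_id}")))
    }

    /// `cancel`: stop a pending or running job; finished jobs are untouched.
    fn cancel(&mut self, job_id: &str) -> Result<(), VideoError> {
        match self.jobs.get_mut(job_id) {
            Some(job) if matches!(job.status, JobStatus::Pending | JobStatus::Running) => {
                job.status = JobStatus::Failed("cancelled".to_string());
                Ok(())
            }
            Some(_) => Ok(()),
            None => Err(VideoError::GenerationFailed(format!("unknown job: {job_id}"))),
        }
    }
}

fn main() {
    let mut store = JobStore::new();
    let id = store.generate("a cat surfing at sunset");
    println!("submitted {id}: {:?}", store.poll(&id).map(|r| r.status));
}
```

Keeping all job-id bookkeeping behind one store type makes it straightforward to swap the `HashMap` for durable host state without touching the exported function signatures.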