Welcome to the Fal Private Serverless model hosting demos! This collection of examples is designed to guide you through the process of deploying various types of machine learning models on the fal platform. Each demo showcases different features and best practices for creating robust, scalable, and user-friendly model APIs. We will explore image generation, 3D object creation, text-to-speech synthesis, text-to-video generation, and music generation.
Before diving into specific demos, let's touch upon some core concepts you'll see repeatedly:
- `fal.App`: This is the fundamental class you'll use to define your app. It encapsulates your model, its dependencies, and the inference logic. You configure aspects like `name`, `machine_type`, `requirements`, `keep_alive`, `min_concurrency`, and `max_concurrency` directly on this class (a minimal skeleton tying these pieces together appears after this list).
- Input and Output Models: Using `pydantic.BaseModel`, you define structured inputs and outputs for your API. The `Field` object allows you to provide rich metadata like titles, descriptions, examples, and validation rules (e.g., `ge` and `le` for numerical ranges, `Literal` for specific choices). This metadata is automatically used to generate an interactive playground and OpenAPI specification for your model.
- `setup()` Method: This special method within your `fal.App` class is called once when a new worker instance starts. It's the ideal place to load your models, download any necessary assets, and perform one-time initializations. This ensures that these heavy operations don't slow down individual inference requests.
- `@fal.endpoint()` Decorator: This decorator exposes methods of your `fal.App` class as HTTP endpoints. You can define multiple endpoints within a single app, each potentially serving a different variant or functionality of your model.
- Requirements: You specify your Python package dependencies in a `requirements` list within the `fal.App` class. It's crucial to pin package versions (e.g., `torch==2.6.0`), and even commit hashes for dependencies installed from Git repositories, to ensure consistent and reproducible environments.
- Machine Types: fal offers various machine types (e.g., `GPU-H100`, or `M` for CPU-medium). You select the appropriate `machine_type` in your `fal.App` configuration based on your model's computational needs.
- Billing: For many models, especially generative ones, you might want to implement custom billing logic. This can often be done by setting the `x-fal-billable-units` header on the `fastapi.Response` object. The actual cost per unit is configured in your fal dashboard.
- Error Handling: fal provides `FieldException` for input validation errors that are nicely displayed in the UI, and you can use FastAPI's `HTTPException` for other types of errors.
- `fal.toolkit`: This toolkit provides helpful utilities for common tasks, such as image handling (`Image`, `ImageSizeInput`), file operations (`File`), and more.
- Secrets Management: For sensitive information like API keys, fal allows you to set secrets (e.g., via `fal secrets set MY_SECRET_KEY <value>`), which can then be accessed as environment variables within your app.
- Custom Docker Images: For complex dependencies that go beyond Python packages (e.g., system libraries installed via `apt-get`), fal supports deploying applications using custom Docker images.
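To make these concepts concrete, here is a minimal, hypothetical sketch of how they typically fit together. The class name, model ID, and package versions are illustrative placeholders rather than values from any specific demo, and the exact way configuration is passed (class keywords vs. attributes) should be checked against the demos and the official fal documentation.

```python
import fal
from fal.toolkit import Image
from pydantic import BaseModel, Field


class Input(BaseModel):
    # Field metadata feeds the auto-generated playground and OpenAPI spec.
    prompt: str = Field(description="Text prompt to generate an image from.")
    num_inference_steps: int = Field(default=25, ge=1, le=50)


class Output(BaseModel):
    image: Image


class MyApp(fal.App, keep_alive=300, name="my-app"):
    machine_type = "GPU-H100"
    requirements = [
        "torch==2.6.0",       # pin versions for reproducible environments
        "diffusers==0.32.0",  # illustrative version only
    ]

    def setup(self):
        # Runs once per worker start: load heavy models here, not per request.
        import torch
        from diffusers import DiffusionPipeline

        self.pipe = DiffusionPipeline.from_pretrained(
            "some-org/some-model",  # placeholder model ID
            torch_dtype=torch.bfloat16,
        ).to("cuda")

    @fal.endpoint("/")
    def generate(self, request: Input) -> Output:
        result = self.pipe(
            request.prompt, num_inference_steps=request.num_inference_steps
        )
        return Output(image=Image.from_pil(result.images[0]))
```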
Now, let's explore each demo in detail.
The `fal_demos/image/sana.py` demo showcases how to host a text-to-image generation model, specifically using the Sana and SanaSprint pipelines. It demonstrates robust input/output definitions, model loading, safety checking, and custom billing.
Overview: This application provides two endpoints for generating images from text prompts: a standard quality endpoint and a "sprint" endpoint optimized for faster generation with fewer inference steps.
Key fal Concepts Demonstrated:
The definition of input data structures is a crucial first step. The `BaseInput` class, found at `fal_demos/image/sana.py#L13`, serves as a common foundation for image generation parameters. Each field, like `prompt` (`fal_demos/image/sana.py#L15`) or `num_inference_steps` (`fal_demos/image/sana.py#L31`), is defined using `pydantic.Field`, complete with titles, descriptions, and illustrative examples. This significantly enhances the usability of the auto-generated API playground. For instance, the `image_size` field (`fal_demos/image/sana.py#L26`) uses `ImageSizeInput` from `fal.toolkit.image`, which is then converted to an `ImageSize` object internally.
The application defines specific input models for its two endpoints: `TextToImageInput` (`fal_demos/image/sana.py#L65`) for the standard generation and `SprintInput` (`fal_demos/image/sana.py#L69`) for the faster "sprint" version. The `SprintInput` class inherits from `BaseInput` but overrides `num_inference_steps` with a different default and validation range suitable for quicker inference.
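A hedged sketch of that override pattern (the field names follow the prose above, but the defaults and bounds are illustrative, not the demo's exact values):

```python
from pydantic import BaseModel, Field


class BaseInput(BaseModel):
    prompt: str = Field(description="The prompt to generate an image from.")
    # Default and bounds are illustrative, not the demo's exact values.
    num_inference_steps: int = Field(default=18, ge=1, le=50)


class SprintInput(BaseInput):
    # The sprint variant re-declares the field with a smaller default and a
    # tighter validation range suited to fast generation.
    num_inference_steps: int = Field(default=2, ge=1, le=4)
```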
Similarly, output structures are defined with `SanaOutput` (`fal_demos/image/sana.py#L78`) and `SanaSprintOutput` (`fal_demos/image/sana.py#L95`). These classes specify the format of the response, including a list of generated `Image` objects (also from `fal.toolkit.image`) and other metadata such as the `seed` used.
The core of the application is the `Sana` class (`fal_demos/image/sana.py#L111`), which inherits from `fal.App`. Here, serverless execution parameters are configured: `keep_alive=600` keeps a worker warm for 10 minutes after the last request, `min_concurrency=0` allows the app to scale down to zero workers, `max_concurrency=10` limits the number of concurrent workers, and `name="sana"` defines the application's name, which determines its endpoint URL.
Dependencies are explicitly listed in the `requirements` attribute (`fal_demos/image/sana.py#L119`), including `torch`, `diffusers` (pinned to a specific commit hash for stability), and other necessary packages. The `local_python_modules` attribute (`fal_demos/image/sana.py#L126`) ensures that local modules such as `fal_demos` are correctly packaged. The application specifies `machine_type = "GPU-H100"` (`fal_demos/image/sana.py#L128`) to leverage powerful GPU acceleration.
The model loading logic resides in the `setup()` method (`fal_demos/image/sana.py#L129`). Inside this method, `SanaPipeline` and `SanaSprintPipeline` are imported and initialized. Notably, the text encoder from the base `SanaPipeline` is reused for the `SanaSprintPipeline` (`fal_demos/image/sana.py#L142`) to optimize memory usage and loading time. The initialized pipelines are stored in `self.pipes` for access during inference.
A private helper method, `_generate()` (`fal_demos/image/sana.py#L148`), encapsulates the common image generation logic for both endpoints. This method handles input preprocessing, seed generation, model invocation, and postprocessing. An important aspect demonstrated here is the safety-checker integration using `postprocess_images` from `fal.toolkit.image.safety_checker` (`fal_demos/image/sana.py#L176`).
Custom billing is implemented by calculating a `resolution_factor` from the image dimensions and setting the `x-fal-billable-units` header on the `fastapi.Response` object (`fal_demos/image/sana.py#L181-L186`).
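The pattern looks roughly like this; the exact resolution formula lives in the demo, so treat this as a simplified sketch.

```python
import math

from fastapi import Response


def set_billable_units(response: Response, width: int, height: int) -> None:
    # Bill in proportion to output resolution, with 1024x1024 as one unit.
    # The formula is illustrative; see the demo for the real calculation.
    resolution_factor = math.ceil((width * height) / (1024 * 1024))
    response.headers["x-fal-billable-units"] = str(max(resolution_factor, 1))
```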
Finally, two public endpoints are defined: `generate` at the root path (`fal_demos/image/sana.py#L198`) and `generate_sprint` at the `/sprint` path (`fal_demos/image/sana.py#L207`). Both endpoints call the `_generate` method, passing the appropriate input model and model identifier.
The comments at the end of the file (`fal_demos/image/sana.py#L214-L219`) explain how to run the app with `fal run` and give example URLs for testing the deployed endpoints.
The `fal_demos/image/hunyuan3d.py` demo illustrates a different pattern: hosting a "proxy" application that, in turn, calls another pre-existing fal function (in this case, `fal-ai/hunyuan3d`) to perform the actual 3D model generation. This is useful for adding custom preprocessing, postprocessing, or simplified interfaces around existing models.
Overview: This application takes an input image URL and parameters, submits a job to an internal Hunyuan3D model on Fal, and then streams back the results, including the generated 3D object file.
Key fal Concepts Demonstrated:
The input parameters for the 3D generation are defined in the `Hunyuan3DInput` model (`fal_demos/image/hunyuan3d.py#L9`), including the `input_image_url`, whether to generate a `textured_mesh`, and other model-specific settings. The `ObjectOutput` model (`fal_demos/image/hunyuan3d.py#L49`) describes the expected output, primarily a `File` object representing the 3D mesh.
The `Hunyuan3D` class (`fal_demos/image/hunyuan3d.py#L65`) is configured with `max_multiplexing=10`. This setting allows a single worker instance to handle up to 10 concurrent requests by interleaving their processing, which is particularly efficient for I/O-bound tasks or when waiting on other services.
A key feature demonstrated is the use of fal secrets. The `setup()` method (`fal_demos/image/hunyuan3d.py#L72`) retrieves a secret named `MY_SECRET_KEY` from the environment variables and sets it as `FAL_KEY`. This is a common pattern for authenticating with other services or fal functions. A comment in the file shows how to set this secret using the `fal secrets set` command.
The main logic is within the `generate_image` endpoint (`fal_demos/image/hunyuan3d.py#L77`). This function uses `fal_client.submit_async` (`fal_demos/image/hunyuan3d.py#L80`) to make an asynchronous call to the `fal-ai/hunyuan3d/v2/mini/turbo` function, passing along the user-provided arguments.
The application then iterates over events from the submitted job using `handle.iter_events` (`fal_demos/image/hunyuan3d.py#L90`). This allows for real-time feedback, printing status updates such as "Queued" or "InProgress" along with any logs.
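A simplified sketch of that submit-and-stream pattern with `fal_client` (the argument names and target endpoint ID mirror the prose above; treat the details as illustrative rather than the demo's exact code):

```python
import fal_client


async def run_hunyuan3d(input_image_url: str, textured_mesh: bool) -> dict:
    # Submit the job asynchronously to the upstream fal function.
    handle = await fal_client.submit_async(
        "fal-ai/hunyuan3d/v2/mini/turbo",
        arguments={
            "input_image_url": input_image_url,
            "textured_mesh": textured_mesh,
        },
    )

    # Stream queue/progress events (and logs) while the job runs.
    async for event in handle.iter_events(with_logs=True):
        print(event)

    # Block until the final result is available and return it.
    return await handle.get()
```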
Comprehensive error handling for the fal client interaction is shown in the `try`/`except` block (`fal_demos/image/hunyuan3d.py#L102-L120`). It catches `FieldException` (for user input errors), `HTTPStatusError` (for network issues during the call), and `fal_client.client.FalClientError` (for other errors from the fal platform). User errors are re-raised as an `HTTPException` with a 400 status code and server errors with a 500 status code, providing meaningful error messages to the client.
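The error-mapping pattern looks roughly like this; the exception classes are the ones named above, but the import path for `FieldException` and the exact status-code mapping are assumptions, so check the demo's actual code.

```python
import fal_client
from fastapi import HTTPException
from httpx import HTTPStatusError

# The import path for FieldException is an assumption; check the demo's imports.
from fal.exceptions import FieldException


async def get_result_or_http_error(handle):
    try:
        return await handle.get()
    except FieldException:
        # Input-validation problems are already user-friendly; re-raise as-is.
        raise
    except HTTPStatusError as exc:
        # Treat upstream 4xx responses as a client problem, everything else as server-side.
        status = 400 if exc.response.status_code < 500 else 500
        raise HTTPException(status_code=status, detail=str(exc))
    except fal_client.client.FalClientError as exc:
        raise HTTPException(status_code=500, detail=f"fal platform error: {exc}")
```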
The `fal_demos/tts/kokoro.py` demo focuses on hosting a text-to-speech (TTS) model, Kokoro, which supports multiple languages and voices. It demonstrates how to manage different input schemas for various language options and how to load multiple model variants.
Overview: This application provides several TTS endpoints, each tailored for a specific language or accent (American English, British English, Japanese). Users provide text and select a voice, and the app generates an audio file.
Key fal Concepts Demonstrated:
To cater to different languages and their specific voices, multiple Pydantic input models are defined: `AmEnglishRequest` (`fal_demos/tts/kokoro.py#L9`), `BrEnglishRequest` (`fal_demos/tts/kokoro.py#L48`), and `JapaneseRequest` (`fal_demos/tts/kokoro.py#L71`). A notable feature here is the use of `typing.Literal` for the `voice` field in each request model (e.g., `fal_demos/tts/kokoro.py#L21`). This restricts the input to a predefined set of valid voice IDs, giving the user clear options and simplifying validation. Corresponding output models (`AmEngOutput` (`fal_demos/tts/kokoro.py#L92`), `BrEngOutput`, `JapaneseOutput`) define the structure of the response, which includes an `audio` field of type `fal.toolkit.File`.
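A hedged sketch of such a `Literal`-restricted field (the voice IDs shown are placeholders, not the demo's actual voice list):

```python
from typing import Literal

from pydantic import BaseModel, Field


class AmEnglishRequest(BaseModel):
    prompt: str = Field(description="Text to synthesize.")
    # Literal restricts the field to a fixed set of voice IDs; these names
    # are placeholders, not the real Kokoro voices.
    voice: Literal["af_voice_a", "af_voice_b", "am_voice_c"] = Field(
        default="af_voice_a", description="Voice to use for synthesis."
    )
```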
The `Kokoro` application class is defined at `fal_demos/tts/kokoro.py#L125`. The `requirements` list (`fal_demos/tts/kokoro.py#L132`) includes the `kokoro` library and its language-specific components such as `misaki[en]` and `misaki[ja]`, as well as a spaCy model downloaded from a URL. The `machine_type` is set to `"L"` (CPU-large) (`fal_demos/tts/kokoro.py#L141`), since Kokoro is relatively lightweight and runs efficiently on CPU.
In the `setup()` method (`fal_demos/tts/kokoro.py#L143`), instances of `kokoro.KPipeline` are created for each supported language (American English, British English, and Japanese) and stored in a dictionary, `self.pipelines`. This pre-loads the necessary models for each language variant.
The shared TTS generation logic is in the `_generate()` method (`fal_demos/tts/kokoro.py#L149`). This method first checks whether the input prompt exceeds a length limit (20,000 characters) and raises a `FieldException` if it does (`fal_demos/tts/kokoro.py#L154-L158`), providing user-friendly feedback. It then uses the appropriate pipeline from `self.pipelines` to generate the audio. The generated audio, a NumPy array, is concatenated if produced in chunks.
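A sketch of that guard; the constructor arguments of `FieldException` and the error message shown here are assumptions, so verify against the demo and the fal documentation.

```python
from fal.exceptions import FieldException  # import path is an assumption

MAX_PROMPT_LENGTH = 20_000


def validate_prompt(prompt: str) -> None:
    # Reject overly long prompts with an error tied to the `prompt` field,
    # so the playground can highlight the offending input.
    if len(prompt) > MAX_PROMPT_LENGTH:
        raise FieldException(
            field="prompt",
            message=f"Prompt must be at most {MAX_PROMPT_LENGTH} characters.",
        )
```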
Custom billing is implemented by setting the `x-fal-billable-units` header based on the length of the input prompt, with a minimum of 1 unit and scaling per 1,000 characters (`fal_demos/tts/kokoro.py#L173`).
The generated audio is saved to a temporary WAV file using `soundfile`'s `sf.write`. This temporary file is then returned as a `fal.toolkit.File` object, specifying `content_type="audio/wav"` and `repository="cdn"` so it is served efficiently via fal's CDN (`fal_demos/tts/kokoro.py#L176-L180`).
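The save-and-return step looks roughly like this (a sketch: `File.from_path` and the `repository="cdn"` option follow the prose above, but the sample rate is a placeholder and the signatures should be double-checked against the demo):

```python
import tempfile

import numpy as np
import soundfile as sf
from fal.toolkit import File


def audio_to_file(audio: np.ndarray, sample_rate: int = 24_000) -> File:
    # Write the concatenated waveform to a temporary WAV file...
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        sf.write(tmp.name, audio, sample_rate)
        # ...then hand it to fal's CDN-backed file hosting.
        return File.from_path(tmp.name, content_type="audio/wav", repository="cdn")
```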
Multiple endpoints are defined to serve the different language versions: `/` and `/american-english` (`fal_demos/tts/kokoro.py#L182`, `fal_demos/tts/kokoro.py#L188`) for American English, `/british-english` (`fal_demos/tts/kokoro.py#L194`) for British English, and `/japanese` (`fal_demos/tts/kokoro.py#L200`) for Japanese. Each calls the `_generate` method with the appropriate language identifier.
The `fal_demos/video/wan.py` demo tackles a more complex scenario: hosting the Wan text-to-video model. This involves cloning a Git repository, managing model weights, and integrating custom safety and prompt-enhancement features by calling other fal functions.
Overview: This application allows users to generate short video clips from text prompts using the Wan 2.1 model (1.3B version). It includes optional safety checks and prompt expansion features.
Key fal Concepts Demonstrated:
The `WanT2VRequest` input model (`fal_demos/video/wan.py#L10`) defines parameters such as `prompt`, `negative_prompt`, and `aspect_ratio`, plus flags for `enable_safety_checker` and `enable_prompt_expansion`. The `WanT2VResponse` model (`fal_demos/video/wan.py#L67`) specifies that the output will be a `video` (as a `File` object) and the `seed` used.
The `Wan` application class is defined at `fal_demos/video/wan.py#L79`. Its `requirements` list (`fal_demos/video/wan.py#L86`) is extensive, including specific versions of `torch`, `diffusers`, and other libraries, as well as a FlashAttention wheel installed directly from a GitHub releases URL. This highlights fal's flexibility in handling diverse dependency sources.
The `setup()` method (`fal_demos/video/wan.py#L100`) is particularly involved. It begins by cloning the Wan 2.1 Git repository using `fal.toolkit.clone_repository` (`fal_demos/video/wan.py#L107`). This function clones the repo to a temporary directory (specified as `/tmp/wan-t2v`), pins it to a specific `commit_hash` for reproducibility, and adds the repository's path to `sys.path` (`include_to_path=True`). The current working directory is then changed to the cloned repository's path (`fal_demos/video/wan.py#L114`) because the Wan library scripts expect to be run from their root directory.
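In outline, that setup step looks like the sketch below. The repository URL, the commit hash placeholder, and the keyword-argument names of `clone_repository` are assumptions here; check the demo for the exact call.

```python
import os

from fal.toolkit import clone_repository

# Clone the Wan 2.1 repo, pin it to a known commit, and put it on sys.path.
repo_path = "/tmp/wan-t2v"
clone_repository(
    "https://github.com/Wan-Video/Wan2.1.git",  # assumed repository URL
    commit_hash="<pinned-commit-hash>",          # placeholder, not the real hash
    target_dir=repo_path,                        # assumed parameter name
    include_to_path=True,
)

# The Wan scripts expect to run from the repository root.
os.chdir(repo_path)
```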
Next, model weights are downloaded from the Hugging Face Hub using `huggingface_hub.snapshot_download` (`fal_demos/video/wan.py#L118`). The weights are placed in a directory managed by fal (`FAL_MODEL_WEIGHTS_DIR`) so they can be cached across worker restarts, provided `local_dir_use_symlinks=True` takes effect and the underlying storage persists. Finally, the `WanT2V` model object is initialized using configurations and checkpoints from the cloned repository (`fal_demos/video/wan.py#L129`).
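Roughly, the download step can be sketched as follows; the Hugging Face repo ID is an assumption, and whether `FAL_MODEL_WEIGHTS_DIR` is exposed as a toolkit constant or an environment variable should be checked against the demo.

```python
import os
from pathlib import Path

from huggingface_hub import snapshot_download

# FAL_MODEL_WEIGHTS_DIR points at fal's cached weights location; how it is
# imported/read here is an assumption, not the demo's exact code.
weights_root = Path(os.environ.get("FAL_MODEL_WEIGHTS_DIR", "/data/models"))

checkpoint_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",  # assumed repo ID
    local_dir=str(weights_root / "wan-t2v-1.3b"),
    local_dir_use_symlinks=True,
)
```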
This demo showcases advanced features by calling other fal functions for auxiliary tasks. The `_is_nsfw_prompt` method (`fal_demos/video/wan.py#L140`) uses `fal_client.subscribe` to call `fal-ai/any-llm` with a system prompt designed to classify the user's text prompt as NSFW or not. Similarly, `_is_nsfw_request` (`fal_demos/video/wan.py#L158`) first uploads an image using `fal_client.upload_image` (relevant for image-to-video inputs, or for post-checking a frame of a generated video) and then calls two different fal functions (`fal-ai/imageutils/nsfw` and `fal-ai/any-llm/vision`) for a multi-layered NSFW image check. The `_expand_prompt` method (`fal_demos/video/wan.py#L188`) calls `fal-ai/any-llm` with a specific system prompt to enhance and elaborate on the user's original prompt.
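A simplified sketch of delegating such a check to `fal-ai/any-llm` via `fal_client.subscribe`; the system prompt and the shape of the returned result are illustrative guesses, not the demo's exact strings.

```python
import fal_client


def is_nsfw_prompt(prompt: str) -> bool:
    # Ask a hosted LLM to classify the prompt; subscribe() blocks until done.
    result = fal_client.subscribe(
        "fal-ai/any-llm",
        arguments={
            "system_prompt": "Answer YES if the prompt is NSFW, otherwise NO.",  # illustrative
            "prompt": prompt,
        },
    )
    # The exact response shape depends on the any-llm endpoint; adjust as needed.
    return "yes" in str(result.get("output", "")).lower()
```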
The main inference endpoint is `generate_image_to_video` (`fal_demos/video/wan.py#L210`), path-versioned as `/v2.1/1.3b/text-to-video`. Inside this function, if `enable_safety_checker` is true, the `_is_nsfw_prompt` method is called, and if `enable_prompt_expansion` is true, `_expand_prompt` is invoked (`fal_demos/video/wan.py#L213-L217`). The Wan T2V model's `generate` method is then called to create the video tensor (`fal_demos/video/wan.py#L221`). The resulting video tensor is saved to a temporary MP4 file using `wan.utils.utils.cache_video`, and this file is returned as a `fal.toolkit.File` (`fal_demos/video/wan.py#L230-L239`).
The `fal_demos/audio/diffrhythm.py` demo illustrates how to deploy a model that requires system-level dependencies beyond standard Python packages, which it achieves by using a custom Docker image. DiffRhythm generates music based on lyrics and an optional reference audio or style prompt.
Overview: This application takes lyrics (with timestamps), an optional reference audio URL or style prompt, and other parameters to generate a piece of music using the DiffRhythm model.
Key fal Concepts Demonstrated:
The `TextToMusicInput` model (`fal_demos/audio/diffrhythm.py#L11`) defines the inputs. The `lyrics` field is notable for its example, which shows a structured format with timestamps, and for its `ui` hint (`fal_demos/audio/diffrhythm.py#L34`) suggesting a `textarea` for a better user experience in the playground. The `reference_audio_url` field also carries a `ui={"important": True}` hint (`fal_demos/audio/diffrhythm.py#L44`) to prioritize it in the UI. The `Output` model is straightforward, expecting an `audio` `File` (`fal_demos/audio/diffrhythm.py#L72`).
A utility function, `extract_segments` (`fal_demos/audio/diffrhythm.py#L88`), is defined for parsing text with bracketed keys, though the main inference logic uses `get_lrc_token` from the DiffRhythm library for lyrics processing.
The most significant feature of this demo is the use of a custom Docker image. The `DOCKER_STRING` variable (`fal_demos/audio/diffrhythm.py#L112`) contains the Dockerfile instructions. It starts `FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel`, then uses `apt-get update && apt-get install -y` to install system packages such as `git`, `espeak-ng`, and `ffmpeg`. After that, it installs a long list of Python dependencies with `pip install`.
The `DiffRhythm` application class (`fal_demos/audio/diffrhythm.py#L162`) specifies `kind="container"` and uses `image=fal.ContainerImage.from_dockerfile_str(DOCKER_STRING)` to tell fal Serverless to build and use this custom environment.
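In outline, that wiring looks like the sketch below. The Dockerfile string is heavily abbreviated and the machine type and keep-alive values are placeholders; the real demo pins many more packages.

```python
import fal

DOCKER_STRING = """
FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel
RUN apt-get update && apt-get install -y git espeak-ng ffmpeg
RUN pip install soundfile librosa
"""  # abbreviated; the demo installs a much longer list of pinned packages


class DiffRhythm(
    fal.App,
    kind="container",
    image=fal.ContainerImage.from_dockerfile_str(DOCKER_STRING),
    keep_alive=300,  # placeholder value
):
    machine_type = "GPU-H100"  # illustrative; check the demo's actual machine type

    def setup(self):
        # Clone the model repository and load weights here (see the demo).
        ...
```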
Inside the `setup()` method (`fal_demos/audio/diffrhythm.py#L169`), the DiffRhythm Hugging Face Space repository is cloned using `clone_repository` (`fal_demos/audio/diffrhythm.py#L172`). Specific files, such as `negative_prompt.npy` and an ONNX model, are downloaded with `download_file` (`fal_demos/audio/diffrhythm.py#L179`, `fal_demos/audio/diffrhythm.py#L186`) into the cloned repository structure. Finally, the `prepare_model` utility from DiffRhythm is called to load and initialize the various model components (`cfm`, `cfm_full`, `tokenizer`, `muq`, `vae`).
A `warmup()` method (`fal_demos/audio/diffrhythm.py#L195`) is included, which calls the internal `_generate` method with a sample input. This is good practice to ensure the model is fully loaded before handling actual user requests, reducing cold-start latency for the first request after a worker starts.
The `_generate()` method (`fal_demos/audio/diffrhythm.py#L206`) contains the main inference logic. It first validates that either a `style_prompt` or a `reference_audio_url` is provided (`fal_demos/audio/diffrhythm.py#L209`). It conditionally loads either the `cfm` or `cfm_full` model based on the requested `music_duration` (`fal_demos/audio/diffrhythm.py#L221-L226`). Lyrics are processed using `get_lrc_token`. If a `reference_audio_url` is given, it is downloaded and `get_audio_style_prompt` is used; otherwise, `get_text_style_prompt` processes the `style_prompt` text (`fal_demos/audio/diffrhythm.py#L230-L250`). Error handling within this section uses `FieldException` to report issues with lyrics, reference audio, or style prompt processing.
The core `inference` function from the DiffRhythm library is then called (`fal_demos/audio/diffrhythm.py#L253`). The generated audio is saved to a temporary WAV file within a temporary directory, and custom billable units are calculated based on the audio duration (`fal_demos/audio/diffrhythm.py#L270`). The output is returned as a `File` object.
The primary endpoint, `/`, is defined at `fal_demos/audio/diffrhythm.py#L274` and simply calls the `_generate` method.
These demos provide a solid foundation for understanding how to deploy a wide variety of models on fal. By going through them, you can learn how to structure your applications, manage dependencies, handle inputs and outputs effectively, and leverage the platform's features to build powerful and scalable AI services. Remember to consult the official Fal documentation for more in-depth information on any of the concepts discussed.