Apache Beam Plugin for Flyte — Enable Unified Batch Data Processing via Portable Runners #6540
vakpande started this conversation in RFC Incubator
Summary:
This RFC proposes the development of a Flyte plugin for Apache Beam, enabling users to define data processing pipelines using the Beam SDK and orchestrate them seamlessly as Flyte tasks and workflows.
The goal is to give users the flexibility of Beam’s portable-runner abstraction (supporting Spark, Flink, Dataflow, etc.) within Flyte’s scalable, event-driven orchestration platform.
Motivation:
Flyte offers robust orchestration for machine learning and data workflows, while Apache Beam provides a unified programming model for batch and stream data processing.
By integrating Beam into Flyte, we can:
- Let users author a pipeline once with the Beam SDK and run it on the runner of their choice (Spark, Flink, Dataflow, etc.)
- Orchestrate those pipelines as first-class Flyte tasks inside larger workflows
- Combine Beam’s unified batch/stream processing model with Flyte’s scalable, event-driven orchestration
Proposed Design:
A Flyte user should be able to define a Beam pipeline in Python and run it as a Flyte task. The plugin introduces the following components:
- `BeamTask` — the Flyte task type that wraps a user-defined Beam pipeline
- `BeamConfig` — per-task configuration (runner selection and pipeline options)
- `beam_task_executor.py` — the run-time entry point that launches the pipeline on the configured runner
- `flytekit-beam` — the pip-installable package that ships the plugin
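To make the `BeamConfig` component concrete, here is a minimal sketch of what its surface might look like. The field names and `to_beam_args` helper are illustrative assumptions, not the final `flytekit-beam` API:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of the proposed BeamConfig: per-task runner
# selection plus arbitrary Beam pipeline options. All names here are
# assumptions for illustration.
@dataclass
class BeamConfig:
    runner: str = "DirectRunner"
    pipeline_options: Dict[str, str] = field(default_factory=dict)

    def to_beam_args(self) -> List[str]:
        """Render the config as Beam-style command-line arguments."""
        args = [f"--runner={self.runner}"]
        args.extend(f"--{k}={v}" for k, v in self.pipeline_options.items())
        return args

cfg = BeamConfig(runner="SparkRunner",
                 pipeline_options={"spark_master_url": "local[4]"})
print(cfg.to_beam_args())
# → ['--runner=SparkRunner', '--spark_master_url=local[4]']
```

Keeping the config a plain, serializable structure would let Flyte record it with the task definition and pass it unchanged to the executor at run time.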
Execution Model
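The RFC does not spell the execution model out here; as an assumed sketch, `beam_task_executor.py` would turn the task’s `BeamConfig` into Beam pipeline arguments and invoke the user’s pipeline function with them (a real executor would build `apache_beam` `PipelineOptions` and block until the runner reports completion):

```python
from typing import Callable, List

# Hypothetical execution-model sketch; the function name and flow are
# assumptions, not the plugin's actual code.
def execute_beam_task(pipeline_fn: Callable[[List[str]], str],
                      beam_args: List[str]) -> str:
    # 1. (real plugin) resolve task inputs via Flyte's typed interface
    # 2. launch the user's Beam pipeline against the configured runner
    result = pipeline_fn(beam_args)
    # 3. (real plugin) surface runner status/logs back to FlytePropeller
    return result

status = execute_beam_task(lambda args: f"DONE ({args[0]})",
                           ["--runner=DirectRunner"])
print(status)  # → DONE (--runner=DirectRunner)
```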
Plugin Structure
We’ll follow Flyte’s plugin conventions. Here’s how the structure will look:
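The concrete layout is not shown in this post; as a hedged sketch, following the convention used by existing flytekit plugins (e.g. `flytekit-spark`), the package might look like:

```
plugins/flytekit-beam/
├── flytekitplugins/
│   └── beam/
│       ├── __init__.py
│       ├── task.py                # BeamTask and BeamConfig definitions
│       └── beam_task_executor.py  # run-time pipeline launcher
├── tests/
├── setup.py
└── README.md
```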
Supported Runners (Initial):
Initial support and testing will focus on:
Additional runners (e.g., Flink, Dataflow, Samza) can be added through configuration without changing the plugin core logic.
Packaging
This plugin will be delivered as a separate pip-installable package, `flytekit-beam`.
Testing Plan
Future Extensions