Apache Beam Plugin for Flyte — Enable Unified Batch Data Processing via Portable Runners #6540
vakpande started this conversation in RFC Incubator
Summary:
This RFC proposes the development of a Flyte plugin for Apache Beam, enabling users to define data processing pipelines using the Beam SDK and orchestrate them seamlessly as Flyte tasks and workflows.
The goal is to give users the flexibility of Beam’s portable-runner abstraction (supporting Spark, Flink, Dataflow, etc.) within Flyte’s scalable, event-driven orchestration platform.
Motivation:
Flyte offers robust orchestration for machine learning and data workflows, while Apache Beam provides a unified programming model for batch and stream data processing.
By integrating Beam into Flyte, we can:
- Let users author a pipeline once with the Beam SDK and run it on the runner of their choice (Spark, Flink, Dataflow, etc.)
- Orchestrate those pipelines as first-class Flyte tasks inside larger workflows
- Combine Beam’s unified batch/stream processing model with Flyte’s scalable, event-driven orchestration
Proposed Design:
A Flyte user should be able to define a Beam pipeline in Python and run it as a Flyte task. The plugin introduces the following components:
- `BeamTask` — the Flyte task type that wraps a user-defined Beam pipeline
- `BeamConfig` — per-task configuration (runner selection and pipeline options)
- `beam_task_executor.py` — the run-time entry point that launches the pipeline on the configured runner
- `flytekit-beam` — the pip-installable package that ships the plugin
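To make the `BeamConfig` component concrete, here is a minimal sketch of what its surface might look like. The field names and `to_beam_args` helper are illustrative assumptions, not the final `flytekit-beam` API:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of the proposed BeamConfig: per-task runner
# selection plus arbitrary Beam pipeline options. All names here are
# assumptions for illustration.
@dataclass
class BeamConfig:
    runner: str = "DirectRunner"
    pipeline_options: Dict[str, str] = field(default_factory=dict)

    def to_beam_args(self) -> List[str]:
        """Render the config as Beam-style command-line arguments."""
        args = [f"--runner={self.runner}"]
        args.extend(f"--{k}={v}" for k, v in self.pipeline_options.items())
        return args

cfg = BeamConfig(runner="SparkRunner",
                 pipeline_options={"spark_master_url": "local[4]"})
print(cfg.to_beam_args())
# → ['--runner=SparkRunner', '--spark_master_url=local[4]']
```

Keeping the config a plain, serializable structure would let Flyte record it with the task definition and pass it unchanged to the executor at run time.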
Execution Model
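The RFC does not spell the execution model out here; as an assumed sketch, `beam_task_executor.py` would turn the task’s `BeamConfig` into Beam pipeline arguments and invoke the user’s pipeline function with them (a real executor would build `apache_beam` `PipelineOptions` and block until the runner reports completion):

```python
from typing import Callable, List

# Hypothetical execution-model sketch; the function name and flow are
# assumptions, not the plugin's actual code.
def execute_beam_task(pipeline_fn: Callable[[List[str]], str],
                      beam_args: List[str]) -> str:
    # 1. (real plugin) resolve task inputs via Flyte's typed interface
    # 2. launch the user's Beam pipeline against the configured runner
    result = pipeline_fn(beam_args)
    # 3. (real plugin) surface runner status/logs back to FlytePropeller
    return result

status = execute_beam_task(lambda args: f"DONE ({args[0]})",
                           ["--runner=DirectRunner"])
print(status)  # → DONE (--runner=DirectRunner)
```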
Plugin Structure
We’ll follow Flyte’s plugin conventions. Here’s how the structure will look:
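The concrete layout is not shown in this post; as a hedged sketch, following the convention used by existing flytekit plugins (e.g. `flytekit-spark`), the package might look like:

```
plugins/flytekit-beam/
├── flytekitplugins/
│   └── beam/
│       ├── __init__.py
│       ├── task.py                # BeamTask and BeamConfig definitions
│       └── beam_task_executor.py  # run-time pipeline launcher
├── tests/
├── setup.py
└── README.md
```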
Supported Runners (Initial):
Initial support and testing will focus on:
Additional runners (e.g., Flink, Dataflow, Samza) can be added through configuration without changing the plugin core logic.
Packaging
This plugin will be delivered as a separate pip-installable package, `flytekit-beam`.
Testing Plan
Future Extensions