Reevaluating Pipeline Separation: Addressing Complexity from Separated Lambda defined in Step Functions #4

@tzuwei93

Note: This comparative analysis was developed with GitHub Copilot assistance to ensure comprehensive coverage of data modeling patterns.

Problem Statement

Splitting the pipeline into multiple Lambda functions and managing them with Step Functions has added significant complexity to this project.

Many of the functions in our pipeline are highly reusable. Spreading them across separate Lambda containers leads to duplicated code and extra maintenance overhead. For instance, extract_s3_path and GooglePlacesClient are each used in multiple places. It is also difficult to trace table dependencies when debugging. Instead of leaving this shared logic scattered, we could package these functions into a custom Python package or centralize all data processing logic in a single Python module. However, even these approaches fall short of adopting an open-source tool that is purpose-built for data pipeline orchestration and data transformation management.
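To make the "custom Python package" option concrete, here is a minimal sketch of a shared module that every stage could import instead of carrying its own copy. The module name `pipeline_common` and the signature and return type of `extract_s3_path` are assumptions for illustration; the real helper in this repo may behave differently.

```python
"""pipeline_common.py: hypothetical shared module imported by every stage."""
from urllib.parse import urlparse


def extract_s3_path(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key).

    Assumed behavior for illustration only; the project's actual
    extract_s3_path may take different arguments or return another shape.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket name; the path carries a leading slash we drop.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # Example call with a made-up bucket and key.
    print(extract_s3_path("s3://my-bucket/raw/places/2024.json"))
```

Each Lambda would then depend on one versioned copy of this module (for example through a Lambda layer or a shared container base image), so a fix lands everywhere at once. It still leaves table dependencies and schemas untracked, which is the gap described below.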

Currently, the Lambda stages are tightly coupled: data and configuration are passed directly from one stage to the next, so modifying any single stage can affect the entire pipeline. Step Functions orchestrates the Lambda workflows, but it is not designed specifically for data pipelines. Its UI and YAML configurations offer only a basic view of dependencies between stages, which makes managing data transformations, dependencies, and pipeline maintenance less intuitive. For example, data schemas and table definitions must still be defined and maintained separately in the code rather than being managed natively by the orchestration tool.
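As a rough illustration of what "natively managed" dependencies buy us, the sketch below declares table dependencies as data and derives both a build order and the upstream lineage of a table from them. The table names are hypothetical, and this uses only the standard library's `graphlib`; it is not how any particular orchestration tool implements this.

```python
from graphlib import TopologicalSorter

# Hypothetical table names; each key maps to the tables it is built from.
TABLE_DEPENDENCIES = {
    "raw_places": set(),
    "cleaned_places": {"raw_places"},
    "place_ratings": {"cleaned_places"},
    "daily_summary": {"cleaned_places", "place_ratings"},
}


def build_order(dependencies):
    """Return the tables in an order where every table follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


def upstream_of(table, dependencies):
    """Return every table that `table` depends on, directly or indirectly."""
    seen = set()
    stack = [table]
    while stack:
        for parent in dependencies.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(build_order(TABLE_DEPENDENCIES))
    print(sorted(upstream_of("daily_summary", TABLE_DEPENDENCIES)))
```

With the graph declared in one place, a bug in `daily_summary` can be traced to its inputs with a lookup instead of reading each Lambda's code. Purpose-built tools typically derive this graph from the transformation definitions themselves, so it does not have to be maintained by hand as it is here.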

To address these challenges more efficiently and intuitively, I strongly recommend adopting an open-source tool designed specifically for developing data pipelines and managing data transformations.

Labels: enhancement (New feature or request)
