How to define transformation logic for multiple datasets #267

kangsoon-dev · 2024-02-09T02:57:16Z

kangsoon-dev
Feb 9, 2024

Hi, I've gone through the framework and I am still trying to understand how the framework scales up for multiple datasets and transformation logic., sorry if it is too basic a question or if the phrasing is not clear.

I observe there is only one StageA repository and one file under the datalakeLibrary for light_transformation_blueprint. If I have a dataset that consist of multiple tables that require different transformation logic at the staging phase, how do I define a lambda function for each table? Or must i use a different pipeline for each dataset with different transformation logic?
How should I be organising my lambda code in the repositories? Or do I create my own lambda functions and specify the arn in the yaml files accordingly?

Thank you!

Answered by cnfait

Feb 12, 2024

Pattern 2

Define a single dataset (unlike pattern 1), and modify sdlf-stageA to update the logic deciding which Lambda to run. If you look at the preupdate-metadata lambda, it fetches the Lambda ARN from DynamoDB and put it in the outputs for use by the next step of the Step Functions.

What you can do is, provide multiple Lambda ARN when defining the dataset:

    rExample:
        Type: awslabs::sdlf::dataset::MODULE
        Properties:
            pPipelineReference: !Ref pPipelineReference
            pTeamName: engineering
            pDatasetName: example
            pPipelineDetails: !Sub >-
                {
                    "main": {
                        "A": {
              …

View full answer

cnfait · 2024-02-12T11:40:22Z

cnfait
Feb 12, 2024

Hi!

light_transformation_blueprint is gone in SDLF 2.x so ignore that entirely - there is no transformation code in sdlf-datalakeLibrary at all. There may be parts of the documentation that still refer to it, I'll update that to avoid confusion.

Pattern 1:

There is also a second pattern below in another comment.

Declare each table as a distinct dataset, specifying which lambda to use in stageA (example with two tables, tableone and tabletwo):

    rTableOne:
        Type: awslabs::sdlf::dataset::MODULE
        Properties:
            pPipelineReference: !Ref pPipelineReference
            pTeamName: engineering
            pDatasetName: tableone
            pPipelineDetails: !Sub >-
                {
                    "main": {
                        "A": {
                            "lambda_arn": "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sdlf-engineering-main-process-tableone-a"
                        }
                    }
                }

    rTableTwo:
        Type: awslabs::sdlf::dataset::MODULE
        Properties:
            pPipelineReference: !Ref pPipelineReference
            pTeamName: engineering
            pDatasetName: tabletwo
            pPipelineDetails: !Sub >-
                {
                    "main": {
                        "A": {
                            "lambda_arn": "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sdlf-engineering-main-process-tabletwo-a"
                        }
                    }
                }

Then when defining your pipeline, make sure to process all datasets by providing the relevant event pattern:

    rMainA:
        Type: awslabs::sdlf::stageA::MODULE
        Properties:
            pPipelineReference: !Ref pPipelineReference
            pStageName: A
            pPipeline: main
            pTeamName: engineering
            pTriggerType: event
            pEventPattern: >-
                {
                    "source": ["aws.s3"],
                    "detail-type": ["Object Created"],
                    "detail": {
                        "bucket": {
                            "name": ["{{resolve:ssm:/SDLF/S3/CentralBucket}}"]
                        },
                        "object": {
                            "key": [{ "prefix": "engineering/tableone/" }, { "prefix": "engineering/tabletwo/" }]
                        }
                    }
                }
            pEnableTracing: false

Of course this means the sdlf-engineering-main-process-tableone-a and sdlf-engineering-main-process-tabletwo-a Lambda functions need to exist.

0 replies

cnfait · 2024-02-12T12:10:58Z

cnfait
Feb 12, 2024

Pattern 2

Define a single dataset (unlike pattern 1), and modify sdlf-stageA to update the logic deciding which Lambda to run. If you look at the preupdate-metadata lambda, it fetches the Lambda ARN from DynamoDB and put it in the outputs for use by the next step of the Step Functions.

What you can do is, provide multiple Lambda ARN when defining the dataset:

    rExample:
        Type: awslabs::sdlf::dataset::MODULE
        Properties:
            pPipelineReference: !Ref pPipelineReference
            pTeamName: engineering
            pDatasetName: example
            pPipelineDetails: !Sub >-
                {
                    "main": {
                        "A": {
                            "tableone": {
                                "lambda_arn": "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sdlf-engineering-main-process-tableone-a"
                            },
                            "tabletwo": {
                                "lambda_arn": "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:sdlf-engineering-main-process-tabletwo-a"
                            }
                        }
                    }
                }

Of course and again this means the sdlf-engineering-main-process-tableone-a and sdlf-engineering-main-process-tabletwo-a Lambda functions need to exist.

Then in preupdate-metadata make sure to get the relevant value:

...
lambda_arn = transform_info["pipeline"][pipeline][stage][table].get("lambda_arn", lambda_arn)
...

This assumes the table name (tableone or tabletwo in our example) is in a variable called table. The table name can probably be inferred from the S3 key available in the Lambda inputs. As an example, in the workshop: iot/legislators/persons.json is in event["key"] you could get the table name doing

from pathlib import PurePath
PurePath(event["key"]).stem

it will return persons which would be the table name in the workshop.

I would say this is my preferred pattern here, but it depends on what you're comfortable with and other requirements you have I may not be aware of.

2 replies

cnfait Feb 12, 2024

If you're not comfortable modifying one of the "default" SDLF stages you could fork sdlf-stageA - that is, make a copy of it in a different repository, that will be built as a stage by SDLF, and make the changes to that copy. See explanations about custom stages on the workshop.

I'm hoping to improve both documentation and workshop this week!

kangsoon-dev Feb 13, 2024
Author

Thank you! This description is very instructive.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How to define transformation logic for multiple datasets #267

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

How to define transformation logic for multiple datasets #267

Uh oh!

kangsoon-dev Feb 9, 2024

Pattern 2

Replies: 2 comments · 2 replies

Uh oh!

Uh oh!

cnfait Feb 12, 2024

Pattern 1:

Uh oh!

cnfait Feb 12, 2024

Pattern 2

Uh oh!

cnfait Feb 12, 2024

Uh oh!

kangsoon-dev Feb 13, 2024 Author

kangsoon-dev
Feb 9, 2024

Replies: 2 comments 2 replies

cnfait
Feb 12, 2024

cnfait
Feb 12, 2024

kangsoon-dev Feb 13, 2024
Author