Post asset_check
operations
#28717
FredrikBakken
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hi there! I've been working my way around the Dagster documentation and codebase lately to try to find a well-suited option for my current problem, without success. As I result, I present it as a possible idea 😊
In advance - I am sorry for the length of this, but I wanted to write it out enough to fully explain this edge-case.
Short summery: In the case of using Dagster in combination with a Git-based technology for data lakes such as e.g. LakeFS or Project Nessie, I would like to persist my asset's data in an isolated branch, execute all asset checks onto this branch, and then merge the branch once all
asset_check
-operations are successful.Setup example
Since I do not want to have all my assets define the branch-operations, I've off-loaded these to an
IOManager
in this way (summarized):This way, the
@dg.asset()
's defined in the codebase can focus on the asset-based ingestion/transformation operations, without duplicating branch-operation logic, e.g.:Then I add a set of necessary
asset_check
-definitions with the decorator:In a run, after the
asset
has been materialized, and allasset_check
-operations have completed successfully, I want a post asset-check-operation to be executed that merges the new branch into the source branch (e.g.main
).Alternatives looked into so far
Letting the downstream asset handle the
merge
anddelete
operations of the branch. Example updates to theBaseLakeFSIOManager
defined above:The problem I've encountered with this approach is that when I have partitioned assets and the branch-name is defined by the asset name and partition information, then the downstream asset must use the same partition definition, which is not always the case. A work-around for this is the add a redundant asset at the end of the partition-based assets to ensure the changes are merged, but this seems like an unwanted work-around.
Adding a
@dg.run_status_sensor
to monitor theasset
-materialization andasset_check
-operations and then trigger a separate@dg.op
for themerge
anddelete
operations (inspired by: Creating an Asset Check Sensor to get Slack notification whenever an Asset Check fails #21281):The idea with this approach is then to add an arbitrary automation condition that can check that the branch is actually merged before triggering the downstream asset. However, even though this should work, I suspect there are some differences in behavior when running this automatically with a schedule or by a sensor vs. manually triggering the asset materialization, which is not wanted when a materialization has to be re-triggered because of unexpected failures.
Based on this edge-case, my suggestion is then to add support for some post-operations after successfully running asset materialization and its
asset_check
-operations.Beta Was this translation helpful? Give feedback.
All reactions