Post `asset_check` operations #28717

FredrikBakken · 2025-03-25T08:27:39Z

Hi there! I've been working my way through the Dagster documentation and codebase lately to try to find a well-suited option for my current problem, without success. As a result, I'm presenting it here as a possible idea 😊

Apologies in advance for the length of this, but I wanted to write out enough detail to fully explain this edge case.

Short summary: When using Dagster in combination with a Git-like technology for data lakes, such as LakeFS or Project Nessie, I would like to persist my asset's data in an isolated branch, execute all asset checks against this branch, and then merge the branch once all asset_check-operations are successful.

Setup example

Since I do not want to have all my assets define the branch-operations, I've off-loaded these to an IOManager in this way (summarized):

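import io
from typing import Any

import dagster as dg
import obstore as obs  # assumption: `obs` below refers to the obstore package
import polars as pl
from lakefs.client import Client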

class BaseLakeFSIOManager(dg.ConfigurableIOManager):
    repository_id: str
    source_branch: str

    def client(self) -> Client:
        return Client(
            host=dg.EnvVar("LAKEFS_HOST").get_value(),
            username=dg.EnvVar("LAKEFS_USERNAME").get_value(),
            password=dg.EnvVar("LAKEFS_PASSWORD").get_value(),
        )

    def create_branch(
        self,
        target_branch: str,
    ) -> None:
        ...

    def commit_changes(
        self,
        context: dg.OutputContext,
        target_branch: str,
        message: str,
    ) -> None:
        ...

    def handle_output(self, context: dg.OutputContext, obj: Any) -> None:
        # Create new branch
        self.create_branch("branch_name")

        # Persist data to new branch
        self._handle_output(context, obj, "branch_name")

        # Commit changes to branch
        self.commit_changes(context, "branch_name", "feat: add new data")

    def load_input(self, context: dg.InputContext) -> Any:
        ...

    def _handle_output(
        self,
        context: dg.OutputContext,
        obj: Any,
        target_branch: str,
    ) -> None:
        raise NotImplementedError("Subclasses must implement _handle_output.")

    def _load_input(
        self,
        context: dg.InputContext,
    ) -> Any:
        raise NotImplementedError("Subclasses must implement _load_input.")


class LakeFSPolarsIOManager(BaseLakeFSIOManager):
    repository_id: str
    source_branch: str

    def __init__(self, repository_id: str, source_branch: str):
        super().__init__(repository_id=repository_id, source_branch=source_branch)

    def _handle_output(
        self,
        context: dg.OutputContext,
        obj: pl.DataFrame,
        target_branch: str,
    ) -> None:
        store = self.s3_client(bucket=self.repository_id)

        path = f"{target_branch}/my_file.parquet"
        context.log.info(f"Path: {path}")

        buffer = io.BytesIO()
        obj.write_parquet(buffer)
        buffer.seek(0)

        obs.put(store, path, buffer)

    def _load_input(
        self,
        context: dg.InputContext,
    ) -> pl.DataFrame:
        return pl.read_parquet("...")

This way, the @dg.asset definitions in the codebase can focus on the asset-based ingestion/transformation operations, without duplicating branch-operation logic, e.g.:

import dagster as dg
import polars as pl

@dg.asset(
    compute_kind="python",
    partitions_def=tables_partitions_def,
    io_manager_key="polars_manager",
)
def raw(
    context: dg.AssetExecutionContext,
) -> pl.DataFrame:
    partition_key = context.partition_key
    context.log.info(f"Partition key {partition_key}")

    df = pl.read_csv("...")

    return df

Then I add a set of necessary asset_check-definitions with the decorator:

import dagster as dg

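@dg.asset_check(
    asset=raw.key,
    blocking=True,
)
def raw_check() -> dg.AssetCheckResult:
    # Check logic omitted here; `raw_check` is a placeholder name.
    ...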

In a run, after the asset has been materialized, and all asset_check-operations have completed successfully, I want a post asset-check-operation to be executed that merges the new branch into the source branch (e.g. main).
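
To make the intended ordering concrete, here is a minimal, library-free sketch of the flow: write to an isolated branch, run every check against that branch, and merge only if all checks pass. `FakeLake`, `publish_with_checks`, and the check callables are hypothetical stand-ins for illustration; they are not Dagster or LakeFS APIs.

```python
from typing import Callable

Rows = list[dict]


class FakeLake:
    """In-memory stand-in for a branch-capable data lake (hypothetical, not the LakeFS API)."""

    def __init__(self) -> None:
        self.branches: dict[str, dict[str, Rows]] = {"main": {}}

    def create_branch(self, name: str, source: str = "main") -> None:
        # A new branch starts as a copy of the source branch's contents.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, path: str, rows: Rows) -> None:
        self.branches[branch][path] = rows

    def merge(self, branch: str, into: str = "main") -> None:
        self.branches[into].update(self.branches[branch])

    def delete_branch(self, name: str) -> None:
        del self.branches[name]


def publish_with_checks(
    lake: FakeLake,
    branch: str,
    path: str,
    rows: Rows,
    checks: list[Callable[[Rows], bool]],
) -> bool:
    """Write to an isolated branch, then merge into main only if every check passes."""
    lake.create_branch(branch)
    lake.write(branch, path, rows)
    # Checks read from the branch, so main never sees unvalidated data.
    if all(check(lake.branches[branch][path]) for check in checks):
        lake.merge(branch)
        lake.delete_branch(branch)
        return True
    return False  # the branch is kept for inspection; main is untouched
```

The last step (merge and delete after the checks) is the "post asset_check operation" this post is asking for; in the sketch it is just the tail of one function, whereas in Dagster the materialization and the checks are separate steps.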

Alternatives looked into so far

Letting the downstream asset handle the merge and delete operations of the branch. Example updates to the BaseLakeFSIOManager defined above:

class BaseLakeFSIOManager(dg.ConfigurableIOManager):
    ...

    def merge_changes(
        self,
        context: dg.InputContext,
        target_branch: str,
    ) -> None:
        ...

    def delete_branch(
        self,
        context: dg.InputContext,
        branch: str,
    ) -> None:
        ...

    def load_input(self, context: dg.InputContext) -> Any:
        self.merge_changes(context, "branch_name")
        self.delete_branch(context, "branch_name")

        return self._load_input(context)

    def _load_input(
        self,
        context: dg.InputContext,
    ) -> Any:
        raise NotImplementedError("Subclasses must implement _load_input.")

The problem I've encountered with this approach is that when I have partitioned assets and the branch name is derived from the asset name and partition information, the downstream asset must use the same partition definition, which is not always the case. A workaround for this is to add a redundant asset at the end of the partition-based assets to ensure the changes are merged, but this adds an asset that exists only to trigger the merge.
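
The naming scheme described here (asset name plus partition information) can be sketched as a small helper. It mirrors the f-string used in the sensor code further down in this post; the function name `branch_name_for` and the `None` fallback for unpartitioned runs are assumptions added for illustration.

```python
from typing import Optional


def branch_name_for(asset_key_path: str, partition_key: Optional[str]) -> str:
    """Build a branch name from the asset name and, if present, its partition key."""
    if partition_key is None:
        # Assumed fallback for unpartitioned runs: use the asset name alone.
        return asset_key_path
    # "|" is replaced with "_" in the same way as in the sensor's f-string.
    return f"{asset_key_path}_{partition_key.replace('|', '_')}"
```

A downstream asset that is partitioned differently (say, monthly instead of daily) would compute a different name from its own partition key, which is why it cannot locate the upstream branch to merge.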

Adding a @dg.run_status_sensor to monitor the asset-materialization and asset_check-operations and then trigger a separate @dg.op for the merge and delete operations (inspired by: Creating an Asset Check Sensor to get Slack notification whenever an Asset Check fails #21281):

from collections.abc import Sequence

import dagster as dg

class LakeFSMergeAndDeleteOpConfig(dg.Config):
    repository_id: str
    source_branch: str
    target_branch: str

@dg.op
def merge_changes(
    context: dg.OpExecutionContext,
    config: LakeFSMergeAndDeleteOpConfig,
) -> None:
    ...

@dg.op(ins={"start": dg.In(dg.Nothing)})
def delete_branch(
    context: dg.OpExecutionContext,
    config: LakeFSMergeAndDeleteOpConfig,
) -> None:
    ...

default_config = LakeFSMergeAndDeleteOpConfig(
    repository_id="raw",
    source_branch="main",
    target_branch="",
)

@dg.job(
    config=dg.RunConfig(
        ops={
            "merge_changes": default_config,
            "delete_branch": default_config,
        }
    )
)
def lakefs_merge_and_delete_changes() -> None:
    delete_branch(start=merge_changes())

# ====

@dg.run_status_sensor(
    run_status=dg.DagsterRunStatus.SUCCESS,
    default_status=dg.DefaultSensorStatus.RUNNING,
    request_job=lakefs_merge_and_delete_changes,
)
def lakefs_merge_and_delete_sensor(
    context: dg.RunStatusSensorContext,
) -> dg.RunRequest:
    if context.dagster_run.job_name == lakefs_merge_and_delete_changes.name:
        # This function is a generator, so a skip must be yielded rather than returned.
        yield dg.SkipReason("Do not run...")
        return

    run_id = context.dagster_run.run_id
    event_records: Sequence[dg.EventLogRecord] = context.instance.get_records_for_run(
        run_id,
        of_type=dg.DagsterEventType.ASSET_CHECK_EVALUATION,
    ).records

    all_passed = True

    for event_record in event_records:
        dagster_event = event_record.event_log_entry.dagster_event
        if dagster_event and dagster_event.event_specific_data:
            passed = getattr(dagster_event.event_specific_data, "passed", None)
            context.log.info(f"Passed? {passed}")
            if not passed:
                all_passed = False
        else:
            context.log.info("No event-specific data found.")
            all_passed = False

    asset_selection = context.dagster_run.asset_selection
    context.log.info(f"Asset selection: {asset_selection}")

    if all_passed:
        # asset_selection can be None, so fall back to an empty collection.
        for asset_key in context.dagster_run.asset_selection or []:
            asset_key_path = asset_key.path[0]
            repository_id = asset_key_path.rsplit("_", maxsplit=1)[-1]
            context.log.info(f"Repository ID: {repository_id}")

            default_config = LakeFSMergeAndDeleteOpConfig(
                repository_id=repository_id,
                source_branch="main",
                target_branch=f"{asset_key_path}_{context.partition_key.replace('|', '_')}",
            )

            yield dg.RunRequest(
                run_key=None,
                run_config=dg.RunConfig(
                    ops={
                        "merge_changes": default_config,
                        "delete_branch": default_config,
                    },
                ),
            )

    if not all_passed:
        # Yield (not return) the skip so the generator actually emits it.
        yield dg.SkipReason("Asset check(s) not completed yet or failed.")

The idea with this approach is then to add an arbitrary automation condition that can check that the branch is actually merged before triggering the downstream asset. However, even though this should work, I suspect there are some differences in behavior when running this automatically with a schedule or by a sensor vs. manually triggering the asset materialization, which is not wanted when a materialization has to be re-triggered because of unexpected failures.
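
As a side note on the sensor above, its pass/fail aggregation can be pulled out into a pure function, which makes the rule easier to test in isolation. This is only a sketch: `CheckEvent` is a hypothetical stand-in for the event-specific data the sensor reads, not a Dagster type.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CheckEvent:
    """Hypothetical stand-in for one asset-check evaluation event."""

    passed: Optional[bool]  # None models an event with no usable result


def all_checks_passed(events: Iterable[CheckEvent]) -> bool:
    """Return True only if every event reports passed=True.

    Like the loop in the sensor, an empty sequence also returns True.
    """
    return all(event.passed is True for event in events)
```

Note that a run with no check evaluations at all is treated as passed, both here and in the sensor loop; whether that is desirable depends on whether every asset is expected to have at least one check.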

Based on this edge-case, my suggestion is then to add support for some post-operations after successfully running asset materialization and its asset_check-operations.

Replies: 0 comments
