Launching a single RunRequest for a set of partitions from a schedule or sensor #19457
Replies: 3 comments 5 replies
-
Maybe this #14622?
-
I have the same issue. The quite ugly workaround for me is the following:

```python
from dagster import RunRequest, SensorEvaluationContext, sensor

# `partitions_def` and `get_partition_keys_from_some_resource` are defined elsewhere.


@sensor(job=...)
def sensor_do_job(
    context: SensorEvaluationContext,
):
    partition_keys = get_partition_keys_from_some_resource()
    if len(partition_keys) > 1:
        # Delete and re-add all partitions so they are stored as one
        # contiguous range that can be targeted via run tags.
        context.log.info(f"Clearing {len(partition_keys)} partitions")
        for partition in partition_keys:
            context.instance.delete_dynamic_partition(
                partitions_def.name, partition
            )
        context.log.info(f"Adding {len(partition_keys)} partitions")
        context.instance.add_dynamic_partitions(
            partitions_def.name,
            partition_keys=partition_keys,
        )
        return RunRequest(
            tags={
                "dagster/asset_partition_range_start": partition_keys[0],
                "dagster/asset_partition_range_end": partition_keys[-1],
            }
        )
    elif len(partition_keys) == 1:
        return RunRequest(partition_key=partition_keys[0])
```

By deleting and then reinserting, the partition keys are appended as a continuous range and can therefore be run using tags. However, this solution comes with the drawback that if, while the partition keys are being deleted and re-added, another application is using the same partitions, it may momentarily see them as missing.

I really think that assets which fill multiple partitions with a single run are missing an important feature here. In my specific use case the initialization cost stems from a machine learning model that is loaded from cluster storage into memory for inference. One possibility would be to serve this model from an endpoint in our Kubernetes cluster and thereby eliminate the resource setup cost. However, this would come with some additional structural overhead and would partly eliminate the ease of use of Dagster.

I would be very interested in how others solved the problem of initialization cost, or in other potential workarounds for the mentioned problem. Thanks!
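To make the range-tag part of the workaround above concrete, here is a small Dagster-free sketch of how the two tags are derived from an ordered list of partition keys. The tag names come from the snippet above; the helper function is my own and not a Dagster API:

```python
def single_run_range_tags(partition_keys: list[str]) -> dict[str, str]:
    """Build the run tags that ask Dagster to materialize a contiguous
    partition range in one run. Assumes the keys are registered in
    insertion order, so the first and last key bound the whole range."""
    if not partition_keys:
        raise ValueError("need at least one partition key")
    return {
        "dagster/asset_partition_range_start": partition_keys[0],
        "dagster/asset_partition_range_end": partition_keys[-1],
    }


tags = single_run_range_tags(["2024-01-01", "2024-01-02", "2024-01-03"])
# tags["dagster/asset_partition_range_start"] == "2024-01-01"
# tags["dagster/asset_partition_range_end"] == "2024-01-03"
```

This is exactly why the delete-and-reinsert trick matters: the tags only describe a start and an end, so the keys in between must actually be contiguous in the partition store.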
-
Anyone aware of a workaround or fix for this that doesn't involve deleting and reinserting? I've just gone through the process of writing code to automatically create dynamic assets to deal with the 25k partition limit; I have ~50,000 total partitions. I reuse these dynamic partitions across multiple assets, so deleting and re-appending is certain to cause issues.

I preferred not to use single_run as well, but in my case it's even more severe than for @AdrianTheopold: it's around 90% initialization and 10% actual work. So without it I'm looking at a ~10x increase in run time, probably even more once you add the extra load on the Dagster daemons.

One option is to use a different partitions_def on each asset, but then I'm adding even more duplication. That's probably the route I'll go with if there isn't a better alternative, though.

EDIT: I decided to still share the partitions, but to just clear all of them out and re-insert them now. Fortunately, the main order in which I want to keep inserting is by date+id ascending, so I should be able to get away with making sure that I always create them in that order from now on.
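The "always insert in date+id ascending order" idea from the edit above can be sketched without Dagster. The `YYYY-MM-DD|id` key format below is an assumption for illustration, not something from the thread:

```python
def ordered_partition_keys(raw_keys: list[str]) -> list[str]:
    """Sort dynamic-partition keys of the (assumed) form 'YYYY-MM-DD|id'
    by date ascending, then numeric id ascending, so that appending them
    in this order keeps the partition list contiguous and range-runnable."""
    def sort_key(key: str) -> tuple[str, int]:
        date_part, id_part = key.split("|", 1)
        # Parse the id numerically so '2' sorts before '11'.
        return (date_part, int(id_part))

    return sorted(raw_keys, key=sort_key)


keys = ordered_partition_keys(["2024-01-02|10", "2024-01-01|2", "2024-01-01|11"])
# ["2024-01-01|2", "2024-01-01|11", "2024-01-02|10"]
```

Passing the result to `add_dynamic_partitions` in this order would keep the stored range consistent without ever deleting shared partitions.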
-
Is there a way to launch a single RunRequest for a set of partitions from a schedule or sensor, just like you can do a single-run backfill in the UI?

I have a partitioned asset with dynamic partitions (700-800 partitions) and I want to periodically refresh the table for some or all of the available partitions (essentially to keep my own replica of a dataset provided via an API). The asset itself can work with `partition_keys` (it can handle `BackfillPolicy.single_run()`), since refreshing a single partition only takes a relatively lightweight API call, and I have a neat resource wrapping the API, so it's easy to do all partitions within a single run (maybe even async).

But I can't seem to find a way to launch just one run for multiple partitions from a schedule. All the examples I saw iterate over the partition keys and create a `RunRequest` for each, but that won't work for me here, because it creates too much overhead: hundreds of runs that in turn trigger hundreds of unnecessary auto-materializations of downstream assets and checks, a flurry of Slack notifications, etc., while the whole job could be finished in a matter of seconds with multiple partitions in a single run.
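Building on the range tags from the workaround earlier in the thread, a middle ground between one run per partition and one run total would be to chunk the ordered keys into a handful of contiguous ranges, each of which could back one RunRequest. A minimal sketch; the chunker is hypothetical, not a Dagster API:

```python
def chunk_ranges(keys: list[str], max_runs: int) -> list[tuple[str, str]]:
    """Split an ordered list of partition keys into at most `max_runs`
    contiguous (start, end) ranges of roughly equal size."""
    if not keys:
        return []
    size = -(-len(keys) // max_runs)  # ceiling division
    return [
        (chunk[0], chunk[-1])
        for chunk in (keys[i:i + size] for i in range(0, len(keys), size))
    ]


chunk_ranges(["a", "b", "c", "d", "e"], 2)
# [("a", "c"), ("d", "e")]
```

Each `(start, end)` pair could then feed the `dagster/asset_partition_range_start` and `dagster/asset_partition_range_end` tags shown above, capping the number of runs (and downstream auto-materializations) instead of launching hundreds.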