High cardinality partitions / a unit of an Asset #28678
Replies: 3 comments 9 replies
-
A couple of unordered ideas I have:
-
I have been building data pipelines with similar requirements at two jobs. My opinion on this matter is that Dagster isn't supposed to manage individual files at that scale; it's simply not designed for that. The reasons are:

Dagster is a very good orchestrator, but it isn't a scaling platform. I suggest decoupling orchestration from compute and using another tool (I've been using Ray) for the actual processing. I had to build the metadata layer myself in order to do that: I use a separate Postgres DB which stores the state of the data processing pipeline at the level of individual files. My assets are partitioned into "large" partitions (pick something which makes sense for your domain), not on individual files. When launching a new processing run, I can access the DB, read the current pipeline state, and identify the files which require processing, then scale that processing out with Ray.

I actually gave a talk about this setup at one of the Dagster community meetups a while ago. My current setup is a bit different, and I am pretty happy with its current state!

P.S. Obviously, there are multiple ways to store and use the pipeline state: something like Kafka would work too. The main idea I'm trying to communicate is that you need to build this layer yourself, and you need an external system to store the state.
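A minimal sketch of the external state layer described above. The author uses Postgres and Ray; here SQLite and a plain function stand in so the example is self-contained, and the table schema, column names, and `files_to_process` helper are all illustrative assumptions, not the author's actual setup.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical file-level state table: one row per file, grouped into
# "large" domain partitions, with a processing status per file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE file_state (
           path TEXT PRIMARY KEY,
           partition_key TEXT,   -- the "large" domain partition
           status TEXT,          -- 'pending' | 'done' | 'failed'
           updated_at TEXT
       )"""
)
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO file_state VALUES (?, ?, ?, ?)",
    [
        ("bucket/a.pdf", "2024-06", "done", now),
        ("bucket/b.pdf", "2024-06", "pending", now),
        ("bucket/c.pdf", "2024-07", "failed", now),
    ],
)

def files_to_process(partition_key: str) -> list[str]:
    """Inside the asset for one large partition, read the current
    pipeline state and pick only the files that still need work."""
    rows = conn.execute(
        "SELECT path FROM file_state "
        "WHERE partition_key = ? AND status IN ('pending', 'failed')",
        (partition_key,),
    ).fetchall()
    return [r[0] for r in rows]

# The asset body would fan the returned paths out to Ray (or similar)
# rather than processing them in the orchestrator itself.
print(files_to_process("2024-06"))
```

The point of the sketch is the separation: Dagster materializes the coarse partition, while the per-file bookkeeping lives entirely in the external store.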
-
@milicevica23 Thanks for this discussion. I've been thinking a lot about partitions recently. Some questions for you:
-
TLDR
Dagster has a soft limit of 25k partitions, and we are seeking a solution for representing a use case that easily has more than 25k partitions.
Use case: The external system is pushing PDFs to a bucket from which we want to extract text, extract topics, and create a dashboard.
About
Hi, we have been working with dagster full-time for a year now, and we are very happy with the current state of partitions for our current use case, which mainly concerns data and data transformation.
We now want to take the next step and implement a different type of use case in dagster, and we are looking for a way to represent the assets and think about the use case in an asset-based fashion. Let me first explain the use case, then discuss the limitations and our current understanding of the problem, and finally come up with a possible solution.
Feel free to challenge everything along the way and contact me to discuss the solution.
Use case
The external system is pushing PDFs to a bucket from which we want to extract text, extract topics, and create a dashboard.
I tried to sketch this here: https://excalidraw.com/#room=d11658db5f0752a6ae7b,EgxXt-S8D8NW8UnEnY36vA
As you can imagine, if we take a PDF name as the ID for a partition, we can easily have more than 25k partitions and therefore hit the limit inside dagster.
To make it more complex, imagine there are two steps in this process, and every step takes a very long time and a lot of compute.
Current dagster partitions implementation: https://docs.dagster.io/guides/build/partitions-and-backfills/partitioning-assets
Current solution/workaround
Additional problems
Data partitions
Partitions are a concept that comes from taking something big and breaking it into smaller chunks. They are normally used in the data world and in data warehouses, where a partition should never exceed X rows or megabytes; otherwise, you end up with very small chunks, which hurts performance.
In my opinion, dagster is already great at these types of partitions, but for this use case I think we have to move away from that thinking and introduce something new. I would call it a unit.
A unit of asset
A unit has an ID and a timestamp of when the unit was created. It is one piece of work that dagster has to do in order to produce a bigger asset. The timestamp is used by dagster internals to optimize the UI and give units a logical ordering: for example, showing the last 30 units, and internally partitioning data over timestamps for easier retrieval. Users are normally interested in specific units; therefore, dagster internals should keep an index over unit IDs. For a user, units are not sorted in a specific order, but normally a user is interested in newer units.
Dependency management
If two assets have similar DynamicalUnit definitions, then they should be connected over the partition ID and should execute in the order they depend on each other; in general, one unit should be processed as soon as it is created.
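The dependency idea above can be sketched as plain functions: two processing steps connected per unit ID, so each unit flows through both steps in order, without waiting on unrelated units. All names here (`extract_text`, `extract_topics`, `process_unit`) are invented for illustration; they are not a Dagster API.

```python
# Step 1 of the hypothetical pipeline: one unit (a PDF) -> its text.
def extract_text(unit_id: str) -> str:
    return f"text({unit_id})"

# Step 2: the text of that same unit -> its topics.
def extract_topics(text: str) -> list[str]:
    return [f"topic-of-{text}"]

def process_unit(unit_id: str) -> list[str]:
    # Step 2 runs as soon as step 1 for the *same* unit is done;
    # no other unit's progress is involved.
    return extract_topics(extract_text(unit_id))

print(process_unit("report-001.pdf"))
```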
Backfill
If we introduce a new function to process units again, we can say "run all the units from the last 3 days", and the timestamp can be used for this.
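A minimal sketch of the proposed unit concept: an ID plus a creation timestamp, an index over IDs for direct lookup, newest-first retrieval for the UI, and timestamp-based selection for backfills. The names (`Unit`, `UnitStore`, `units_since`, `latest`) are hypothetical, not an existing Dagster API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Unit:
    id: str               # e.g. the PDF name
    created_at: datetime  # used for ordering and time-based retrieval

class UnitStore:
    def __init__(self) -> None:
        # Index over unit IDs, since users look up specific units.
        self._by_id: dict[str, Unit] = {}

    def add(self, unit: Unit) -> None:
        self._by_id[unit.id] = unit

    def get(self, unit_id: str) -> Unit:
        return self._by_id[unit_id]

    def latest(self, n: int) -> list[Unit]:
        """Newest units first, e.g. to show the last 30 in the UI."""
        units = sorted(self._by_id.values(),
                       key=lambda u: u.created_at, reverse=True)
        return units[:n]

    def units_since(self, cutoff: datetime) -> list[Unit]:
        """Backfill selection: all units created after the cutoff."""
        return [u for u in self._by_id.values() if u.created_at >= cutoff]

now = datetime.now(timezone.utc)
store = UnitStore()
store.add(Unit("report-001.pdf", now - timedelta(days=10)))
store.add(Unit("report-002.pdf", now - timedelta(days=2)))
store.add(Unit("report-003.pdf", now - timedelta(hours=1)))

# "Run all the units from the last 3 days":
recent = store.units_since(now - timedelta(days=3))
print([u.id for u in recent])
```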