High cardinality partitions / a unit of an Asset #28678
Replies: 3 comments 9 replies
-
A couple of unordered ideas I have:
-
I have been building data pipelines with similar requirements at two jobs. My opinion on this matter is that Dagster isn't supposed to manage individual files at that scale; it's simply not designed for that. The reasons are:

Dagster is a very good orchestrator, but it isn't a scaling platform. I suggest decoupling orchestration from compute and using another tool (I've been using Ray) for the actual processing. I had to build the metadata layer myself in order to do that: I use a separate Postgres DB which stores the state of the data processing pipeline at the level of individual files. My assets are partitioned into "large" partitions (pick something which makes sense for your domain), not on individual files. When launching a new processing run, I can access the DB, read the current pipeline state, and identify the files which require processing, then scale that processing out with Ray.

I actually gave a talk about this setup at one of the Dagster community meetups a while ago. My current setup is a bit different, and I am pretty happy with its current state!

P.S. Obviously, there are multiple ways to store and use the pipeline state: something like Kafka would work too. The main idea I'm trying to communicate is that you need to build this layer yourself, and you need an external system to store the state.
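A minimal sketch of the external state layer described above. The author uses Postgres and Ray; here SQLite and a plain function stand in so the example is self-contained, and the table schema, column names, and `files_to_process` helper are all illustrative assumptions, not the author's actual setup.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical file-level state table: one row per file, grouped into
# "large" domain partitions, with a processing status per file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE file_state (
           path TEXT PRIMARY KEY,
           partition_key TEXT,   -- the "large" domain partition
           status TEXT,          -- 'pending' | 'done' | 'failed'
           updated_at TEXT
       )"""
)
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO file_state VALUES (?, ?, ?, ?)",
    [
        ("bucket/a.pdf", "2024-06", "done", now),
        ("bucket/b.pdf", "2024-06", "pending", now),
        ("bucket/c.pdf", "2024-07", "failed", now),
    ],
)

def files_to_process(partition_key: str) -> list[str]:
    """Inside the asset for one large partition, read the current
    pipeline state and pick only the files that still need work."""
    rows = conn.execute(
        "SELECT path FROM file_state "
        "WHERE partition_key = ? AND status IN ('pending', 'failed')",
        (partition_key,),
    ).fetchall()
    return [r[0] for r in rows]

# The asset body would fan the returned paths out to Ray (or similar)
# rather than processing them in the orchestrator itself.
print(files_to_process("2024-06"))
```

The point of the sketch is the separation: Dagster materializes the coarse partition, while the per-file bookkeeping lives entirely in the external store.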
-
@milicevica23 Thanks for this discussion. I've been thinking a lot about partitions recently. Some questions for you:
-
TLDR
Dagster has a soft limit of 25k partitions, and we are seeking a solution for representing a use case that easily has more than 25k partitions.
Use case: The external system is pushing PDFs to a bucket from which we want to extract text, extract topics, and create a dashboard.
About
Hi, we have been working with dagster full-time for a year now, and we are very happy with the current state of partitions for our current use case, which mainly concerns data and data transformation.
We now want to take the next step and implement a different type of use case in dagster, and we are looking for a way to represent the assets and think about the use case in an asset-based fashion. Let me first explain the use case, then discuss the limitations and our current understanding of the problem, and finally come up with a possible solution.
Feel free to challenge everything along the way and contact me to discuss the solution.
Use case
The external system is pushing PDFs to a bucket from which we want to extract text, extract topics, and create a dashboard.
I tried to sketch this here: https://excalidraw.com/#room=d11658db5f0752a6ae7b,EgxXt-S8D8NW8UnEnY36vA
As you can imagine, if we take a PDF name as the ID for a partition, we can easily have more than 25k partitions and therefore hit the limit inside dagster.
To make it more complex, imagine there are two steps in this process, and every step takes a very long time and a lot of compute.
Current dagster partitions implementation: https://docs.dagster.io/guides/build/partitions-and-backfills/partitioning-assets
Current solution/workaround
Additional problems
Data partitions
Partitions are a concept that comes from taking something big and breaking it into smaller chunks. They are normally used in the data world and in data warehouses, where a partition should never exceed X rows or megabytes; otherwise, you end up with very small chunks, which hurts performance.
In my opinion, dagster is already great at these types of partitions, but for this use case I think we have to move away from that thinking and introduce something new. I would call it a unit.
A unit of asset
A unit has an ID and a timestamp of when the unit was created. It is one piece of work that dagster has to do in order to produce a bigger asset. The timestamp is used by dagster internals to optimize the UI and give units a logical ordering: for example, showing the last 30 units, and internally partitioning data over timestamps for easier retrieval. Users are normally interested in specific units; therefore, dagster internals should keep an index over unit IDs. For a user, units are not sorted in a specific order, but normally a user is interested in newer units.
Dependency management
If two assets have similar DynamicalUnit definitions, then they should be connected over the partition ID and should execute in the order they depend on each other; in general, one unit should be processed as soon as it is created.
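The dependency idea above can be sketched as plain functions: two processing steps connected per unit ID, so each unit flows through both steps in order, without waiting on unrelated units. All names here (`extract_text`, `extract_topics`, `process_unit`) are invented for illustration; they are not a Dagster API.

```python
# Step 1 of the hypothetical pipeline: one unit (a PDF) -> its text.
def extract_text(unit_id: str) -> str:
    return f"text({unit_id})"

# Step 2: the text of that same unit -> its topics.
def extract_topics(text: str) -> list[str]:
    return [f"topic-of-{text}"]

def process_unit(unit_id: str) -> list[str]:
    # Step 2 runs as soon as step 1 for the *same* unit is done;
    # no other unit's progress is involved.
    return extract_topics(extract_text(unit_id))

print(process_unit("report-001.pdf"))
```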
Backfill
If we introduce a new function to process units again, we can say "run all the units from the last 3 days", and the timestamp can be used for this.
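A minimal sketch of the proposed unit concept: an ID plus a creation timestamp, an index over IDs for direct lookup, newest-first retrieval for the UI, and timestamp-based selection for backfills. The names (`Unit`, `UnitStore`, `units_since`, `latest`) are hypothetical, not an existing Dagster API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Unit:
    id: str               # e.g. the PDF name
    created_at: datetime  # used for ordering and time-based retrieval

class UnitStore:
    def __init__(self) -> None:
        # Index over unit IDs, since users look up specific units.
        self._by_id: dict[str, Unit] = {}

    def add(self, unit: Unit) -> None:
        self._by_id[unit.id] = unit

    def get(self, unit_id: str) -> Unit:
        return self._by_id[unit_id]

    def latest(self, n: int) -> list[Unit]:
        """Newest units first, e.g. to show the last 30 in the UI."""
        units = sorted(self._by_id.values(),
                       key=lambda u: u.created_at, reverse=True)
        return units[:n]

    def units_since(self, cutoff: datetime) -> list[Unit]:
        """Backfill selection: all units created after the cutoff."""
        return [u for u in self._by_id.values() if u.created_at >= cutoff]

now = datetime.now(timezone.utc)
store = UnitStore()
store.add(Unit("report-001.pdf", now - timedelta(days=10)))
store.add(Unit("report-002.pdf", now - timedelta(days=2)))
store.add(Unit("report-003.pdf", now - timedelta(hours=1)))

# "Run all the units from the last 3 days":
recent = store.units_since(now - timedelta(days=3))
print([u.id for u in recent])
```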