Replies: 4 comments 1 reply
-
It looks like the simpler solution here would be to design the query so that it always gets `max_tis` tasks that can actually be sent to run. This can be done by windowing over pools, which would solve the linked issue. However, I think the scheduler prioritization should also be changed: if we have the same situation as in the issue but within a single pool, we get the same result.

I thought about changing the priority to a weight, so that lower-prioritized tasks also get to run. For example, assume Dag A has priority 5, Dag B has priority 2, both have more ready tasks than there are slots, and 32 slots are available. As of now, Airflow will first complete running Dag A before it starts running Dag B, because the query returns the 32 tasks with the highest priority, and Dag B will starve. If we change the priority to be a weight, meaning that Dag A gets 5 slots for every 2 slots Dag B gets, both Dags get to run at the same time and complete faster overall. This requires a window function over pools and priorities.

There are some edge cases, e.g. the example above but with priorities of 100 and 1. This can be handled by trying to give at least 1 slot to each priority, but then what if we have more than 32 priorities? Choose the top 32 and work only on them? A possible solution is to add a configuration for the maximum number of tasks scheduled for the top priority, or to ignore the numbers themselves and go by the largest, second largest and so on, while splitting the available slots as fairly as possible according to the priority. That has problems of its own: what if we decided to schedule 4 tasks for a given priority but there are only 2 tasks to schedule? Give the rest to the next in line? Or to the most prioritized?

I would love to hear any suggestions you might have, either for simplification or improvement.
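As a rough illustration of the weight idea, a hypothetical helper (not existing Airflow code) that splits the free slots proportionally to the priority weights and tries to give every priority at least one slot:

```python
def split_slots_by_weight(free_slots: int, weights: dict[str, int]) -> dict[str, int]:
    """Split free_slots between dags proportionally to their priority weights.

    Illustration only: with weights {"A": 5, "B": 2} and 32 free slots,
    Dag A gets ~23 slots and Dag B ~9 instead of A taking all 32.
    """
    total_weight = sum(weights.values())
    shares = {dag: free_slots * w // total_weight for dag, w in weights.items()}
    # Try to give every priority at least one slot (edge case such as weights 100 vs 1).
    for dag in shares:
        if shares[dag] == 0 and sum(shares.values()) < free_slots:
            shares[dag] = 1
    # Hand out any remaining slots to the highest-weight dags first.
    leftover = free_slots - sum(shares.values())
    for dag in sorted(weights, key=weights.get, reverse=True):
        if leftover <= 0:
            break
        shares[dag] += 1
        leftover -= 1
    return shares


print(split_slots_by_weight(32, {"A": 5, "B": 2}))  # {'A': 23, 'B': 9}
```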
-
A good point is to clarify the ultimate goal we're striving to achieve. So far, always picking …

```python
starved_dags: set[str] = set()
starved_tasks: set[tuple[str, str]] = set()
starved_tasks_task_dagrun_concurrency: set[tuple[str, str, str]] = set()
```

Note that they are set to be empty and only updated dynamically for the "rare" case that we get to the second iteration. One of them is an exception:

```python
starved_pools = {pool_name for pool_name, stats in pools.items() if stats["open"] <= 0}
```

as it checks the edge case of a fully-occupied pool. It doesn't really matter, as each task specifies its own priority, and the weights are unrelated to pools or any other concurrency limit in our model. In certain configurations this condition will always be unmet (assume `sum(task.pool_slots) != pool["total_slots"]` for some pool), so …

Now we get at least 4 conditions that cause tasks to be dropped. We can enhance the query to window over every possible filter (or some subset of them) and avoid looping at all, knowing that the query returns just the good tasks, though windowing over numerical values may get us into cardinality issues and slow SQL performance.

To explain it better, look at the following scenario: in the first scheduler iteration, 5 tasks from Dag A are scheduled, …

Dropped tasks inevitably cause delays in scheduling, and despite Airflow not being a real-time system, this effect doesn't look like a desirable one, with plenty of reproducible edge cases that occur in real data systems. I wonder if there is an agreement on …

The main problem here is the inability of the scheduling mechanism to handle all the concurrency policies correctly due to task dropping and wasted critical sections. Currently the only practical solution is to increase the number of schedulers until we bump into the bottleneck of the critical section, which won't be able to withstand the heavy load. We can try to compute the …
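For context, a minimal sketch (a simplified assumption, not the exact scheduler code) of how these sets would turn into exclusion filters on the retry query; on the first iteration all of them are empty, so nothing is filtered out:

```python
# Simplified sketch: each starved_* set excludes its "limit holder" from the next
# query attempt. Tuple orderings below are assumptions, not the real column order.
from sqlalchemy import not_, tuple_
from airflow.models.taskinstance import TaskInstance as TI


def exclude_starved(query, starved_pools, starved_dags, starved_tasks,
                    starved_tasks_task_dagrun_concurrency):
    if starved_pools:
        query = query.where(not_(TI.pool.in_(starved_pools)))
    if starved_dags:
        query = query.where(not_(TI.dag_id.in_(starved_dags)))
    if starved_tasks:
        query = query.where(not_(tuple_(TI.dag_id, TI.task_id).in_(starved_tasks)))
    if starved_tasks_task_dagrun_concurrency:
        query = query.where(
            not_(
                tuple_(TI.dag_id, TI.run_id, TI.task_id).in_(
                    starved_tasks_task_dagrun_concurrency
                )
            )
        )
    return query
```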
-
Yep, so far the most promising and scalable approach is "window over everything", which gives us granularity at the level of individual scheduling entities, i.e. objects whose parametrized limits are taken into account: DAG runs and pools. We apparently have to sort the tasks by the policy-defined fields, partition by combinations of limit-holder fields, and try to stuff as many tasks as we can into each such window. We have to consider: …

Here's a pseudo query I wrote to demonstrate this idea:
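A rough SQLAlchemy sketch of such a windowed selection (an illustration with assumed simplifications, covering only the pool and DAG-run windows; `???` marks values that are not stored in the DB):

```python
# Sketch only (assumed models and simplifications, not a drop-in query):
# rank ready tasks inside each limit-holder window and cut every window at its limit.
from sqlalchemy import func, select
from airflow.models.taskinstance import TaskInstance as TI
from airflow.utils.state import TaskInstanceState

max_tis = 32  # scheduler batch size

# Running total of requested slots per pool, highest priority first.
pool_slots_taken = (
    func.sum(TI.pool_slots)
    .over(partition_by=TI.pool, order_by=TI.priority_weight.desc())
    .label("pool_slots_taken")
)
# Position of the task inside its own DAG run.
rank_in_dag_run = (
    func.row_number()
    .over(partition_by=[TI.dag_id, TI.run_id], order_by=TI.priority_weight.desc())
    .label("rank_in_dag_run")
)

ranked = (
    select(TI.dag_id, TI.run_id, TI.task_id, pool_slots_taken, rank_in_dag_run)
    .where(TI.state == TaskInstanceState.SCHEDULED)
    .subquery()
)

# "???": open slots per pool and remaining DAG-run capacity are not columns in the
# DB; for the sketch they are pretended to be single numbers from the concurrency map.
open_pool_slots = 32       # ??? in reality a per-pool value
dag_run_capacity = 16      # ??? in reality a per-DAG-run value

windowed = (
    select(ranked)
    .where(ranked.c.pool_slots_taken <= open_pool_slots)
    .where(ranked.c.rank_in_dag_run <= dag_run_capacity)
    .limit(max_tis)
)
```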
The ??? means the field is not in the DB; it's currently computed manually into a concurrency map, which complicates matters. Maybe it's worth storing these as temporary tables instead, or using a …

I hope I got the right notion of the data models, since the fields are vaguely documented and I relied solely on their names; feel free to correct me if there are mistakes. We can replace …
-
Closing to keep the discussion in one place.
-
The way the critical section works now is:

1. Run the `select` query and get at most `max_tis` task instances to schedule
2. Check the returned task instances against the concurrency limits
3. Drop the ones that exceed a limit, update the `starved_` filters and try again

The third step can cause any number of tasks to be dropped due to concurrency limits (as long as there is at least one ready task found), and only a few tasks will survive. At the same time, ready tasks will queue up in the table without getting the chance to run. This can cause tasks to starve for a long time in edge cases like almost full prioritized pools, as pointed out here:
#45636
We have to rethink the scheduler logic (the query or the loop altogether) to avoid this kind of starvation.
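For illustration, a rough pseudocode sketch of the three-step flow above; the three callables are hypothetical placeholders, not real Airflow functions:

```python
# Rough sketch of the current critical-section flow; select_candidates,
# violates_limit and remember_starved stand in for the real query and bookkeeping.
def critical_section(max_tis, select_candidates, violates_limit, remember_starved):
    executable = []
    starved = {"pools": set(), "dags": set(), "tasks": set(), "tasks_dagrun": set()}
    while len(executable) < max_tis:
        # Step 1: query at most the remaining budget, excluding known-starved entities.
        candidates = select_candidates(max_tis - len(executable), starved)
        if not candidates:
            break
        for ti in candidates:
            if violates_limit(ti):
                # Steps 2-3: the task is dropped, not queued, and its pool/DAG/task is
                # remembered so the next attempt's query can skip it.
                remember_starved(ti, starved)
            else:
                executable.append(ti)
    return executable  # dropped tasks stay "scheduled" and wait for a later loop
```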