
airflow tends to zombie tasks that should be successful #614

Open
maxgruber19 opened this issue Apr 22, 2025 · 4 comments

@maxgruber19

System versions: 24.11.0 (Stackable) / 2.9.3 (Airflow)

We observed some issues with Airflow running with Celery executors: some tasks that ran successfully are being set to "zombie". Some occurrences seem to correlate with 20+ DAGs being submitted at once (most of them on a @daily schedule), but it also happens with a single DAG running alone.

We increased the pod memory of the scheduler and the workers to 8 Gi (maybe that's still way too low?), but it's still an issue. We did that following the recommendation in the error message pasted below.

The log of an affected task is below; the task should be listed as successful because all the underlying steps completed successfully as well.

This issue is more of a question / request for Airflow experience than a typical issue / bug report.

I guess you will need further details to tell us more, so please let me know which logs / stats you need to help me 😄

airflow-worker-default-1.airflow-worker-default.mesh-platform-core.svc.cluster.local 
*** Found logs in s3: 
***   * s3://BUCKETNAME/logs/dag_id=protrans/run_id=manual__2025-04-22T13:12:23.151597+00:00/task_id=load/attempt=1.log.SchedulerJob.log 
[2025-04-22, 15:25:33 CEST] {sched.py:151} ERROR - Detected zombie job: {'full_filepath': '/stackable/app/git/current/stages/int/apps/product-protrans/dags/protrans.py', 'processor_subdir': '/stackable/app/git/.worktrees/181cc3caac63f51937ffdbf7851137d5f0fd0b49/stages/int/apps', 'msg': "{'DAG Id': 'protrans', 'Task Id': 'load', 'Run Id': 'manual__2025-04-22T13:12:23.151597+00:00', 'Hostname': 'airflow-worker-default-1.airflow-worker-default.mesh-platform-core.svc.cluster.local', 'External Executor Id': '87a1cea8-0e6f-47f8-9d0f-bbb6bac1c214'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7fbd99eec760>, 'is_failure_callback': True} (See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)
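
For reference, a quick way to check the effective zombie detection window (scheduler_zombie_task_threshold, which defaults to 300 seconds) would be a sketch like the one below; the scheduler pod name is only a guess based on our naming scheme:
# Print the effective zombie detection threshold (in seconds) from inside the scheduler pod.
# Pod name is assumed from our naming scheme -- adjust it to the actual scheduler pod.
kubectl exec -n mesh-platform-core airflow-scheduler-default-0 -- \
  airflow config get-value scheduler scheduler_zombie_task_threshold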
maxgruber19 changed the title from "airflow tends to zombie tasks that are successfull" to "airflow tends to zombie tasks that should be successful" on Apr 22, 2025
@adwk67
Member

adwk67 commented Apr 24, 2025

Hi @maxgruber19: we had to increase the memory for Airflow 2.10.4 (ok, so not 2.9.3) for our demo, even though the release notes didn't highlight anything that would require this, so OOMs could be a possible cause. And that was for the webservers, which was a little unexpected. If this happens regularly enough, you can try to catch it with something like:
kubectl get events -A --watch | grep OOM
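
If the pods get restarted before you catch the event, a post-hoc check along these lines should also show whether any container was previously OOMKilled (namespace taken from your log, adjust as needed):
# List each pod together with the last termination reason of its containers;
# OOMKilled entries point at containers hitting their memory limits.
kubectl get pods -n mesh-platform-core \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep -i oom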

@adwk67
Member

adwk67 commented Apr 24, 2025

Are you using gitsync to fetch your DAGs, by any chance?

@maxgruber19
Author

@adwk67 thanks once again. I'll try increasing resources, especially for the webservers, and will let you know about the outcome.

Yes, all DAGs come via gitsync.

@adwk67
Member

adwk67 commented Apr 25, 2025

Yes, all DAGs come via gitsync.

Depending on the size/scale of the DAGs, this can cause issues: we have an open issue here. If this seems to be the problem and it can be overcome with the overrides described in that issue, do let me know and we can prioritize that issue accordingly.
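
As a rough first check (just a sketch; the pod name is an assumption based on your naming scheme, and the sidecar's actual container name may differ), you can list the containers in a scheduler pod together with their resource settings to see what the gitsync sidecar currently gets:
# Show each container in the scheduler pod with its resource requests/limits;
# look for the git-sync sidecar in the output and check whether it is resource-starved.
kubectl get pod -n mesh-platform-core airflow-scheduler-default-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'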
