Description
System versions: 24.11.0 (Stackable) / 2.9.3 (Airflow)
We are seeing Airflow with the Celery executor mark some tasks as "zombie" even though they ran successfully. Some occurrences seem to correlate with 20+ DAGs being triggered at once (most of them on an @daily schedule), but it also happens when a single DAG runs on its own.
We increased the pod memory of the scheduler and workers to 8 GiB (maybe that is still way too low?), following the recommendation in the error message pasted below, but the issue persists.
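From what I can tell from the config reference, the knobs governing zombie detection sit under `[scheduler]`. A rough sketch of what we are considering trying (the values are guesses on our side, not validated; the env-var form would be e.g. `AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD`):

```ini
[scheduler]
# Seconds a running task may go without a job heartbeat before the scheduler
# declares it a zombie (default 300 in 2.9, if I read the config reference correctly)
scheduler_zombie_task_threshold = 600
# How often the scheduler scans for zombies (default 10 seconds)
zombie_detection_interval = 10.0
# How often running task jobs heartbeat to the metadata DB (default 5 seconds)
job_heartbeat_sec = 5
```

My worry is that if the false zombies only show up during the 20+ DAG bursts, raising the threshold would just mask heartbeat starvation on the workers rather than fix it.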
The log of an affected task is below; the task should be listed as successful because all of its underlying steps completed successfully as well.
This is more of a question / request for Airflow operating experience than a typical issue / bug report.
I guess you will need further details to tell us more, so please let me know which logs / stats you need to help me 😄
```
airflow-worker-default-1.airflow-worker-default.mesh-platform-core.svc.cluster.local
*** Found logs in s3:
*** * s3://BUCKETNAME/logs/dag_id=protrans/run_id=manual__2025-04-22T13:12:23.151597+00:00/task_id=load/attempt=1.log.SchedulerJob.log
[2025-04-22, 15:25:33 CEST] {sched.py:151} ERROR - Detected zombie job: {'full_filepath': '/stackable/app/git/current/stages/int/apps/product-protrans/dags/protrans.py', 'processor_subdir': '/stackable/app/git/.worktrees/181cc3caac63f51937ffdbf7851137d5f0fd0b49/stages/int/apps', 'msg': "{'DAG Id': 'protrans', 'Task Id': 'load', 'Run Id': 'manual__2025-04-22T13:12:23.151597+00:00', 'Hostname': 'airflow-worker-default-1.airflow-worker-default.mesh-platform-core.svc.cluster.local', 'External Executor Id': '87a1cea8-0e6f-47f8-9d0f-bbb6bac1c214'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7fbd99eec760>, 'is_failure_callback': True} (See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)
```
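If it helps with triage, this is the kind of check I can run against the metadata DB to compare the task's end time with the heartbeat of its job (just a sketch on my side; the DAG / run / task IDs are copied from the log above, and I'm assuming the zombie check compares `Job.latest_heartbeat` against `scheduler_zombie_task_threshold`):

```python
# Sketch of a metadata-DB check for one affected task (Airflow 2.9.x,
# run somewhere the Airflow metadata DB is reachable).
from airflow.configuration import conf
from airflow.jobs.job import Job
from airflow.models import TaskInstance
from airflow.utils.session import create_session

DAG_ID, TASK_ID = "protrans", "load"
RUN_ID = "manual__2025-04-22T13:12:23.151597+00:00"

with create_session() as session:
    ti = (
        session.query(TaskInstance)
        .filter_by(dag_id=DAG_ID, task_id=TASK_ID, run_id=RUN_ID)
        .one()
    )
    print("TI state:", ti.state, "| start:", ti.start_date, "| end:", ti.end_date)

    # The task's job row carries the heartbeat that zombie detection looks at.
    job = session.get(Job, ti.job_id) if ti.job_id else None
    if job is not None:
        print("Job state:", job.state, "| latest_heartbeat:", job.latest_heartbeat)
        if ti.end_date and job.latest_heartbeat:
            gap = (ti.end_date - job.latest_heartbeat).total_seconds()
            threshold = conf.getfloat("scheduler", "scheduler_zombie_task_threshold")
            print(f"heartbeat gap at task end: {gap:.0f}s (zombie threshold: {threshold:.0f}s)")
```

If the heartbeat gap at task end is larger than the threshold, that would point to the worker not heartbeating in time (e.g. under load) rather than the task actually dying.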