Commit fe05cee

mbrost05 authored and lucasdemarchi committed
drm/xe: Don't short circuit TDR on jobs not started
Short circuiting TDR on jobs not started is an optimization which is not
required. On LNL we are facing an issue where jobs do not get scheduled
by the GuC if it misses a GGTT page update. When this occurs let the TDR
fire, toggle the scheduling which may get the job unstuck, and print a
warning message. If the TDR fires twice on job that hasn't started,
timeout the job.

v2:
 - Add warning message (Paulo)
 - Add fixes tag (Paulo)
 - Timeout job which hasn't started after TDR firing twice
v3:
 - Include local change
v4:
 - Short circuit check_timeout on job not started
 - use warn level rather than notice (Paulo)

Fixes: 7ddb940 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out")
Cc: stable@vger.kernel.org
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241025214330.2010521-2-matthew.brost@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 35d25a4)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
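The policy described in the message (warn each time the TDR fires on a job that has not started, and time the job out on the second firing) can be modelled in a few lines of userspace C. This is a minimal sketch, not the driver code: `toy_job`, `toy_invalidate_job` and `toy_check_timeout` are hypothetical names, and the `>=` comparison is an assumption chosen to match the "fires twice" wording, since the body of the real `xe_sched_invalidate_job()` is not part of this diff.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a scheduler job; not the kernel's struct. */
struct toy_job {
	bool started;	/* has the hardware begun executing the job? */
	int strikes;	/* how many times the TDR fired before it started */
};

/*
 * Assumed semantics: record one strike per call and report true once the
 * number of strikes reaches max_firings.
 */
static bool toy_invalidate_job(struct toy_job *job, int max_firings)
{
	job->strikes++;
	return job->strikes >= max_firings;
}

/* Returns true when the job should be treated as timed out. */
static bool toy_check_timeout(struct toy_job *job, int max_firings)
{
	if (!job->started) {
		/* Mirror the patch: warn, then count the firing. */
		fprintf(stderr, "check job timeout: not started (firing %d)\n",
			job->strikes + 1);
		return toy_invalidate_job(job, max_firings);
	}

	/*
	 * Simplification: a started job would be judged by its measured
	 * run time, which this sketch does not model.
	 */
	return false;
}
```

With `max_firings` set to 2, the first call on a job that has not started returns false (the timer would be rearmed) and the second returns true. A started job never accumulates strikes in this model.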
1 parent 993ca0e commit fe05cee

1 file changed: +12 -6 lines changed

drivers/gpu/drm/xe/xe_guc_submit.c

Lines changed: 12 additions & 6 deletions
@@ -916,12 +916,22 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
 static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
 {
 	struct xe_gt *gt = guc_to_gt(exec_queue_to_guc(q));
-	u32 ctx_timestamp = xe_lrc_ctx_timestamp(q->lrc[0]);
-	u32 ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]);
+	u32 ctx_timestamp, ctx_job_timestamp;
 	u32 timeout_ms = q->sched_props.job_timeout_ms;
 	u32 diff;
 	u64 running_time_ms;
 
+	if (!xe_sched_job_started(job)) {
+		xe_gt_warn(gt, "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, not started",
+			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
+			   q->guc->id);
+
+		return xe_sched_invalidate_job(job, 2);
+	}
+
+	ctx_timestamp = xe_lrc_ctx_timestamp(q->lrc[0]);
+	ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]);
+
 	/*
 	 * Counter wraps at ~223s at the usual 19.2MHz, be paranoid catch
 	 * possible overflows with a high timeout.
@@ -1049,10 +1059,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		exec_queue_killed_or_banned_or_wedged(q) ||
 		exec_queue_destroyed(q);
 
-	/* Job hasn't started, can't be timed out */
-	if (!skip_timeout_check && !xe_sched_job_started(job))
-		goto rearm;
-
 	/*
 	 * XXX: Sampling timeout doesn't work in wedged mode as we have to
 	 * modify scheduling state to read timestamp. We could read the
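
The "~223s at the usual 19.2MHz" figure in the context comment of the first hunk follows from the width of the `u32` timestamps: 2^32 ticks / 19,200,000 ticks per second is roughly 223.7 s. The userspace sketch below reproduces that arithmetic and shows one way to take a wrap-safe difference of two 32-bit samples. All `toy_*` names are hypothetical, and the unsigned-subtraction approach is an assumption for illustration; the part of `check_timeout()` that computes `diff` is not shown in this diff.

```c
#include <stdint.h>

/* Assumed counter frequency, taken from the comment in the diff (19.2 MHz). */
#define TOY_TIMESTAMP_HZ 19200000ull

/* Milliseconds until a 32-bit counter running at TOY_TIMESTAMP_HZ wraps. */
static uint64_t toy_wrap_period_ms(void)
{
	return ((1ull << 32) * 1000ull) / TOY_TIMESTAMP_HZ;
}

/*
 * Ticks elapsed between two samples. Unsigned 32-bit subtraction is
 * modulo 2^32, so the result stays correct across a single wrap.
 */
static uint32_t toy_ticks_elapsed(uint32_t start, uint32_t now)
{
	return now - start;
}

/* Convert ticks to milliseconds, widening first to avoid overflow. */
static uint64_t toy_ticks_to_ms(uint32_t ticks)
{
	return ((uint64_t)ticks * 1000ull) / TOY_TIMESTAMP_HZ;
}
```

Because a single difference can represent at most one wrap period, any job timeout close to or above about 223 s cannot be measured from one pair of samples, which is the overflow case the original comment warns about.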
