
Commit ade5add

Merge tag 'amd-drm-next-6.13-2024-11-15' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.13-2024-11-15:

amdgpu:
- Partition fixes
- GFX 12 fixes
- SR-IOV fixes
- MES fixes
- RAS fixes
- GC queue handling fixes
- VCN fixes
- Add sysfs reset masks
- Better error messages for P2P failures
- SMU fixes
- Documentation updates
- GFX11 enforce isolation updates
- Display HPD fixes
- PSR fixes
- Panel replay fixes
- DP MST fixes
- USB4 fixes
- Misc display fixes and cleanups
- VRAM handling fix for APUs
- NBIO fix

amdkfd:
- INIT_WORK fix
- Refcount fix
- KFD MES scheduling fixes

drm/fourcc:
- Add missing tiling mode

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241115165012.573465-1-alexander.deucher@amd.com
2 parents 56b70bf + 447a54a commit ade5add

78 files changed, 1452 additions and 243 deletions

Documentation/gpu/amdgpu/index.rst

Lines changed: 1 addition & 0 deletions
@@ -16,4 +16,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
1616
thermal
1717
driver-misc
1818
debugging
19+
process-isolation
1920
amdgpu-glossary

Documentation/gpu/amdgpu/process-isolation.rst

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
1+
.. SPDX-License-Identifier: GPL-2.0
2+
3+
=========================
4+
AMDGPU Process Isolation
5+
=========================
6+
7+
The AMDGPU driver includes a feature that enables automatic process isolation on the graphics engine. This feature serializes access to the graphics engine and adds a cleaner shader which clears the Local Data Store (LDS) and General Purpose Registers (GPRs) between jobs. All processes using the GPU, including both graphics and compute workloads, are serialized when this feature is enabled. On GPUs that support partitionable graphics engines, this feature can be enabled on a per-partition basis.
8+
9+
In addition, there is an interface to manually run the cleaner shader when the use of the GPU is complete. This may be preferable in some use cases, such as a single-user system where the login manager triggers the cleaner shader when the user logs out.
10+
11+
Process Isolation
12+
=================
13+
14+
The `run_cleaner_shader` and `enforce_isolation` sysfs interfaces allow users to manually execute the cleaner shader and control the process isolation feature, respectively.
15+
16+
Partition Handling
17+
------------------
18+
19+
The `enforce_isolation` file in sysfs can be used to enable process isolation and automatic shader cleanup between processes. On GPUs that support graphics engine partitioning, this can be enabled per partition. The partition and its current setting (0 disabled, 1 enabled) can be read from sysfs. On GPUs that do not support graphics engine partitioning, only a single partition will be present. Writing 1 to a partition's position enables enforce isolation for that partition; writing 0 disables it.
20+
21+
Example of enabling enforce isolation on a GPU with multiple partitions:
22+
23+
.. code-block:: console
24+
25+
$ echo 1 0 1 0 > /sys/class/drm/card0/device/enforce_isolation
26+
$ cat /sys/class/drm/card0/device/enforce_isolation
27+
1 0 1 0
28+
29+
The output indicates that enforce isolation is enabled on partitions 0 and 2 and disabled on partitions 1 and 3.
30+
31+
For devices with a single partition or those that do not support partitions, there will be only one element:
32+
33+
.. code-block:: console
34+
35+
$ echo 1 > /sys/class/drm/card0/device/enforce_isolation
36+
$ cat /sys/class/drm/card0/device/enforce_isolation
37+
1
38+
39+
Cleaner Shader Execution
40+
========================
41+
42+
The driver can trigger a cleaner shader to clean up the LDS and GPR state on the graphics engine. When process isolation is enabled, this happens automatically between processes. In addition, there is a sysfs file to manually trigger cleaner shader execution.
43+
44+
To manually trigger the execution of the cleaner shader, write `0` to the `run_cleaner_shader` sysfs file:
45+
46+
.. code-block:: console
47+
48+
$ echo 0 > /sys/class/drm/card0/device/run_cleaner_shader
49+
50+
For multi-partition devices, you can specify the partition index when triggering the cleaner shader:
51+
52+
.. code-block:: console
53+
54+
$ echo 0 > /sys/class/drm/card0/device/run_cleaner_shader # For partition 0
55+
$ echo 1 > /sys/class/drm/card0/device/run_cleaner_shader # For partition 1
56+
$ echo 2 > /sys/class/drm/card0/device/run_cleaner_shader # For partition 2
57+
# ... and so on for each partition
58+
59+
This command initiates the cleaner shader, which will run and complete before any new tasks are scheduled on the GPU.
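
For readers who want to consume the `enforce_isolation` output programmatically, the following is a minimal userspace sketch. It assumes the file content has exactly the shape shown in the examples above (single `0`/`1` digits separated by spaces, with an optional trailing newline). The function name `parse_isolation` is hypothetical and is not part of the driver or of this commit.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the kernel): parse a string such as
 * "1 0 1 0\n" into one flag per partition.
 * Returns the number of partitions found, or -1 on malformed input or
 * when more than max entries are present.
 */
int parse_isolation(const char *s, int *flags, size_t max)
{
	size_t n = 0;

	while (*s) {
		/* Skip separators between entries. */
		if (*s == ' ' || *s == '\n') {
			s++;
			continue;
		}
		/* Each entry must be a single '0' or '1'. */
		if ((*s != '0' && *s != '1') || n >= max)
			return -1;
		if (s[1] != ' ' && s[1] != '\n' && s[1] != '\0')
			return -1;
		flags[n++] = *s - '0';
		s++;
	}
	return (int)n;
}
```

Given the string `1 0 1 0`, this returns 4 and fills the flags so that indices 0 and 2 are set, matching the partition numbering used in the documentation text.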

drivers/gpu/drm/amd/amdgpu/amdgpu.h

Lines changed: 8 additions & 0 deletions
@@ -299,6 +299,12 @@ extern int amdgpu_wbrf;
299299
#define AMDGPU_RESET_VCE (1 << 13)
300300
#define AMDGPU_RESET_VCE1 (1 << 14)
301301

302+
/* reset mask */
303+
#define AMDGPU_RESET_TYPE_FULL (1 << 0) /* full adapter reset, mode1/mode2/BACO/etc. */
304+
#define AMDGPU_RESET_TYPE_SOFT_RESET (1 << 1) /* IP level soft reset */
305+
#define AMDGPU_RESET_TYPE_PER_QUEUE (1 << 2) /* per queue */
306+
#define AMDGPU_RESET_TYPE_PER_PIPE (1 << 3) /* per pipe */
307+
302308
/* max cursor sizes (in pixels) */
303309
#define CIK_CURSOR_WIDTH 128
304310
#define CIK_CURSOR_HEIGHT 128
@@ -1464,6 +1470,8 @@ struct dma_fence *amdgpu_device_get_gang(struct amdgpu_device *adev);
14641470
struct dma_fence *amdgpu_device_switch_gang(struct amdgpu_device *adev,
14651471
struct dma_fence *gang);
14661472
bool amdgpu_device_has_display_hardware(struct amdgpu_device *adev);
1473+
ssize_t amdgpu_get_soft_full_reset_mask(struct amdgpu_ring *ring);
1474+
ssize_t amdgpu_show_reset_mask(char *buf, uint32_t supported_reset);
14671475

14681476
/* atpx handler */
14691477
#if defined(CONFIG_VGA_SWITCHEROO)

drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum aca_smu_
158158
return -EINVAL;
159159
}
160160

161-
if (start + count >= max_count)
161+
if (start + count > max_count)
162162
return -EINVAL;
163163

164164
count = min_t(int, count, max_count);
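
The one-character change in this hunk is an off-by-one fix in a bounds check. The sketch below isolates the two comparisons so the difference is easy to see. It assumes the request covers indices `start` through `start + count - 1`, so that a request ending exactly at `max_count` is legitimate. The function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Check as it was before the change: a request whose end lands exactly on
 * max_count is rejected. */
bool range_ok_before(int start, int count, int max_count)
{
	return !(start + count >= max_count);
}

/* Check as it is after the change: only a request that runs past
 * max_count is rejected. */
bool range_ok_after(int start, int count, int max_count)
{
	return !(start + count > max_count);
}
```

With `max_count` of 4, a request for all four entries (`start` 0, `count` 4) fails the old check and passes the new one, while a request that genuinely overruns (`start` 1, `count` 4) is rejected by both.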

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c

Lines changed: 11 additions & 2 deletions
@@ -834,6 +834,9 @@ int amdgpu_amdkfd_unmap_hiq(struct amdgpu_device *adev, u32 doorbell_off,
834834
if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
835835
return -EINVAL;
836836

837+
if (!kiq_ring->sched.ready || adev->job_hang)
838+
return 0;
839+
837840
ring_funcs = kzalloc(sizeof(*ring_funcs), GFP_KERNEL);
838841
if (!ring_funcs)
839842
return -ENOMEM;
@@ -858,8 +861,14 @@ int amdgpu_amdkfd_unmap_hiq(struct amdgpu_device *adev, u32 doorbell_off,
858861

859862
kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES, 0, 0);
860863

861-
if (kiq_ring->sched.ready && !adev->job_hang)
862-
r = amdgpu_ring_test_helper(kiq_ring);
864+
/* Submit unmap queue packet */
865+
amdgpu_ring_commit(kiq_ring);
866+
/*
867+
* Ring test will do a basic scratch register change check. Just run
868+
* this to ensure that the unmap queues packet submitted above was
869+
* processed successfully before returning.
870+
*/
871+
r = amdgpu_ring_test_helper(kiq_ring);
863872

864873
spin_unlock(&kiq->ring_lock);
865874

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

Lines changed: 53 additions & 0 deletions
@@ -4236,7 +4236,10 @@ int amdgpu_device_init(struct amdgpu_device *adev,
42364236
* for throttling interrupt) = 60 seconds.
42374237
*/
42384238
ratelimit_state_init(&adev->throttling_logging_rs, (60 - 1) * HZ, 1);
4239+
ratelimit_state_init(&adev->virt.ras_telemetry_rs, 5 * HZ, 1);
4240+
42394241
ratelimit_set_flags(&adev->throttling_logging_rs, RATELIMIT_MSG_ON_RELEASE);
4242+
ratelimit_set_flags(&adev->virt.ras_telemetry_rs, RATELIMIT_MSG_ON_RELEASE);
42404243

42414244
/* Registers mapping */
42424245
/* TODO: block userspace mapping of io register */
@@ -5186,6 +5189,9 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
51865189
amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4) ||
51875190
amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
51885191
amdgpu_ras_resume(adev);
5192+
5193+
amdgpu_virt_ras_telemetry_post_reset(adev);
5194+
51895195
return 0;
51905196
}
51915197

@@ -6200,6 +6206,9 @@ bool amdgpu_device_is_peer_accessible(struct amdgpu_device *adev,
62006206
bool p2p_access =
62016207
!adev->gmc.xgmi.connected_to_cpu &&
62026208
!(pci_p2pdma_distance(adev->pdev, peer_adev->dev, false) < 0);
6209+
if (!p2p_access)
6210+
dev_info(adev->dev, "PCIe P2P access from peer device %s is not supported by the chipset\n",
6211+
pci_name(peer_adev->pdev));
62036212

62046213
bool is_large_bar = adev->gmc.visible_vram_size &&
62056214
adev->gmc.real_vram_size == adev->gmc.visible_vram_size;
@@ -6715,3 +6724,47 @@ uint32_t amdgpu_device_wait_on_rreg(struct amdgpu_device *adev,
67156724
}
67166725
return ret;
67176726
}
6727+
6728+
ssize_t amdgpu_get_soft_full_reset_mask(struct amdgpu_ring *ring)
6729+
{
6730+
ssize_t size = 0;
6731+
6732+
if (!ring || !ring->adev)
6733+
return size;
6734+
6735+
if (amdgpu_device_should_recover_gpu(ring->adev))
6736+
size |= AMDGPU_RESET_TYPE_FULL;
6737+
6738+
if (unlikely(!ring->adev->debug_disable_soft_recovery) &&
6739+
!amdgpu_sriov_vf(ring->adev) && ring->funcs->soft_recovery)
6740+
size |= AMDGPU_RESET_TYPE_SOFT_RESET;
6741+
6742+
return size;
6743+
}
6744+
6745+
ssize_t amdgpu_show_reset_mask(char *buf, uint32_t supported_reset)
6746+
{
6747+
ssize_t size = 0;
6748+
6749+
if (supported_reset == 0) {
6750+
size += sysfs_emit_at(buf, size, "unsupported");
6751+
size += sysfs_emit_at(buf, size, "\n");
6752+
return size;
6753+
6754+
}
6755+
6756+
if (supported_reset & AMDGPU_RESET_TYPE_SOFT_RESET)
6757+
size += sysfs_emit_at(buf, size, "soft ");
6758+
6759+
if (supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE)
6760+
size += sysfs_emit_at(buf, size, "queue ");
6761+
6762+
if (supported_reset & AMDGPU_RESET_TYPE_PER_PIPE)
6763+
size += sysfs_emit_at(buf, size, "pipe ");
6764+
6765+
if (supported_reset & AMDGPU_RESET_TYPE_FULL)
6766+
size += sysfs_emit_at(buf, size, "full ");
6767+
6768+
size += sysfs_emit_at(buf, size, "\n");
6769+
return size;
6770+
}
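
The string that `amdgpu_show_reset_mask()` builds can be previewed outside the kernel. The sketch below is an illustration under stated assumptions, not the driver code: it mirrors the flag order of the function above but uses `snprintf` in place of `sysfs_emit_at`, and the names `format_reset_mask`, `append_str`, and the `RESET_TYPE_*` macros are hypothetical stand-ins for the kernel symbols.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the AMDGPU_RESET_TYPE_* bits added in amdgpu.h. */
#define RESET_TYPE_FULL       (1u << 0)
#define RESET_TYPE_SOFT_RESET (1u << 1)
#define RESET_TYPE_PER_QUEUE  (1u << 2)
#define RESET_TYPE_PER_PIPE   (1u << 3)

/* Append s at offset off; return the number of characters actually stored. */
static size_t append_str(char *buf, size_t cap, size_t off, const char *s)
{
	int n;

	if (off >= cap)
		return 0;
	n = snprintf(buf + off, cap - off, "%s", s);
	if (n < 0)
		return 0;
	if ((size_t)n >= cap - off)	/* output was truncated */
		return cap - off - 1;
	return (size_t)n;
}

/* Build the same text layout as the kernel helper: soft, queue, pipe, full. */
size_t format_reset_mask(char *buf, size_t cap, uint32_t mask)
{
	size_t size = 0;

	if (cap == 0)
		return 0;
	buf[0] = '\0';

	if (mask == 0)
		return append_str(buf, cap, size, "unsupported\n");

	if (mask & RESET_TYPE_SOFT_RESET)
		size += append_str(buf, cap, size, "soft ");
	if (mask & RESET_TYPE_PER_QUEUE)
		size += append_str(buf, cap, size, "queue ");
	if (mask & RESET_TYPE_PER_PIPE)
		size += append_str(buf, cap, size, "pipe ");
	if (mask & RESET_TYPE_FULL)
		size += append_str(buf, cap, size, "full ");

	size += append_str(buf, cap, size, "\n");
	return size;
}
```

A mask of `RESET_TYPE_FULL | RESET_TYPE_SOFT_RESET` produces `soft full ` followed by a newline; each name carries a trailing space, so a space precedes the newline. A mask of 0 produces `unsupported` followed by a newline.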
