Commit 9224271

Merge tag 'drm-habanalabs-next-2023-12-19' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next
This tag contains habanalabs driver changes for v6.8.

The notable changes are:

- uAPI changes:
  - Add sysfs entry to allow users to identify a device minor id with its debugfs path
  - Add sysfs entry to expose the device's module id as given to us from the f/w
  - Add signed device information retrieval through the INFO ioctl

- New features and improvements:
  - Update documentation of debugfs paths
  - Add support for Gaudi2C device (new PCI revision number)
  - Add pcie reset prepare/done hooks

- Firmware related fixes and changes:
  - Print three instances version numbers of Infineon second stage
  - Assume hard-reset is done by f/w upon PCIe AXI drain

- Bug fixes and code cleanups:
  - Fix information leak in sec_attest_info()
  - Avoid overriding existing undefined opcode data in Gaudi2
  - Multiple Queue Manager (QMAN) fixes for Gaudi2
  - Set hard reset flag if graceful reset is skipped
  - Remove 'get temperature' debug print
  - Fix the new Event Queue heartbeat mechanism

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Oded Gabbay <ogabbay@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/ZYFpihZscr/fsRRd@ogabbay-vm-u22.habana-labs.com
2 parents dc83fb6 + a9f0779 commit 9224271

16 files changed, +333 -184 lines changed

Documentation/ABI/testing/debugfs-driver-habanalabs

Lines changed: 36 additions & 36 deletions
@@ -1,4 +1,4 @@
-What: /sys/kernel/debug/accel/<n>/addr
+What: /sys/kernel/debug/accel/<parent_device>/addr
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
@@ -8,34 +8,34 @@ Description: Sets the device address to be used for read or write through
              only when the IOMMU is disabled.
              The acceptable value is a string that starts with "0x"
 
-What: /sys/kernel/debug/accel/<n>/clk_gate
+What: /sys/kernel/debug/accel/<parent_device>/clk_gate
 Date: May 2020
 KernelVersion: 5.8
 Contact: ogabbay@kernel.org
 Description: This setting is now deprecated as clock gating is handled solely by the f/w
 
-What: /sys/kernel/debug/accel/<n>/command_buffers
+What: /sys/kernel/debug/accel/<parent_device>/command_buffers
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Displays a list with information about the currently allocated
              command buffers
 
-What: /sys/kernel/debug/accel/<n>/command_submission
+What: /sys/kernel/debug/accel/<parent_device>/command_submission
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Displays a list with information about the currently active
              command submissions
 
-What: /sys/kernel/debug/accel/<n>/command_submission_jobs
+What: /sys/kernel/debug/accel/<parent_device>/command_submission_jobs
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Displays a list with detailed information about each JOB (CB) of
              each active command submission
 
-What: /sys/kernel/debug/accel/<n>/data32
+What: /sys/kernel/debug/accel/<parent_device>/data32
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
@@ -50,7 +50,7 @@ Description: Allows the root user to read or write directly through the
              If the IOMMU is disabled, it also allows the root user to read
              or write from the host a device VA of a host mapped memory
 
-What: /sys/kernel/debug/accel/<n>/data64
+What: /sys/kernel/debug/accel/<parent_device>/data64
 Date: Jan 2020
 KernelVersion: 5.6
 Contact: ogabbay@kernel.org
@@ -65,7 +65,7 @@ Description: Allows the root user to read or write 64 bit data directly
              If the IOMMU is disabled, it also allows the root user to read
              or write from the host a device VA of a host mapped memory
 
-What: /sys/kernel/debug/accel/<n>/data_dma
+What: /sys/kernel/debug/accel/<parent_device>/data_dma
 Date: Apr 2021
 KernelVersion: 5.13
 Contact: ogabbay@kernel.org
@@ -83,22 +83,22 @@ Description: Allows the root user to read from the device's internal
              workloads.
              Only supported on GAUDI at this stage.
 
-What: /sys/kernel/debug/accel/<n>/device
+What: /sys/kernel/debug/accel/<parent_device>/device
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Enables the root user to set the device to specific state.
              Valid values are "disable", "enable", "suspend", "resume".
              User can read this property to see the valid values
 
-What: /sys/kernel/debug/accel/<n>/device_release_watchdog_timeout
+What: /sys/kernel/debug/accel/<parent_device>/device_release_watchdog_timeout
 Date: Oct 2022
 KernelVersion: 6.2
 Contact: ttayar@habana.ai
 Description: The watchdog timeout value in seconds for a device release upon
              certain error cases, after which the device is reset.
 
-What: /sys/kernel/debug/accel/<n>/dma_size
+What: /sys/kernel/debug/accel/<parent_device>/dma_size
 Date: Apr 2021
 KernelVersion: 5.13
 Contact: ogabbay@kernel.org
@@ -108,7 +108,7 @@ Description: Specify the size of the DMA transaction when using DMA to read
              When the write is finished, the user can read the "data_dma"
              blob
 
-What: /sys/kernel/debug/accel/<n>/dump_razwi_events
+What: /sys/kernel/debug/accel/<parent_device>/dump_razwi_events
 Date: Aug 2022
 KernelVersion: 5.20
 Contact: fkassabri@habana.ai
@@ -117,38 +117,38 @@ Description: Dumps all razwi events to dmesg if exist.
              the routine will clear the status register.
              Usage: cat dump_razwi_events
 
-What: /sys/kernel/debug/accel/<n>/dump_security_violations
+What: /sys/kernel/debug/accel/<parent_device>/dump_security_violations
 Date: Jan 2021
 KernelVersion: 5.12
 Contact: ogabbay@kernel.org
 Description: Dumps all security violations to dmesg. This will also ack
              all security violations meanings those violations will not be
              dumped next time user calls this API
 
-What: /sys/kernel/debug/accel/<n>/engines
+What: /sys/kernel/debug/accel/<parent_device>/engines
 Date: Jul 2019
 KernelVersion: 5.3
 Contact: ogabbay@kernel.org
 Description: Displays the status registers values of the device engines and
              their derived idle status
 
-What: /sys/kernel/debug/accel/<n>/i2c_addr
+What: /sys/kernel/debug/accel/<parent_device>/i2c_addr
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Sets I2C device address for I2C transaction that is generated
              by the device's CPU, Not available when device is loaded with secured
              firmware
 
-What: /sys/kernel/debug/accel/<n>/i2c_bus
+What: /sys/kernel/debug/accel/<parent_device>/i2c_bus
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Sets I2C bus address for I2C transaction that is generated by
              the device's CPU, Not available when device is loaded with secured
              firmware
 
-What: /sys/kernel/debug/accel/<n>/i2c_data
+What: /sys/kernel/debug/accel/<parent_device>/i2c_data
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
@@ -157,59 +157,59 @@ Description: Triggers an I2C transaction that is generated by the device's
              reading from the file generates a read transaction, Not available
              when device is loaded with secured firmware
 
-What: /sys/kernel/debug/accel/<n>/i2c_len
+What: /sys/kernel/debug/accel/<parent_device>/i2c_len
 Date: Dec 2021
 KernelVersion: 5.17
 Contact: obitton@habana.ai
 Description: Sets I2C length in bytes for I2C transaction that is generated by
              the device's CPU, Not available when device is loaded with secured
              firmware
 
-What: /sys/kernel/debug/accel/<n>/i2c_reg
+What: /sys/kernel/debug/accel/<parent_device>/i2c_reg
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Sets I2C register id for I2C transaction that is generated by
              the device's CPU, Not available when device is loaded with secured
              firmware
 
-What: /sys/kernel/debug/accel/<n>/led0
+What: /sys/kernel/debug/accel/<parent_device>/led0
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Sets the state of the first S/W led on the device, Not available
              when device is loaded with secured firmware
 
-What: /sys/kernel/debug/accel/<n>/led1
+What: /sys/kernel/debug/accel/<parent_device>/led1
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Sets the state of the second S/W led on the device, Not available
              when device is loaded with secured firmware
 
-What: /sys/kernel/debug/accel/<n>/led2
+What: /sys/kernel/debug/accel/<parent_device>/led2
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Sets the state of the third S/W led on the device, Not available
              when device is loaded with secured firmware
 
-What: /sys/kernel/debug/accel/<n>/memory_scrub
+What: /sys/kernel/debug/accel/<parent_device>/memory_scrub
 Date: May 2022
 KernelVersion: 5.19
 Contact: dhirschfeld@habana.ai
 Description: Allows the root user to scrub the dram memory. The scrubbing
              value can be set using the debugfs file memory_scrub_val.
 
-What: /sys/kernel/debug/accel/<n>/memory_scrub_val
+What: /sys/kernel/debug/accel/<parent_device>/memory_scrub_val
 Date: May 2022
 KernelVersion: 5.19
 Contact: dhirschfeld@habana.ai
 Description: The value to which the dram will be set to when the user
              scrubs the dram using 'memory_scrub' debugfs file and
              the scrubbing value when using module param 'memory_scrub'
 
-What: /sys/kernel/debug/accel/<n>/mmu
+What: /sys/kernel/debug/accel/<parent_device>/mmu
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
@@ -219,7 +219,7 @@ Description: Displays the hop values and physical address for a given ASID
              e.g. to display info about VA 0x1000 for ASID 1 you need to do:
              echo "1 0x1000" > /sys/kernel/debug/accel/0/mmu
 
-What: /sys/kernel/debug/accel/<n>/mmu_error
+What: /sys/kernel/debug/accel/<parent_device>/mmu_error
 Date: Mar 2021
 KernelVersion: 5.12
 Contact: fkassabri@habana.ai
@@ -229,7 +229,7 @@ Description: Check and display page fault or access violation mmu errors for
              echo "0x200" > /sys/kernel/debug/accel/0/mmu_error
              cat /sys/kernel/debug/accel/0/mmu_error
 
-What: /sys/kernel/debug/accel/<n>/monitor_dump
+What: /sys/kernel/debug/accel/<parent_device>/monitor_dump
 Date: Mar 2022
 KernelVersion: 5.19
 Contact: osharabi@habana.ai
@@ -243,7 +243,7 @@ Description: Allows the root user to dump monitors status from the device's
              This interface doesn't support concurrency in the same device.
              Only supported on GAUDI.
 
-What: /sys/kernel/debug/accel/<n>/monitor_dump_trig
+What: /sys/kernel/debug/accel/<parent_device>/monitor_dump_trig
 Date: Mar 2022
 KernelVersion: 5.19
 Contact: osharabi@habana.ai
@@ -253,22 +253,22 @@ Description: Triggers dump of monitor data. The value to trigger the operatio
              When the write is finished, the user can read the "monitor_dump"
              blob
 
-What: /sys/kernel/debug/accel/<n>/set_power_state
+What: /sys/kernel/debug/accel/<parent_device>/set_power_state
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Sets the PCI power state. Valid values are "1" for D0 and "2"
              for D3Hot
 
-What: /sys/kernel/debug/accel/<n>/skip_reset_on_timeout
+What: /sys/kernel/debug/accel/<parent_device>/skip_reset_on_timeout
 Date: Jun 2021
 KernelVersion: 5.13
 Contact: ynudelman@habana.ai
 Description: Sets the skip reset on timeout option for the device. Value of
              "0" means device will be reset in case some CS has timed out,
              otherwise it will not be reset.
 
-What: /sys/kernel/debug/accel/<n>/state_dump
+What: /sys/kernel/debug/accel/<parent_device>/state_dump
 Date: Oct 2021
 KernelVersion: 5.15
 Contact: ynudelman@habana.ai
@@ -279,37 +279,37 @@ Description: Gets the state dump occurring on a CS timeout or failure.
              Writing an integer X discards X state dumps, so that the
              next read would return X+1-st newest state dump.
 
-What: /sys/kernel/debug/accel/<n>/stop_on_err
+What: /sys/kernel/debug/accel/<parent_device>/stop_on_err
 Date: Mar 2020
 KernelVersion: 5.6
 Contact: ogabbay@kernel.org
 Description: Sets the stop-on_error option for the device engines. Value of
              "0" is for disable, otherwise enable.
              Relevant only for GOYA and GAUDI.
 
-What: /sys/kernel/debug/accel/<n>/timeout_locked
+What: /sys/kernel/debug/accel/<parent_device>/timeout_locked
 Date: Sep 2021
 KernelVersion: 5.16
 Contact: obitton@habana.ai
 Description: Sets the command submission timeout value in seconds.
 
-What: /sys/kernel/debug/accel/<n>/userptr
+What: /sys/kernel/debug/accel/<parent_device>/userptr
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org
 Description: Displays a list with information about the current user
              pointers (user virtual addresses) that are pinned and mapped
              to DMA addresses
 
-What: /sys/kernel/debug/accel/<n>/userptr_lookup
+What: /sys/kernel/debug/accel/<parent_device>/userptr_lookup
 Date: Oct 2021
 KernelVersion: 5.15
 Contact: ogabbay@kernel.org
 Description: Allows to search for specific user pointers (user virtual
              addresses) that are pinned and mapped to DMA addresses, and see
              their resolution to the specific dma address.
 
-What: /sys/kernel/debug/accel/<n>/vm
+What: /sys/kernel/debug/accel/<parent_device>/vm
 Date: Jan 2019
 KernelVersion: 5.1
 Contact: ogabbay@kernel.org

Documentation/ABI/testing/sysfs-driver-habanalabs

Lines changed: 12 additions & 0 deletions
@@ -149,6 +149,18 @@ Contact: ogabbay@kernel.org
 Description: Displays the current clock frequency, in Hz, of the MME compute
              engine. This property is valid only for the Goya ASIC family
 
+What: /sys/class/accel/accel<n>/device/module_id
+Date: Nov 2023
+KernelVersion: not yet upstreamed
+Contact: ogabbay@kernel.org
+Description: Displays the device's module id
+
+What: /sys/class/accel/accel<n>/device/parent_device
+Date: Nov 2023
+KernelVersion: 6.8
+Contact: ttayar@habana.ai
+Description: Displays the name of the parent device of the accel device
+
 What: /sys/class/accel/accel<n>/device/pci_addr
 Date: Jan 2019
 KernelVersion: 5.1
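
Reviewer note: the two documentation changes fit together. The debugfs entries are now documented under `/sys/kernel/debug/accel/<parent_device>/`, and the new `parent_device` sysfs attribute gives the name to substitute for a given accel minor. A minimal user-space sketch of that lookup follows. The helper names are hypothetical, and it assumes the attribute is a single line ending in a newline, which is common for sysfs but not stated in this diff.

```c
#include <stdio.h>
#include <string.h>

/* Strip one trailing newline, if present (assumed sysfs output format). */
void strip_newline(char *s)
{
	size_t len = strlen(s);

	if (len > 0 && s[len - 1] == '\n')
		s[len - 1] = '\0';
}

/*
 * Read /sys/class/accel/accel<minor>/device/parent_device into name.
 * Returns 0 on success, -1 if the attribute cannot be opened or read.
 */
int read_parent_device(int minor, char *name, size_t name_len)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/accel/accel%d/device/parent_device", minor);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(name, (int)name_len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	strip_newline(name);
	return 0;
}

/*
 * Build /sys/kernel/debug/accel/<parent_device>/<entry> into out.
 * Returns 0 on success, -1 if the result does not fit.
 */
int build_debugfs_path(const char *parent_device, const char *entry,
		       char *out, size_t out_len)
{
	int n = snprintf(out, out_len, "/sys/kernel/debug/accel/%s/%s",
			 parent_device, entry);

	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

A caller would pass the minor number to `read_parent_device()` and then hand the result, together with an entry name such as `engines`, to `build_debugfs_path()`. Reading the resulting debugfs file still requires root and a mounted debugfs.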

drivers/accel/habanalabs/common/device.c

Lines changed: 15 additions & 10 deletions
@@ -853,6 +853,9 @@ static int device_early_init(struct hl_device *hdev)
 		gaudi2_set_asic_funcs(hdev);
 		strscpy(hdev->asic_name, "GAUDI2B", sizeof(hdev->asic_name));
 		break;
+	case ASIC_GAUDI2C:
+		gaudi2_set_asic_funcs(hdev);
+		strscpy(hdev->asic_name, "GAUDI2C", sizeof(hdev->asic_name));
 		break;
 	default:
 		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
@@ -1041,18 +1044,21 @@ static bool is_pci_link_healthy(struct hl_device *hdev)
 	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
 }
 
-static void hl_device_eq_heartbeat(struct hl_device *hdev)
+static int hl_device_eq_heartbeat_check(struct hl_device *hdev)
 {
-	u64 event_mask = HL_NOTIFIER_EVENT_DEVICE_RESET | HL_NOTIFIER_EVENT_DEVICE_UNAVAILABLE;
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
 
 	if (!prop->cpucp_info.eq_health_check_supported)
-		return;
+		return 0;
 
-	if (hdev->eq_heartbeat_received)
+	if (hdev->eq_heartbeat_received) {
 		hdev->eq_heartbeat_received = false;
-	else
-		hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask);
+	} else {
+		dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
+		return -EIO;
+	}
+
+	return 0;
 }
 
 static void hl_device_heartbeat(struct work_struct *work)
@@ -1069,10 +1075,9 @@ static void hl_device_heartbeat(struct work_struct *work)
 	/*
 	 * For EQ health check need to check if driver received the heartbeat eq event
 	 * in order to validate the eq is working.
+	 * Only if both the EQ is healthy and we managed to send the next heartbeat reschedule.
 	 */
-	hl_device_eq_heartbeat(hdev);
-
-	if (!hdev->asic_funcs->send_heartbeat(hdev))
+	if ((!hl_device_eq_heartbeat_check(hdev)) && (!hdev->asic_funcs->send_heartbeat(hdev)))
 		goto reschedule;
 
 	if (hl_device_operational(hdev, NULL))
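
Reviewer note: the behavioural point of the two heartbeat hunks is that the EQ check no longer resets the device itself. It now returns a status, and the work function reschedules only when both the EQ check and `send_heartbeat()` return 0. Because `&&` short-circuits, the next heartbeat is not sent once the EQ check has failed. The sketch below is a user-space model of that control flow only. `struct model_dev` and the `model_*` functions are hypothetical stand-ins, not driver code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few hl_device fields the check touches. */
struct model_dev {
	bool eq_health_check_supported;
	bool eq_heartbeat_received;
	int send_heartbeat_rc;    /* value the modeled send_heartbeat() returns */
	int send_heartbeat_calls; /* number of times it was invoked */
};

/* Same decision structure as hl_device_eq_heartbeat_check() in the diff. */
int model_eq_heartbeat_check(struct model_dev *dev)
{
	if (!dev->eq_health_check_supported)
		return 0;

	if (!dev->eq_heartbeat_received)
		return -EIO;

	/* Consume the flag so the next period needs a fresh EQ event. */
	dev->eq_heartbeat_received = false;
	return 0;
}

int model_send_heartbeat(struct model_dev *dev)
{
	dev->send_heartbeat_calls++;
	return dev->send_heartbeat_rc;
}

/* True when the heartbeat work would take the "goto reschedule" path. */
bool model_heartbeat_should_reschedule(struct model_dev *dev)
{
	return !model_eq_heartbeat_check(dev) && !model_send_heartbeat(dev);
}
```

In this model, a second period with no new EQ event returns false without calling `model_send_heartbeat()` again, which matches the ordering of the combined condition in the patched `hl_device_heartbeat()`.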
@@ -2035,7 +2040,7 @@ int hl_device_cond_reset(struct hl_device *hdev, u32 flags, u64 event_mask)
 	if (ctx)
 		hl_ctx_put(ctx);
 
-	return hl_device_reset(hdev, flags);
+	return hl_device_reset(hdev, flags | HL_DRV_RESET_HARD);
 }
 
 static void hl_notifier_event_send(struct hl_notifier_event *notifier_event, u64 event_mask)
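
Reviewer note: the last hunk corresponds to "Set hard reset flag if graceful reset is skipped" in the merge message. At this return, whatever flags the caller passed are forwarded with the hard-reset bit forced on. The sketch below models only that flag arithmetic. The bit values are hypothetical placeholders, since the real definitions live in the driver's header and are not part of this diff.

```c
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define MODEL_RESET_HARD       (1u << 0)
#define MODEL_RESET_OTHER_FLAG (1u << 1)

/* Flags forwarded to the reset routine by the patched return statement. */
uint32_t model_cond_reset_flags(uint32_t flags)
{
	return flags | MODEL_RESET_HARD;
}
```

The operation is idempotent, so callers that already request a hard reset see no change, while callers that did not now get one on this path.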
