I created a static dataset on a ZFS pool (zpool) made up of three NVMe SSDs (no RAID). The dataset contains ~6000 files (each ~300 MiB) used for AI training workloads.
During training, processes intermittently fail with "File not found" errors on randomly accessed files. This occurs in ~20% of training jobs (1 in 5). After the error:
The reported file is confirmed to exist and can be read immediately afterward.
zpool status shows no errors, checksum errors, or pool degradation.
The issue is not consistently reproducible but severely impacts workflow reliability.
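Since the error is intermittent and the file is readable right afterward, one way to gather evidence is to wrap the file open in the training code so that every failure is logged with a timestamp and retried. The following is only a minimal sketch under assumptions: `open_with_diagnostics` and its `retries`, `delay`, and `log` parameters are hypothetical names, not part of any training framework or of ZFS, and the real data loader may open files differently.

```python
import os
import time


def open_with_diagnostics(path, retries=3, delay=0.05, log=print):
    """Open `path` for binary reading; on FileNotFoundError, log context and retry.

    Hypothetical helper for diagnosis only. It re-raises the error once
    the retries are exhausted, so a persistent failure is not hidden.
    """
    for attempt in range(retries + 1):
        try:
            return open(path, "rb")
        except FileNotFoundError as exc:
            parent = os.path.dirname(path) or "."
            # Record what the filesystem reports right after the failure.
            log(
                f"ENOENT attempt={attempt} path={path!r} errno={exc.errno} "
                f"parent_exists={os.path.isdir(parent)} "
                f"exists_now={os.path.exists(path)} t={time.time():.6f}"
            )
            if attempt == retries:
                raise
            time.sleep(delay)
```

If a retry succeeds a few milliseconds later, that points to a transient lookup failure rather than missing data, and the logged timestamps can then be compared against kernel logs and pool events from the same moment.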
Environment
ZFS Version:
zfs-2.3.1-1
zfs-kmod-2.3.1-1
OS: [6.14.8-zabbly+ #ubuntu22.04]
Hardware: NVMe SSDs (Model: DAPUSTOR DPHV510)
Pool Configuration:
Any suggestions on how I can diagnose the issue?