Description
Describe the bug
When executing operations related to `deepspeed/ops/transformer/inference/triton/matmul_ext.py` (e.g., running `grep is_nfs_path` on this file) or launching DeepSpeed-based tasks, the following error is printed: `df: /root/.triton/autotune: No such file or directory`. This suggests the script checks the existence or properties of `/root/.triton/autotune` (likely via the `df` command), and the file system check fails because the directory does not exist.
To Reproduce
Steps to reproduce the behavior:
- Navigate to the DeepSpeed installation directory, e.g., `cd /usr/local/lib/python3.12/dist-packages/deepspeed/ops/transformer/inference/triton/`
- Run `grep is_nfs_path matmul_ext.py` to search for the relevant code
- Alternatively, launch a DeepSpeed inference task that triggers the `matmul_ext.py` module (e.g., using Triton-based matrix multiplication)
- Observe the error message:
df: /root/.triton/autotune: No such file or directory
Expected behavior
The operation (either grep or the DeepSpeed task) should complete without errors. If the /root/.triton/autotune directory is required for Triton's autotuning functionality, the script should either automatically create it or handle its absence gracefully without throwing a df command error.
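To illustrate what "handle its absence gracefully" could look like, here is a minimal sketch. It is not the actual DeepSpeed code: `nearest_existing_ancestor` and `is_nfs_path_safe` are hypothetical names, and the sketch assumes a GNU coreutils `df` that accepts `-P -T` and prints the file system type in the second column. Instead of passing a missing path to `df`, it checks the closest parent directory that does exist.

```python
import os
import subprocess


def nearest_existing_ancestor(path):
    """Walk upward from `path` until a path that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def is_nfs_path_safe(path):
    """Hypothetical helper: True if `path`, or its nearest existing
    ancestor, appears to be on an NFS mount. Never raises if `path`
    is missing or if `df` is unavailable."""
    target = nearest_existing_ancestor(path)
    try:
        # Assumption: GNU df; -P keeps each entry on one line, -T adds the type column.
        result = subprocess.run(
            ["df", "-P", "-T", target],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return False  # no `df` binary on this system
    if result.returncode != 0:
        return False  # df failed; treat as "not NFS" rather than erroring
    lines = result.stdout.strip().splitlines()
    if len(lines) < 2:  # first line is the header
        return False
    fields = lines[1].split()
    return len(fields) > 1 and "nfs" in fields[1].lower()
```

With this approach, a call such as `is_nfs_path_safe("/root/.triton/autotune")` would inspect `/root/.triton` or `/root` when the autotune directory is missing, and nothing is written to stderr by `df` for the missing path.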
ds_report output
[Please run `ds_report` and paste the output here. Example:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed.
--------------------------------------------------
DeepSpeed general environment info:
torch version ............... 2.1.0+cu118
torch cuda version ........... 11.8
torch hip version ............ None
nvcc version ................. 11.8
deepspeed version ............ 0.10.0
deepspeed wheel compiled w. ... torch 2.1, cuda 11.8
--------------------------------------------------
...
]
Screenshots
No screenshots applicable; the error is text-based as described.
System info (please complete the following information):
- OS: Ubuntu 22.04 (or relevant version)
- GPU count and types: [e.g., 1x NVIDIA A100, or specify your GPU model]
- Interconnects (if applicable): None (single machine)
- Python version: 3.12.x
- Additional info: DeepSpeed installed via `pip` in a system-wide Python 3.12 environment.
Launcher context
Launching experiments with the deepspeed launcher (e.g., deepspeed --num_gpus=1 my_script.py).
Docker context
Not using Docker; running directly on the host system. (If using Docker, specify the image: e.g., nvcr.io/nvidia/pytorch:23.10-py3)
Additional context
The error appears to stem from code in `matmul_ext.py` that checks whether `/root/.triton/autotune` is on an NFS filesystem (via the `df` command) while the directory does not exist. Manually creating the directory (`mkdir -p /root/.triton/autotune`) makes the error go away, which suggests the script assumes this path exists but does not create it.
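The manual workaround can also be applied from Python before DeepSpeed is imported. The sketch below is only an illustration: `ensure_autotune_dir` is a hypothetical helper, and the default location (`~/.triton/autotune`, which is `/root/.triton/autotune` when running as root) is an assumption based on the path in the error message, not on DeepSpeed's documented configuration.

```python
import os


def ensure_autotune_dir(path=None):
    """Create the autotune cache directory if it is missing.

    Equivalent to `mkdir -p <path>`. The default path is an assumption
    derived from the error message in this report.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".triton", "autotune")
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path
```

Calling `ensure_autotune_dir()` at the top of the launch script, before `import deepspeed`, should have the same effect as the `mkdir -p` command above.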