[BUG]deepspeed/ops/transformer/inference/triton/matmul_ext.py -> df: /root/.triton/autotune: No such file or directory #7642

@joe0731

Describe the bug
Working with deepspeed/ops/transformer/inference/triton/matmul_ext.py (e.g., running grep is_nfs_path on the file) or launching a DeepSpeed-based task produces the error: df: /root/.triton/autotune: No such file or directory. The module appears to run the df command on /root/.triton/autotune to inspect its filesystem, and df fails because the directory does not exist.

To Reproduce
Steps to reproduce the behavior:

  1. Navigate to the DeepSpeed installation directory, e.g., cd /usr/local/lib/python3.12/dist-packages/deepspeed/ops/transformer/inference/triton/
  2. Run grep is_nfs_path matmul_ext.py to search for the relevant code
  3. Alternatively, launch a DeepSpeed inference task that triggers the matmul_ext.py module (e.g., using Triton-based matrix multiplication)
  4. Observe the error message: df: /root/.triton/autotune: No such file or directory

Expected behavior
The operation (either grep or the DeepSpeed task) should complete without errors. If the /root/.triton/autotune directory is required for Triton's autotuning functionality, the script should either automatically create it or handle its absence gracefully without throwing a df command error.
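To illustrate the "automatically create it" option, here is a minimal sketch that creates the cache directory before anything inspects it. The function name ensure_autotune_dir is hypothetical, and the default of ~/.triton/autotune is an assumption based on the path in the error message; this is not DeepSpeed's actual code.

```python
import os


def ensure_autotune_dir(path=None):
    """Create the autotune cache directory if it is missing; return its absolute path.

    Hypothetical helper, not part of DeepSpeed. The default location is an
    assumption inferred from the /root/.triton/autotune path in the error.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".triton", "autotune")
    path = os.path.abspath(path)
    # exist_ok=True makes the call safe to repeat and safe when several
    # processes try to create the directory at the same time.
    os.makedirs(path, exist_ok=True)
    return path
```

Calling such a helper before any filesystem check would mean df always receives an existing path.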

ds_report output

[Please run `ds_report` and paste the output here. Example:  
--------------------------------------------------  
DeepSpeed C++/CUDA extension op report  
--------------------------------------------------  
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed.  
--------------------------------------------------  
DeepSpeed general environment info:  
torch version ............... 2.1.0+cu118  
torch cuda version ........... 11.8  
torch hip version ............ None  
nvcc version ................. 11.8  
deepspeed version ............ 0.10.0  
deepspeed wheel compiled w. ... torch 2.1, cuda 11.8  
--------------------------------------------------  
...  
]  

Screenshots
No screenshots applicable; the error is text-based as described.

System info (please complete the following information):

  • OS: Ubuntu 22.04 (or relevant version)
  • GPU count and types: [e.g., 1x NVIDIA A100, or specify your GPU model]
  • Interconnects (if applicable): None (single machine)
  • Python version: 3.12.x
  • Additional info: DeepSpeed installed via pip in a system-wide Python 3.12 environment.

Launcher context
Launching experiments with the deepspeed launcher (e.g., deepspeed --num_gpus=1 my_script.py).

Docker context
Not using Docker; running directly on the host system. (If using Docker, specify the image: e.g., nvcr.io/nvidia/pytorch:23.10-py3)

Additional context
The error appears to stem from code in matmul_ext.py that checks if /root/.triton/autotune is on an NFS filesystem (via df command), but the directory does not exist. Manually creating the directory (mkdir -p /root/.triton/autotune) temporarily resolves the error, suggesting the script assumes this path exists but does not create it.
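For the "handle its absence gracefully" option, a filesystem check could run df on the nearest existing ancestor of the path and capture stderr rather than letting it reach the console. The sketch below is an assumption about how that might look, not the code in matmul_ext.py: all three function names are hypothetical, and it assumes a GNU coreutils df that accepts -P and -T together.

```python
import os
import subprocess


def nearest_existing_ancestor(path):
    """Return the closest ancestor of `path` (or `path` itself) that exists."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def parse_df_fs_type(df_output):
    """Extract the filesystem type (second column) from `df -P -T` output."""
    lines = df_output.strip().splitlines()
    if len(lines) < 2:  # the first line is the header
        return None
    fields = lines[1].split()
    return fields[1].lower() if len(fields) > 1 else None


def is_nfs_path_tolerant(path):
    """Hypothetical NFS check that does not fail when `path` is missing."""
    target = nearest_existing_ancestor(path)
    try:
        # capture_output keeps any df error text off the console.
        result = subprocess.run(
            ["df", "-P", "-T", target],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return False  # df is unavailable or failed; assume not NFS
    fs_type = parse_df_fs_type(result.stdout)
    return fs_type is not None and "nfs" in fs_type
```

Checking the ancestor relies on a directory created later living on the same filesystem as its parent, which holds unless a separate mount is placed at that path afterwards.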
