
[Out of Box Experience]: ROCm Transformer Engine Should Be Included in AMD PyTorch Images #82



Open
functionstackx opened this issue Oct 18, 2024 · 6 comments

Comments

functionstackx (Contributor) commented Oct 18, 2024

Suggestion Description

Hi @hliuca,

On Nvidia NGC PyTorch containers (nvcr.io/nvidia/pytorch:24.xx-py3), Transformer Engine is included out of the box. This leads to fewer end-user installation and misconfiguration issues, such as building with incorrect flags.

Currently, on rocm/pytorch and rocm/pytorch-nightly, Transformer Engine is not included out of the box.

It would be great to have parity with Nvidia on this.

https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-09.html

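For context, a quick way to see the gap is an import check in each image. This is a sketch: the tags are illustrative, and printing `te.__file__` is just a cheap way to confirm the package is installed.

```shell
# Sketch: check whether Transformer Engine ships in an image.
# Image tags below are illustrative; adjust to what you actually pull.

# NGC image: TE is preinstalled, so the import succeeds.
docker run --rm nvcr.io/nvidia/pytorch:24.09-py3 \
  python -c "import transformer_engine.pytorch as te; print(te.__file__)"

# ROCm image: the same check currently fails with ModuleNotFoundError.
docker run --rm rocm/pytorch:latest \
  python -c "import transformer_engine.pytorch as te; print(te.__file__)"
```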

Operating System

Ubuntu

GPU

MI300X

ROCm Component

No response

hliuca commented Oct 18, 2024

Thank you again, @OrenLeung, for the suggestion and support. Let me talk with our team about this.

hliuca commented Nov 11, 2024

@OrenLeung we are working on this request to include TE in the torch image. Thanks.

functionstackx (Contributor, Author) commented

@hliuca should we close this issue now that TE is in rocm/pytorch-training, or is this issue tracking TE inclusion in rocm/pytorch?

hliuca commented Mar 12, 2025

Right now, it is included in rocm/pytorch-training.

Internally we have a ticket to include it in the torch image too, which would fix this issue.
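In the meantime, a minimal sketch of using the interim image looks like the following. The tag is illustrative, and the device/group flags are the usual ones for exposing ROCm GPUs to a container; adjust for your host.

```shell
# Sketch: run the interim rocm/pytorch-training image, which bundles TE.
# Tag is illustrative; --device/--group-add are the standard ROCm flags.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  rocm/pytorch-training:latest \
  python -c "import transformer_engine.pytorch as te; print('TE OK:', te.__file__)"
```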

gugarosa commented

Is there any update on this issue? Having TransformerEngine included in rocm/pytorch would be extremely helpful, since it is not trivial to build.
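For reference, building it yourself today looks roughly like the sketch below. `NVTE_FRAMEWORK` is an upstream TE build variable; `NVTE_ROCM_ARCH` is an assumption from memory of the ROCm fork's instructions, so treat the ROCm/TransformerEngine README as authoritative.

```shell
# Sketch: building Transformer Engine from source against ROCm,
# ideally inside a rocm/pytorch container so the toolchain matches.
# Flag names below are assumptions; check the ROCm/TransformerEngine
# README for the authoritative procedure.
git clone --recursive https://github.com/ROCm/TransformerEngine.git
cd TransformerEngine
export NVTE_FRAMEWORK=pytorch   # build the PyTorch bindings
export NVTE_ROCM_ARCH=gfx942    # assumption: MI300X target architecture
pip install -v .
```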

lucbruni-amd commented

Hi @gugarosa,

Sorry for the late response. This (including TE in rocm/pytorch specifically) has been implemented and is set to be released in a future version. I'll update this issue when I have more details.

Thanks for your patience in the matter.
