support properly when disable half precision #1742

yhmtsai · 2024-12-05T10:07:24Z

NVIDIA GPUs with compute capability below 5.3 do not support IEEE FP16 (half) precision operations.
However, CUDA 12.2+ relaxes some of these limitations, so we still support these GPUs with GINKGO_ENABLE_HALF=ON from CUDA 12.2 onward.
Before CUDA 12.2, we only support these GPUs with GINKGO_ENABLE_HALF=OFF.
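
The rule above can be written down as a small predicate. This is only a sketch of the description, not code from the PR: the `version` struct, the constants, and `is_configuration_supported` are hypothetical names, and the `half_enabled` flag stands in for the `GINKGO_ENABLE_HALF` CMake option.

```cpp
// Hypothetical helper, not part of Ginkgo: a (major, minor) version pair used
// for both the compute capability and the CUDA toolkit version.
struct version {
    int major_part;
    int minor_part;
};

// Thresholds taken from the PR description.
constexpr version min_half_compute_capability{5, 3};
constexpr version min_relaxed_cuda_version{12, 2};

// Returns true if v is greater than or equal to minimum.
constexpr bool at_least(version v, version minimum)
{
    return v.major_part > minimum.major_part ||
           (v.major_part == minimum.major_part &&
            v.minor_part >= minimum.minor_part);
}

// A configuration is supported if half precision is disabled, or the GPU has
// native FP16 support (>= 5.3), or the toolkit is CUDA 12.2 or newer.
constexpr bool is_configuration_supported(version compute_capability,
                                          version cuda_version,
                                          bool half_enabled)
{
    return !half_enabled ||
           at_least(compute_capability, min_half_compute_capability) ||
           at_least(cuda_version, min_relaxed_cuda_version);
}

// Compute capability 5.2 with half enabled needs CUDA 12.2 or newer.
static_assert(is_configuration_supported(version{5, 2}, version{12, 2}, true),
              "5.2 + CUDA 12.2 + half ON should be supported");
static_assert(!is_configuration_supported(version{5, 2}, version{12, 1}, true),
              "5.2 + CUDA 12.1 + half ON should not be supported");
static_assert(is_configuration_supported(version{5, 2}, version{11, 0}, false),
              "5.2 + half OFF should be supported on older CUDA");
```

In a real build, this decision would be made at configure or compile time rather than by such a function; the predicate only makes the three cases of the description explicit.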

Waiting for a check on a real GPU below the requirement.

upsj · 2024-12-05T10:09:38Z

~~Just a nit with the description: There is no 5.3, the next compute capability is 6.0.~~ missed the Jetsons

upsj

LGTM after fixing 5.3

upsj · 2024-12-05T10:19:33Z

I removed the last 5.2 device I had, because the system kept running into kernel panics. I would be fine if we didn't test this, or pushed the device requirements to 6.0, because that's the only architecture we actually have GPUs available for.

upsj · 2024-12-05T10:24:19Z

For reference, the first devices with 6.x support were released around 2016

yhmtsai · 2024-12-05T11:10:30Z

5.3 is listed in CUDA 11.0.

upsj · 2024-12-05T11:23:35Z

That's the Jetson Nano, I don't think we need to consider them.

pratikvn · 2024-12-05T11:25:31Z

The Tegra X1 and Jetson TX1 GPUs are 5.3. They officially do have half precision FP support, so technically with this PR we should be able to run on them. But I am not sure it is easy to confirm that, as we don't have access to those.

upsj · 2024-12-05T11:36:27Z

I was thinking earlier about the term "supported" - what does it mean when we state that we support certain versions? We still list classical Intel compiler support, but we no longer have pipelines for it. Does it mean it should work, so does it mean we will fix issues if it doesn't work, or does it mean we are actively monitoring it via CI?

MarcelKoch · 2024-12-05T11:45:54Z

> I was thinking earlier about the term "supported" - what does it mean when we state that we support certain versions? We still list classical Intel compiler support, but we no longer have pipelines for it. Does it mean it should work, so does it mean we will fix issues if it doesn't work, or does it mean we are actively monitoring it via CI?

I think it means the first one. We also list GCC 7+, so we expect that all versions 7.0-14.? work, but of course we are not (and will not be) testing every version in our CI.

upsj · 2024-12-05T11:48:17Z

Fixing issues means we should also be able to confirm when they are fixed. If we don't have the necessary hardware available, I would try to avoid claiming we can support them.

pratikvn · 2024-12-05T11:52:24Z

Some libraries split this into two: "Minimum supported version" (confirmed/tested with CI): x.x.x, and "Might work" (untested with CI). But I guess we can't really call the latter "supported", and can say that we support only the versions/hardware we test on.

upsj · 2024-12-05T11:53:28Z

Yes, I was thinking along the same lines. If we split up the list of supported versions, we can be more explicit.

yhmtsai · 2024-12-05T11:53:36Z

Will this also apply to CPUs? We could only claim H100/L40S/Titan X if we limited ourselves to the hardware we have available now.
I do not think we should aim to claim support only for the hardware we have available.
We can additionally say that it is constantly tested on certain platforms, or just point to our pipeline.

upsj · 2024-12-05T11:56:03Z

We have P100, Titan X, V100, RTX 2060, A2 (potentially A100 if we need it), L40S and H100 available, so we could cover the full range if we wanted to. For AMD we have MI50, MI100, MI210, MI250X and MI300 available. I would suggest having a "full full" pipeline available that tests one configuration on each supported hardware version, which we can use for pre-release checks in the future.

yhmtsai · 2024-12-05T11:56:14Z

Also, I do not think an additional claim about the tested hardware helps anything.
If users run it on a V100 or A100, for example, and face issues, we will still need to fix them, right?

upsj · 2024-12-05T12:07:35Z

For CPUs: Contrary to SASS, x86_64 is stable between hardware architectures (we should likely add a variant of aarch64), and we don't compile with -march=native, so beyond CPU bugs, it is unlikely to lead to different behaviors between microarchitectures.

thoasm

LGTM

INSTALL.md

sonarqubecloud · 2024-12-05T19:46:00Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

ginkgo-bot added the mod:cuda and mod:hip labels Dec 5, 2024

yhmtsai force-pushed the disable_half_properly branch from 80628a4 to e6dd088 Compare December 5, 2024 10:09

upsj approved these changes Dec 5, 2024

yhmtsai added this to the Ginkgo 1.9.0 milestone Dec 5, 2024

thoasm approved these changes Dec 5, 2024

yhmtsai added 2 commits December 5, 2024 15:19

disable the half properly

4a922b1

update documentation

2d5c6c4

yhmtsai force-pushed the disable_half_properly branch from e6dd088 to 2d5c6c4 Compare December 5, 2024 14:19

yhmtsai added the 1:ST:run-full-test and 1:ST:no-changelog-entry labels Dec 5, 2024

yhmtsai mentioned this pull request Dec 5, 2024

[release] update changelog #1741

Merged

yhmtsai added the 1:ST:ready-to-merge label Dec 5, 2024

yhmtsai changed the title ~~disable half precision support properly~~ support properly when disable half precision Dec 5, 2024

yhmtsai merged commit f95fc48 into develop Dec 5, 2024
8 of 11 checks passed

yhmtsai deleted the disable_half_properly branch December 5, 2024 16:52

6 participants
