yhmtsai
Member

@yhmtsai yhmtsai commented Dec 5, 2024

NVIDIA GPUs with compute capability below 5.3 do not support IEEE FP16 (half) precision operations.
However, CUDA 12.2+ relaxes some of these limitations, so we can still support these GPUs with GINKGO_ENABLE_HALF=ON from CUDA 12.2 onward.
Before CUDA 12.2, we only support these GPUs with GINKGO_ENABLE_HALF=OFF.

  • wait for the check on a real GPU below the requirement
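
The rule in the description can be written down as a small predicate. The following is only a sketch under the assumptions stated in the description (native half arithmetic from compute capability 5.3, relaxed limitations from CUDA 12.2); the names `version`, `at_least` and `is_supported` are hypothetical and are not part of Ginkgo or CUDA.

```cpp
#include <cassert>

// Hypothetical helper types, not part of Ginkgo or the CUDA toolkit.
struct version {
    int major_num;
    int minor_num;
};

// True if v is at least ref (compared as major.minor).
constexpr bool at_least(version v, version ref)
{
    return v.major_num > ref.major_num ||
           (v.major_num == ref.major_num && v.minor_num >= ref.minor_num);
}

// Encodes the rule from the PR description:
// - GINKGO_ENABLE_HALF=OFF is supported on every device;
// - GINKGO_ENABLE_HALF=ON needs compute capability >= 5.3,
//   or CUDA >= 12.2 (assumption taken from the description).
constexpr bool is_supported(version compute_capability, version cuda,
                            bool half_enabled)
{
    return !half_enabled || at_least(compute_capability, version{5, 3}) ||
           at_least(cuda, version{12, 2});
}

// Compile-time checks of the three cases named in the description.
static_assert(is_supported(version{5, 2}, version{12, 2}, true),
              "5.2 device, CUDA 12.2, half ON");
static_assert(!is_supported(version{5, 2}, version{12, 1}, true),
              "5.2 device, CUDA 12.1, half ON");
static_assert(is_supported(version{5, 2}, version{12, 1}, false),
              "5.2 device, CUDA 12.1, half OFF");
```

Whether the real build enforces this in CMake or in the source is not visible from this conversation, so the predicate should be read as a summary of the stated policy only.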

@ginkgo-bot ginkgo-bot added the mod:cuda and mod:hip labels Dec 5, 2024
@upsj
Member

upsj commented Dec 5, 2024

Just a nit with the description: There is no 5.3, the next compute capability is 6.0. missed the Jetsons

@yhmtsai yhmtsai force-pushed the disable_half_properly branch from 80628a4 to e6dd088 Compare December 5, 2024 10:09
Member

@upsj upsj left a comment

LGTM after fixing 5.3

@upsj
Member

upsj commented Dec 5, 2024

I removed the last 5.2 device I had, because the system kept running into kernel panics. I would be fine if we didn't test this, or pushed the device requirements to 6.0, because that's the only architecture we actually have GPUs available for.

@upsj
Member

upsj commented Dec 5, 2024

For reference, the first devices with 6.x support were released around 2016

@yhmtsai
Member Author

yhmtsai commented Dec 5, 2024

5.3 is listed in CUDA 11.0.

@upsj
Member

upsj commented Dec 5, 2024

That's the Jetson Nano, I don't think we need to consider them.

@pratikvn
Member

pratikvn commented Dec 5, 2024

The Tegra X1 and Jetson TX1 GPUs are 5.3. They officially do have half precision FP support, so technically with this PR we should be able to run on them. But I am not sure it is easy to confirm that, as we don't have access to those.

@upsj
Member

upsj commented Dec 5, 2024

I was thinking earlier about the term "supported" - what does it mean when we state that we support certain versions? We still list classical Intel compiler support, but we no longer have pipelines for it. Does it mean it should work, does it mean we will fix issues if it doesn't work, or does it mean we are actively monitoring it via CI?

@MarcelKoch
Member

> I was thinking earlier about the term "supported" - what does it mean when we state that we support certain versions? We still list classical Intel compiler support, but we no longer have pipelines for it. Does it mean it should work, does it mean we will fix issues if it doesn't work, or does it mean we are actively monitoring it via CI?

I think it means the first one. We also list GCC 7+, so we expect that all versions 7.0-14.? work, but of course we are not (and will not be) testing every version in our CI.

@upsj
Member

upsj commented Dec 5, 2024

Fixing issues means we should also be able to confirm when they are fixed. If we don't have the necessary hardware available, I would try to avoid claiming we can support them.

@pratikvn
Member

pratikvn commented Dec 5, 2024

Some libraries split this into two categories: "Minimum supported version" (confirmed/tested with CI): x.x.x, and "Might work" (untested with CI). But I guess we can't really call the latter "supported", and could instead say that we support only the versions/hardware we test on.

@upsj
Member

upsj commented Dec 5, 2024

Yes, I was thinking along the same lines. If we split up the list of supported versions, we can be more explicit.
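
To make the proposed split concrete, here is a minimal sketch of how a two-tier list could be represented and queried. It only illustrates the idea discussed above: the type and function names (`support_tier`, `compiler_support`, `classify`) are hypothetical, the minimum of 7 is the "GCC 7+" figure mentioned earlier in this thread, and the set of CI-tested versions in the usage note is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical tiers matching the split discussed above.
enum class support_tier {
    tested_in_ci,      // a pipeline actually builds and tests this version
    expected_to_work,  // listed as usable, but not covered by CI
    unsupported        // below the listed minimum
};

// Hypothetical description of one toolchain's support status.
struct compiler_support {
    int min_listed_major;               // lowest major version that is listed
    std::vector<int> ci_tested_majors;  // major versions covered by CI
};

// Classifies a major version into one of the tiers.
inline support_tier classify(const compiler_support& support, int major_num)
{
    if (major_num < support.min_listed_major) {
        return support_tier::unsupported;
    }
    const bool tested =
        std::find(support.ci_tested_majors.begin(),
                  support.ci_tested_majors.end(),
                  major_num) != support.ci_tested_majors.end();
    return tested ? support_tier::tested_in_ci
                  : support_tier::expected_to_work;
}
```

For example, with `compiler_support gcc{7, {7, 13}}` (the tested set being an invented placeholder), `classify(gcc, 13)` yields `tested_in_ci`, `classify(gcc, 9)` yields `expected_to_work`, and `classify(gcc, 6)` yields `unsupported`. The same structure could be keyed by compute capability instead of compiler version.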

@yhmtsai
Member Author

yhmtsai commented Dec 5, 2024

Will it also apply to CPUs? And we can only claim H100/L40S/Titan X if we limit ourselves to the hardware we have available now.
I do not think we should aim to claim support only for the available hardware.
We can additionally say it is constantly tested on certain platforms, or just point to our pipeline.

@upsj
Member

upsj commented Dec 5, 2024

We have P100, Titan X, V100, RTX 2060, A2 (potentially A100 if we need it), L40S and H100 available, so we could cover the full range if we wanted to. For AMD we have MI50, MI100, MI210, MI250X and MI300 available. I would suggest having a "full full" pipeline available that tests one configuration on each supported hardware version, which we can use for pre-release checks in the future.

@yhmtsai
Member Author

yhmtsai commented Dec 5, 2024

Also, I do not think an additional claim about the testing hardware helps anything.
Users run it on a V100 or A100, for example, and face issues. We will still need to fix it, right?

@upsj
Member

upsj commented Dec 5, 2024

For CPUs: Contrary to SASS, x86_64 is stable between hardware architectures (we should likely add a variant of aarch64), and we don't compile with -march=native, so beyond CPU bugs, it is unlikely to lead to different behaviors between microarchitectures.

@yhmtsai yhmtsai added this to the Ginkgo 1.9.0 milestone Dec 5, 2024
Member

@thoasm thoasm left a comment

LGTM

@yhmtsai yhmtsai force-pushed the disable_half_properly branch from e6dd088 to 2d5c6c4 Compare December 5, 2024 14:19
@yhmtsai yhmtsai added the 1:ST:run-full-test and 1:ST:no-changelog-entry labels Dec 5, 2024
@yhmtsai yhmtsai added the 1:ST:ready-to-merge label Dec 5, 2024
@yhmtsai yhmtsai changed the title from "disable half precision support properly" to "support properly when disable half precision" Dec 5, 2024
@yhmtsai yhmtsai merged commit f95fc48 into develop Dec 5, 2024
8 of 11 checks passed
@yhmtsai yhmtsai deleted the disable_half_properly branch December 5, 2024 16:52

sonarqubecloud bot commented Dec 5, 2024

Labels

1:ST:no-changelog-entry (Skip the wiki check for changelog update), 1:ST:ready-to-merge (This PR is ready to merge.), 1:ST:run-full-test, mod:cuda (This is related to the CUDA module.), mod:hip (This is related to the HIP module.)

6 participants