
Conversation

@zatkins-dev (Collaborator)

Adds support for mixed-precision operators, set at creation time, for the CUDA gen backend. This is made possible by defining a second scalar type, CeedScalarCPU, which is always double precision, and by changing the CeedScalar type during the JiT pass when the CEED_JIT_MIXED_PRECISION define is set.
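
For reference, a minimal sketch of that type selection (illustrative only; the actual JiT headers may differ in detail):

```cpp
// Sketch of the JiT-time type selection described above.
// CeedScalarCPU is always the storage (double) precision; CeedScalar becomes
// single precision inside the generated code when mixed precision is requested.
typedef double CeedScalarCPU;

#ifdef CEED_JIT_MIXED_PRECISION
typedef float CeedScalar;
#else
typedef double CeedScalar;
#endif
```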

The inputs to the device kernels are always CeedScalarCPU arrays, to avoid juggling multiple pointers inside a CeedVector. In gen, the input and output arrays are only touched at the beginning and end of the kernel, so all of the computation happens in the single-precision CeedScalar arrays we copy values into. This approach minimizes the code differences between mixed- and full-precision runs, essentially just requiring the helper functions to take extra template parameters to ensure the input types are correct.
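
Schematically, the generated kernel converts on load and on store, something like the following (an illustrative sketch rather than the actual generated code; the kernel and variable names here are made up):

```cuda
#include <cuda_runtime.h>

typedef double CeedScalarCPU;  // storage precision, always double
typedef float  CeedScalar;     // compute precision when CEED_JIT_MIXED_PRECISION is set

// Illustrative element kernel: global arrays stay in CeedScalarCPU, while the
// element-local work happens in CeedScalar registers.
extern "C" __global__ void ApplySketch(int num_elem, int elem_size, const CeedScalarCPU *d_u, CeedScalarCPU *d_v) {
  for (int elem = blockIdx.x; elem < num_elem; elem += gridDim.x) {
    for (int i = threadIdx.x; i < elem_size; i += blockDim.x) {
      // Convert on load (double -> float)
      CeedScalar r_u = (CeedScalar)d_u[(size_t)elem * elem_size + i];
      // ... basis action and QFunction evaluation happen in CeedScalar ...
      CeedScalar r_v = r_u;  // placeholder for the real computation
      // Convert on store (float -> double)
      d_v[(size_t)elem * elem_size + i] = (CeedScalarCPU)r_v;
    }
  }
}
```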

Support for mixed-precision operators lives at the backend level, while the actual use of mixed precision is specified per operator, to provide maximum flexibility.

This could also be extended to the CUDA ref backend, though the benefits there would likely be more modest.

@jeremylt and @nbeams, does this seem like a reasonable approach?

@zatkins-dev force-pushed the zach/mixed-precision branch 2 times, most recently from 84d2396 to c08437d on July 11, 2025 15:35
@zatkins-dev (Collaborator, Author)

Should we change the way flops are reported (say, cut them in half) for mixed precision operators?

@zatkins-dev self-assigned this Jul 11, 2025
@jrwrigh (Collaborator) commented Jul 11, 2025

> Should we change the way flops are reported (say, cut them in half) for mixed precision operators?

I don't think so; a FLOP is a FLOP. There can be some other mechanism for adjusting that if the end-user wants to, but by default I don't think the count should be changed.

```diff
   // Reduce sub and super diagonal
   CeedInt p = 0, q = 0, itr = 0, max_itr = n * n * n * n;
-  CeedScalar tol = CEED_EPSILON;
+  CeedScalar tol = 10 * CEED_EPSILON;
```
Collaborator:

Stray temp fix?

Collaborator Author:

No, that's a real fix. It isn't able to get to machine precision now that CEED_EPSILON is machine precision.

Member:

Probably worth putting this into a separate PR if it applies to main?

Collaborator Author:

It technically only applies here, since CEED_EPSILON is a bit bigger on main, but fair.

```diff
   code << tab << "// -----------------------------------------------------------------------------\n";
   code << tab << "extern \"C\" __global__ void " << operator_name
-       << "(CeedInt num_elem, void* ctx, FieldsInt_Cuda indices, Fields_Cuda fields, Fields_Cuda B, Fields_Cuda G, CeedScalar *W, Points_Cuda "
+       << "(CeedInt num_elem, void* ctx, FieldsInt_Cuda indices, Fields_Cuda fields, Fields_Cuda B, Fields_Cuda G, CeedScalarCPU *W, Points_Cuda "
```
Contributor:

Personally, I'm not a fan of the CeedScalarCPU name, especially since it's also the precision used by other GPU code beyond the mixed-precision operators, not just on the CPU. Is there a reason we can't keep this as CeedScalar (the base precision) and use a different name for the precision that changes inside the operator? Something like CeedScalarOpCompute, CeedScalarMixedCompute, or a better name someone else comes up with...
Or CeedScalarCPU could maybe become CeedScalarBase?

Collaborator Author:

The issue with that is that user code, e.g. in QFunctions, uses CeedScalar internally. By changing what CeedScalar means, we avoid requiring all user code to switch to a different scalar type in order to support mixed-precision operators.

I think CeedScalarBase would be reasonable, though.
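
For context, a typical user QFunction only ever refers to CeedScalar, e.g. this simplified mass apply (field layout illustrative):

```c
#include <ceed/types.h>

// Simplified mass-apply QFunction: the user only names CeedScalar, so
// retargeting CeedScalar during JiT requires no changes to code like this.
CEED_QFUNCTION(MassApplySketch)(void *ctx, const CeedInt Q, const CeedScalar *const *in, CeedScalar *const *out) {
  const CeedScalar *q_data = in[0], *u = in[1];
  CeedScalar       *v      = out[0];

  for (CeedInt i = 0; i < Q; i++) v[i] = q_data[i] * u[i];
  return 0;
}
```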

Member:

Yeah, that is the "at rest" or "storage" or "outer" representation (not opposed to "base" either).

What do we anticipate doing about qdata? Will it also always be converted for the QFunction, or might it be stored in the inner representation? (I can imagine running kernels in f32 while Krylov runs in f64, though it looks like we've been going the other way in the examples.)

Collaborator Author:

The qdata is copied into arrays of type CeedScalar inside the kernel, so the user keeps using their base precision and it is converted internally for the QFunction.

Collaborator Author:

The examples go both ways: both higher and lower precision for the internal operator. I think f64 base with f32 inner is the primary use case, though.

Member:

We may later want to support qdata at rest in the inner representation (specifically, when the inner representation is smaller). I can imagine that being done by making CeedVector handle two pointers (it won't even need axpy and dot product functionality).
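
Roughly the shape I'm imagining (a hypothetical sketch only; none of these names exist in libCEED, and this is not part of this PR):

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical dual-representation storage for at-rest qdata: the inner
// (compute-precision) array is what the operator kernels read, while the
// outer (base-precision) array is only materialized if someone requests it.
typedef struct {
  double *values_outer;    // base/storage precision copy, possibly unallocated
  float  *values_inner;    // compute precision copy, authoritative for qdata
  size_t  length;
  bool    outer_is_valid;  // tracks whether the outer copy is up to date
} DualPrecisionQDataSketch;
```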

Member:

You're anticipating a perf benefit from copying single->single instead of double->single? Makes sense, though we'd probably want to be a bit thoughtful about possible cases where the vector is created in double and edited in single.

Collaborator Author:

That kind of dual-pointer complexity is what I was trying to avoid with this approach. For GPUs it adds substantially more complexity, since a CeedVector already has 2+ pointers for the host and device memory spaces.

Member:

The perf benefit would be from having the heaviest data structure in the operator be half the size at rest. I definitely don't want to scope-creep it into this PR, but I think it's something very natural for people to want. The complexity is sort of an outer product with the host/device memory spaces, though we don't need to implement the algebraic operations if we're only using that functionality for qdata.

@zatkins-dev force-pushed the zach/mixed-precision branch 2 times, most recently from 48750fd to 78ee22c on July 24, 2025 20:04
@zatkins-dev force-pushed the zach/mixed-precision branch from ece208a to 4ab0240 on July 24, 2025 21:36