
Conversation

@zatkins-dev (Collaborator)

Adds support for mixed-precision operators, set at creation time, for the CUDA gen backend. This is made possible by defining a second scalar type, CeedScalarCPU, which is always double precision, and by changing the CeedScalar type during the JiT pass when the CEED_JIT_MIXED_PRECISION define is set.
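
For reference, a minimal sketch of that type selection (illustrative only; the actual JiT headers may differ in detail):

```cpp
// Sketch of the JiT-time type selection described above.
// CeedScalarCPU is always the storage (double) precision; CeedScalar becomes
// single precision inside the generated code when mixed precision is requested.
typedef double CeedScalarCPU;

#ifdef CEED_JIT_MIXED_PRECISION
typedef float CeedScalar;
#else
typedef double CeedScalar;
#endif
```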

The inputs to the device kernels are always CeedScalarCPU arrays, to avoid juggling multiple pointers inside a CeedVector. In gen, the input and output arrays are only touched at the beginning and end of the kernel, so all of the computation happens in the single-precision CeedScalar arrays we copy values into. This approach minimizes the code differences between mixed- and full-precision runs, essentially just requiring the helper functions to take extra template parameters to ensure the input types are correct.
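
Schematically, the generated kernel converts on load and on store, something like the following (an illustrative sketch rather than the actual generated code; the kernel and variable names here are made up):

```cuda
#include <cuda_runtime.h>

typedef double CeedScalarCPU;  // storage precision, always double
typedef float  CeedScalar;     // compute precision when CEED_JIT_MIXED_PRECISION is set

// Illustrative element kernel: global arrays stay in CeedScalarCPU, while the
// element-local work happens in CeedScalar registers.
extern "C" __global__ void ApplySketch(int num_elem, int elem_size, const CeedScalarCPU *d_u, CeedScalarCPU *d_v) {
  for (int elem = blockIdx.x; elem < num_elem; elem += gridDim.x) {
    for (int i = threadIdx.x; i < elem_size; i += blockDim.x) {
      // Convert on load (double -> float)
      CeedScalar r_u = (CeedScalar)d_u[(size_t)elem * elem_size + i];
      // ... basis action and QFunction evaluation happen in CeedScalar ...
      CeedScalar r_v = r_u;  // placeholder for the real computation
      // Convert on store (float -> double)
      d_v[(size_t)elem * elem_size + i] = (CeedScalarCPU)r_v;
    }
  }
}
```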

Support for mixed-precision operators lives at the backend level, while the actual use of mixed precision is specified per operator, to provide maximum flexibility.

This could also be extended to the CUDA ref backend, though the benefits there would likely be more modest.

@jeremylt and @nbeams, does this seem like a reasonable approach?

@zatkins-dev force-pushed the zach/mixed-precision branch 2 times, most recently from 84d2396 to c08437d on July 11, 2025 15:35
@zatkins-dev (Collaborator, Author)

Should we change the way flops are reported (say, cut them in half) for mixed precision operators?

@zatkins-dev self-assigned this Jul 11, 2025
@jrwrigh (Collaborator) commented Jul 11, 2025

> Should we change the way flops are reported (say, cut them in half) for mixed precision operators?

I don't think so; a FLOP is a FLOP. There can be some other mechanism for adjusting that if the end-user wants to, but by default I don't think the count should be changed.

```diff
   // Reduce sub and super diagonal
   CeedInt p = 0, q = 0, itr = 0, max_itr = n * n * n * n;
-  CeedScalar tol = CEED_EPSILON;
+  CeedScalar tol = 10 * CEED_EPSILON;
```
Collaborator:

Stray temp fix?

Collaborator Author:

No, that's a real fix. It isn't able to get to machine precision now that CEED_EPSILON is machine precision.

Member:

Probably worth putting this into a separate PR if it applies to main?

Collaborator Author:

It technically only applies here, since CEED_EPSILON is a bit bigger on main, but fair.

```diff
   code << tab << "// -----------------------------------------------------------------------------\n";
   code << tab << "extern \"C\" __global__ void " << operator_name
-       << "(CeedInt num_elem, void* ctx, FieldsInt_Cuda indices, Fields_Cuda fields, Fields_Cuda B, Fields_Cuda G, CeedScalar *W, Points_Cuda "
+       << "(CeedInt num_elem, void* ctx, FieldsInt_Cuda indices, Fields_Cuda fields, Fields_Cuda B, Fields_Cuda G, CeedScalarCPU *W, Points_Cuda "
```
Contributor:

Personally, I'm not a fan of the CeedScalarCPU name, especially since it's also the precision used by other GPU code beyond the mixed-precision operators, not just on the CPU. Is there a reason we can't keep this as CeedScalar (the base precision) and use a different name for the precision that changes inside the operator? Something like CeedScalarOpCompute, CeedScalarMixedCompute, or a better name someone else comes up with...
Or CeedScalarCPU could maybe become CeedScalarBase?

Collaborator Author:

The issue with that is that user code, e.g. in QFunctions, uses CeedScalar internally. By changing what CeedScalar means, we avoid requiring all user code to switch to a different scalar type in order to support mixed-precision operators.

I think CeedScalarBase would be reasonable, though.
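
For context, a typical user QFunction only ever refers to CeedScalar, e.g. this simplified mass apply (field layout illustrative):

```c
#include <ceed/types.h>

// Simplified mass-apply QFunction: the user only names CeedScalar, so
// retargeting CeedScalar during JiT requires no changes to code like this.
CEED_QFUNCTION(MassApplySketch)(void *ctx, const CeedInt Q, const CeedScalar *const *in, CeedScalar *const *out) {
  const CeedScalar *q_data = in[0], *u = in[1];
  CeedScalar       *v      = out[0];

  for (CeedInt i = 0; i < Q; i++) v[i] = q_data[i] * u[i];
  return 0;
}
```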

Member:

Yeah, that is the "at rest" or "storage" or "outer" representation (not opposed to "base" either).

What do we anticipate doing about qdata? Will it also always be converted for the QFunction, or might it be stored in the inner representation? (I can imagine running kernels in f32 while Krylov runs in f64, though it looks like we've been going the other way in the examples.)

Collaborator Author:

The qdata is copied into arrays of type CeedScalar inside the kernel, so the user keeps using their base precision and it is converted internally for the QFunction.

Collaborator Author:

The examples go both ways: both higher and lower precision for the internal operator. I think f64 base with f32 inner is the primary use case, though.

Member:

We may later want to support qdata at rest in the inner representation (specifically, when the inner representation is smaller). I can imagine that being done by making CeedVector handle two pointers (it won't even need axpy and dot product functionality).
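
Roughly the shape I'm imagining (a hypothetical sketch only; none of these names exist in libCEED, and this is not part of this PR):

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical dual-representation storage for at-rest qdata: the inner
// (compute-precision) array is what the operator kernels read, while the
// outer (base-precision) array is only materialized if someone requests it.
typedef struct {
  double *values_outer;    // base/storage precision copy, possibly unallocated
  float  *values_inner;    // compute precision copy, authoritative for qdata
  size_t  length;
  bool    outer_is_valid;  // tracks whether the outer copy is up to date
} DualPrecisionQDataSketch;
```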

Member:

You're anticipating a perf benefit from copying single->single instead of double->single? Makes sense, though we'd probably want to be a bit thoughtful about possible cases where the vector is created in double and edited in single.

Collaborator Author:

That kind of dual-pointer complexity is what I was trying to avoid with this approach. For GPUs it adds substantially more complexity, since a CeedVector already has 2+ pointers for the host and device memory spaces.

Member:

The perf benefit would be from having the heaviest data structure in the operator be half the size at rest. I definitely don't want to scope-creep it into this PR, but I think it's something very natural for people to want. The complexity is sort of an outer product with the host/device memory spaces, though we don't need to implement the algebraic operations if we're only using that functionality for qdata.

@zatkins-dev force-pushed the zach/mixed-precision branch 2 times, most recently from 48750fd to 78ee22c on July 24, 2025 20:04
@zatkins-dev force-pushed the zach/mixed-precision branch from ece208a to 4ab0240 on July 24, 2025 21:36