gen - add support for mixed precision operators #1853
Conversation
Force-pushed from 84d2396 to c08437d
Should we change the way FLOPs are reported (say, cut them in half) for mixed precision operators?

I don't think so; a FLOP is a FLOP. There can be some other mechanism for adjusting that if the end user wants to, but by default I don't think the count should be changed.
```diff
  // Reduce sub and super diagonal
  CeedInt p = 0, q = 0, itr = 0, max_itr = n * n * n * n;
- CeedScalar tol = CEED_EPSILON;
+ CeedScalar tol = 10 * CEED_EPSILON;
```
Stray temp fix?
No, that's a real fix. It isn't able to get to machine precision now that CEED_EPSILON is machine precision.
Probably worth putting into a separate PR if it applies to main?
It technically only applies here, since CEED_EPSILON is a bit bigger on main, but fair.
Force-pushed from 4a7a755 to 767aa77
```diff
  code << tab << "// -----------------------------------------------------------------------------\n";
  code << tab << "extern \"C\" __global__ void " << operator_name
-      << "(CeedInt num_elem, void* ctx, FieldsInt_Cuda indices, Fields_Cuda fields, Fields_Cuda B, Fields_Cuda G, CeedScalar *W, Points_Cuda "
+      << "(CeedInt num_elem, void* ctx, FieldsInt_Cuda indices, Fields_Cuda fields, Fields_Cuda B, Fields_Cuda G, CeedScalarCPU *W, Points_Cuda "
```
Personally, I'm not a fan of the CeedScalarCPU name, especially since it's the precision that will be used by other GPU things beyond the mixed precision operators as well, not just on the CPU. Is there a reason why we can't keep this as CeedScalar (the base precision), and use a different name for the precision that will be changed in the operator? Like CeedScalarOpCompute, CeedScalarMixedCompute, or some other better name someone else comes up with...

Or CeedScalarCPU could maybe be CeedScalarBase?
The issue with that is that user code, e.g. in QFunctions, uses CeedScalar internally. By changing what CeedScalar means, we avoid requiring all user code to adopt a different scalar type to support mixed precision operators.

I think CeedScalarBase would be reasonable, though.
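(For concreteness, a minimal sketch of the two-type scheme being discussed; only the CeedScalarCPU/CeedScalar names and the CEED_JIT_MIXED_PRECISION define come from this PR, the exact spelling of the typedefs is assumed.)

```cuda
// Sketch only, not the PR's exact code: CeedScalarCPU is always the base
// (storage) precision, while CeedScalar becomes the compute precision when
// the JiT pass defines CEED_JIT_MIXED_PRECISION.
typedef double CeedScalarCPU;  // "at rest"/outer representation, always double

#ifdef CEED_JIT_MIXED_PRECISION
typedef float CeedScalar;          // reduced compute precision inside kernels
#else
typedef CeedScalarCPU CeedScalar;  // full precision: the two types coincide
#endif
```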
Yeah, that is the "at rest" or "storage" or "outer" representation (not opposed to "base" either).

What do we anticipate doing about qdata? Will it also always be converted for the QFunction or might it be stored in the inner representation? (I can imagine running kernels in f32 while Krylov runs in f64, though it looks like we've been going the other way in examples.)
The qdata is copied into arrays of type CeedScalar inside the kernel, so the user keeps using their base precision and it is internally converted for the QFunction.
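(A hedged sketch of that copy-and-convert step; the function name, template parameter, and strided indexing below are hypothetical, not taken from the PR.)

```cuda
// Sketch: qdata stays in global memory in the base precision (CeedScalarCPU);
// each thread converts its entries into a local CeedScalar array before the
// QFunction runs. CeedInt is libCEED's integer type.
template <int NUM_COMP>
inline __device__ void LoadQData(const CeedScalarCPU *__restrict__ d_qdata,
                                 const CeedInt stride, CeedScalar *r_qdata) {
  for (CeedInt c = 0; c < NUM_COMP; c++) {
    // Hypothetical layout: components strided, one quadrature point per thread
    r_qdata[c] = (CeedScalar)d_qdata[c * stride + threadIdx.x];
  }
}
```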
The examples go both ways - both higher and lower precision for the internal operator. I think that f64 base, f32 inner is the primary use case, though.
We may later want to support qdata at rest in the inner representation (specifically, when the inner representation is smaller). I can imagine that being done by making CeedVector handle two pointers (it won't even need axpy and dot product functionality).
You're anticipating a perf benefit from copying single->single instead of double->single? Makes sense, though we'd probably want to be a bit thoughtful about possible cases where the vector is created in double and edited in single?
That kind of dual pointer complexity was what I was trying to avoid with this approach. For GPUs, that adds substantially more complexity due to already having 2+ pointers.
The perf benefit would be from having the heaviest data structure in the operator be half the size at rest. I definitely don't want to scope-creep it into this PR, but I think it's something very natural for people to want. The complexity is sort of an outer product with the host/device memory spaces, though we don't need to implement the algebraic operations if we're only using that functionality for qdata.
Force-pushed from 48750fd to 78ee22c
Force-pushed from ece208a to 4ab0240
Adds support for mixed precision operators, set at creation time, for the CUDA gen backend. This is made possible by defining a second scalar type, CeedScalarCPU, which is always double precision, and changing the CeedScalar type during the JiT pass if the CEED_JIT_MIXED_PRECISION define is set.

The inputs to the device kernels are always CeedScalarCPU arrays, to avoid having to muck around with multiple pointers and such in a CeedVector. In gen, we only touch the input and output arrays at the beginning and end of the kernel, so all of the computation happens in the single precision CeedScalar arrays we copy values into. This approach minimizes the code differences between mixed and full precision runs, essentially just requiring the helper functions to have extra template parameters to ensure the input types are correct.

The support for mixed precision operators is at the backend level, while the actual usage of mixed precision operations is defined per-operator to provide maximal flexibility.
This can be extended to the CUDA ref backend too, though the benefits will likely be more modest.
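(To illustrate the "extra template parameters" point from the description, a minimal sketch of a gen-style helper; the names ReadAndConvert, ScalarIn, and ScalarOut are hypothetical.)

```cuda
// Sketch: templating on both scalar types lets one helper cover the full
// precision case (ScalarIn == ScalarOut) and the mixed precision case
// (e.g. ScalarIn = CeedScalarCPU = double, ScalarOut = CeedScalar = float).
template <typename ScalarIn, typename ScalarOut, int LENGTH>
inline __device__ void ReadAndConvert(const ScalarIn *__restrict__ in,
                                      ScalarOut *out) {
  for (int i = 0; i < LENGTH; i++) out[i] = (ScalarOut)in[i];
}
```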
@jeremylt and @nbeams, does this seem like a reasonable approach?