The current implementation of distributed Permute, i.e. permuteOnCPU, allocates memory at schedule time for:
- `inverse_perms`
- `local2global_index`
- `packing_index`
- `unpacking_index`
- `mat_send`
- `mat_recv`
While the index buffers are a minor waste and not really a concern, `mat_send` and `mat_recv` are more "expensive" and might become a limiting factor for big runs.
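
For illustration, here is a minimal sketch of the eager pattern described above (hypothetical names and sizing, not the actual permuteOnCPU code): every buffer, including the two matrix-sized ones, is allocated when the operation is scheduled rather than when it runs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the eager pattern: all buffers are allocated up
// front at schedule time, even though mat_send / mat_recv are only needed
// once the permutation actually runs.
struct PermuteWorkspaceEager {
  std::vector<int>    inverse_perms;
  std::vector<int>    local2global_index;
  std::vector<int>    packing_index;
  std::vector<int>    unpacking_index;
  std::vector<double> mat_send;  // sized like the local sub-matrix: the expensive part
  std::vector<double> mat_recv;  // same size as mat_send

  PermuteWorkspaceEager(std::size_t n_indices, std::size_t local_matrix_elems)
      : inverse_perms(n_indices),
        local2global_index(n_indices),
        packing_index(n_indices),
        unpacking_index(n_indices),
        mat_send(local_matrix_elems),
        mat_recv(local_matrix_elems) {}
};
```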
For example, the D&C tridiagonal solver calls this function multiple times on different sub-matrices obtained by splitting the original matrix in different ways. This results in allocating a lot of "support memory" all at once at schedule time, when it could, at least in principle, be allocated:
- just when needed (at runtime),
- and, even better, re-used among different calls, which will not run in parallel anyway due to the nature of the algorithm (see the sketch below).
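
One possible shape for both points, sketched under the assumption that the calls are strictly sequential (names such as `Workspace` and `permuteSubMatrix` are illustrative only, not existing API): a caller-owned workspace that grows lazily on first use and is passed from one call to the next.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of lazy allocation plus reuse across sequential calls.
class Workspace {
 public:
  // Return a buffer of at least `elems` doubles, growing the backing
  // storage only when a larger request than any previous one arrives.
  std::vector<double>& sendBuffer(std::size_t elems) { return grow(send_, elems); }
  std::vector<double>& recvBuffer(std::size_t elems) { return grow(recv_, elems); }

 private:
  static std::vector<double>& grow(std::vector<double>& buf, std::size_t elems) {
    if (buf.size() < elems) buf.resize(elems);  // allocate only when actually needed
    return buf;
  }
  std::vector<double> send_;
  std::vector<double> recv_;
};

// Each call borrows mat_send / mat_recv from the caller-owned workspace
// instead of allocating its own buffers at schedule time.
void permuteSubMatrix(/* sub-matrix view, permutation, ... */ Workspace& ws,
                      std::size_t local_matrix_elems) {
  auto& mat_send = ws.sendBuffer(local_matrix_elems);
  auto& mat_recv = ws.recvBuffer(local_matrix_elems);
  // ... pack into mat_send, communicate, unpack from mat_recv ...
  (void)mat_send;
  (void)mat_recv;
}
```

With this pattern the D&C solver would end up with at most one pair of send/receive buffers, sized for the largest sub-matrix it ever permutes, instead of one pair per scheduled call.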