
AUTOMATIC MIXED PRECISION #33

@zhaowenZhou

Has anyone tried torch.cuda.amp?
ms_attention doesn't seem to support fp16, even after I modified ms_deform_attn_forward_cuda.
Is there another way to get AMP working? Or any way to reduce GPU memory? I hit a CUDA OOM at bs=4 every time.
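
For reference, a common workaround when a custom CUDA kernel only supports float32 is to keep that one op in fp32 while the rest of the model runs under autocast. A minimal sketch, assuming the kernel is wrapped in a `torch.autograd.Function` (`FP32OnlyOp` below is a toy stand-in, not the real `MSDeformAttnFunction`):

```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class FP32OnlyOp(torch.autograd.Function):
    """Toy stand-in for a custom kernel that only handles float32."""

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # fp16 inputs under autocast arrive here as fp32
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return x @ w  # imagine this calls the float32-only CUDA kernel

    @staticmethod
    @custom_bwd  # backward runs with the same autocast state as forward
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        return grad_out @ w.t(), x.t() @ grad_out
```

With that in place, the usual `autocast` + `GradScaler` training loop should work, and the fp32 outputs can flow back into autocast regions without extra casts. If you'd rather not touch the Function, wrapping the call site in `with torch.cuda.amp.autocast(enabled=False):` and calling `.float()` on the inputs achieves the same effect.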

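On memory: if AMP stays blocked, gradient accumulation trades batch size for steps without changing the effective batch, and `torch.utils.checkpoint` on the encoder layers trades compute for activation memory. A minimal accumulation sketch, with hypothetical stand-ins for the model and data:

```python
import torch
from torch import nn

# Toy model/data; only the accumulation pattern matters here.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(8)]

accum_steps = 4  # bs=1 per forward pass, effective batch size 4
optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()  # gradients sum across the accumulation window
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that batch-norm statistics still see the small per-step batch, so the two are not perfectly equivalent for models with BN.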