Hi, I am trying to run a GPT-2 model with block size 2048, and I cannot use a batch size larger than 16 because activation memory becomes too large.
To reduce activation memory I already use DeepSpeed activation checkpointing on each transformer block, plus AMP.
I saw there is also an option to partition/shard activations, advertised by Megatron. But when I enable it I see no effect at all: memory usage stays the same and the maximum batch size does not increase.
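For reference, here is roughly how I am enabling it. This is a minimal sketch of my setup, not my exact script; the model/variable names (`model`, `model.transformer.h`) are placeholders, and I am assuming the standard DeepSpeed config keys and checkpointing API:

```python
import deepspeed
import deepspeed.checkpointing as ds_ckpt

ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "fp16": {"enabled": True},  # AMP
    "activation_checkpointing": {
        # the option I expected to shard/partition activations
        "partition_activations": True,
        "contiguous_memory_optimization": False,
        "cpu_checkpointing": False,
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                # placeholder: my GPT-2 model
    model_parameters=model.parameters(),
    config=ds_config,
)

def forward_blocks(hidden_states):
    # each transformer block is checkpointed via the DeepSpeed API
    for block in model.transformer.h:  # GPT-2 style block list (placeholder)
        hidden_states = ds_ckpt.checkpoint(block, hidden_states)
    return hidden_states
```

Is there anything else I need to configure for `partition_activations` to actually take effect?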