[CB] Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching

### Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching

Implement the optimization idea by @JRosenkranz: run prefill only up to the next multiple of the block size, then during decode pad with a (valid) block ID. This reduces prefill compute and does not waste any valid block IDs when whole blocks are padded to make tkv homogeneous.
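A minimal sketch of the two steps described above, not the code in this PR. The helper names (`padded_prefill_len`, `pad_block_tables`), the block size of 64, and the choice to pad on the left are assumptions made for illustration only.

```python
from typing import List

BLOCK_SIZE = 64  # assumed value, for illustration only


def padded_prefill_len(prompt_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Round the prompt length up to the next multiple of the block size.

    Prefill then runs on this length instead of a larger fixed padded length.
    """
    return (prompt_len + block_size - 1) // block_size * block_size


def pad_block_tables(block_tables: List[List[int]], pad_block_id: int) -> List[List[int]]:
    """Pad every block table to the longest one so tkv is homogeneous.

    The padding entries reuse an already valid block ID (hypothetical
    `pad_block_id`), so no extra blocks are reserved for padding.
    Left padding is an assumption of this sketch.
    """
    max_blocks = max(len(table) for table in block_tables)
    return [
        [pad_block_id] * (max_blocks - len(table)) + table
        for table in block_tables
    ]


# A 100-token prompt is prefilled on 128 positions (2 blocks of 64).
prefill_len = padded_prefill_len(100)

# Two sequences with 2 and 1 allocated blocks; the shorter one is padded
# with block ID 0, which is assumed to be valid here.
tables = pad_block_tables([[3, 4], [7]], pad_block_id=0)
```

With these inputs, `prefill_len` is 128 and `tables` is `[[3, 4], [0, 7]]`: the second sequence gains one padding entry that points at an existing block rather than a newly allocated one.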

Related PR: #262