Regarding label coordinate processing.

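Hello, many methods use box coordinates scaled to the 0-1 range, but the input image is resized and padded before it reaches the model (for example, in LLaVA).
This doesn't seem to make sense: coordinates normalized against the original image no longer line up with the padded image the model actually sees.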
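
To make the concern concrete, here is a minimal sketch of how a normalized box would shift. It assumes the image is padded to a square with the original content centered and then resized uniformly, which is my understanding of LLaVA's "pad" mode. The function name `box_to_padded_square` is hypothetical and only for illustration.

```python
def box_to_padded_square(box, width, height):
    """Map a 0-1 normalized box on the original image to 0-1 coordinates
    on the image after padding it to a centered square.

    Assumption: padding is split evenly on both sides (centered). A uniform
    resize of the square afterwards leaves normalized coordinates unchanged.
    """
    side = max(width, height)
    pad_x = (side - width) // 2   # left padding in pixels
    pad_y = (side - height) // 2  # top padding in pixels

    x1, y1, x2, y2 = box
    return (
        (x1 * width + pad_x) / side,
        (y1 * height + pad_y) / side,
        (x2 * width + pad_x) / side,
        (y2 * height + pad_y) / side,
    )


# A box covering the whole 640x480 image no longer spans 0-1 vertically
# once the image is padded to 640x640.
print(box_to_padded_square((0.0, 0.0, 1.0, 1.0), 640, 480))
# (0.0, 0.125, 1.0, 0.875)
```

So with this kind of preprocessing, labels normalized on the original image and labels normalized on the padded image differ whenever the image is not square.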

What is the right way to process label coordinates in an MLLM for OVD or DOD tasks?