I'm curious about the pure-text data format used in DeepSeek-VL2 pre-training, but I haven't found specific details.
In LLM pre-training, the loss is typically computed over all tokens. In traditional MLLM pre-training, however, the data often takes a multi-turn QA format, and only the answer tokens are used for loss computation; the system prompt and question are usually masked.
So, what approach does DeepSeek-VL2 use?
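
For reference, here is a minimal sketch of the two conventions I'm contrasting above (my own illustration, not DeepSeek-VL2's actual code). It assumes a PyTorch / Hugging Face-style setup where label positions set to -100 are ignored by the cross-entropy loss; the function names are hypothetical.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels_qa(prompt_ids: list[int], answer_ids: list[int]) -> dict:
    """QA-style masking: concatenate prompt and answer, mask the prompt in the labels."""
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX  # loss computed only on answer tokens
    return {"input_ids": input_ids, "labels": labels}

def build_labels_lm(token_ids: list[int]) -> dict:
    """LM-style pre-training: every token contributes to the loss."""
    input_ids = torch.tensor(token_ids)
    return {"input_ids": input_ids, "labels": input_ids.clone()}
```

The question is essentially which of these two labeling schemes (or some mix) DeepSeek-VL2 applies to its pure-text pre-training data.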