Dear Professor,
I have a question about your experiment. For the distil-logits task you used the models "teacher": "arcee-ai/Arcee-Spark" and "student": "Qwen/Qwen2-1.5B". How can distillation be performed when the two models' vocabulary sizes differ and the token indices in their vocabularies do not align? The teacher's logits have shape [b, seq_len, vocabulary_size_spark], while the student's logits have shape [b, seq_len, vocabulary_size_qwen], so a standard KL-divergence loss between the two distributions is not directly computable. How can distillation be carried out under these circumstances?
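To make the shape mismatch concrete, here is a minimal sketch of the issue and one common workaround, truncating both logit tensors to a shared vocabulary prefix. The vocabulary sizes below are hypothetical placeholders, and this truncation is only valid if the two tokenizers agree on the token indices within that prefix, which is precisely the assumption I am unsure holds here:

```python
import torch
import torch.nn.functional as F

b, seq_len = 2, 8
vocab_teacher, vocab_student = 2000, 1500  # hypothetical, mismatched sizes

teacher_logits = torch.randn(b, seq_len, vocab_teacher)
student_logits = torch.randn(b, seq_len, vocab_student)

# A direct F.kl_div(student, teacher) fails: the last dims differ.
# Workaround sketch: restrict both to the shared prefix [0, V) and
# compare distributions only over those tokens.
V = min(vocab_teacher, vocab_student)
loss = F.kl_div(
    F.log_softmax(student_logits[..., :V], dim=-1),
    F.softmax(teacher_logits[..., :V], dim=-1),
    reduction="batchmean",
)
```

Is a prefix truncation like this what you used, or does the pipeline instead rely on some explicit mapping between the two vocabularies?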