Dear Professor,
I have a question about your experiment. For the distil-logits task you used the models "teacher": "arcee-ai/Arcee-Spark" and "student": "Qwen/Qwen2-1.5B". How can distillation be performed when the two models' vocabulary sizes differ and the token indices in their vocabularies do not align? The teacher's logits have shape [b, seq_len, vocabulary_size_spark], while the student's logits have shape [b, seq_len, vocabulary_size_qwen], so a standard KL-divergence loss between the two distributions is not directly computable. How can distillation be carried out under these circumstances?
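To make the shape mismatch concrete, here is a minimal sketch of the issue and one common workaround, truncating both logit tensors to a shared vocabulary prefix. The vocabulary sizes below are hypothetical placeholders, and this truncation is only valid if the two tokenizers agree on the token indices within that prefix, which is precisely the assumption I am unsure holds here:

```python
import torch
import torch.nn.functional as F

b, seq_len = 2, 8
vocab_teacher, vocab_student = 2000, 1500  # hypothetical, mismatched sizes

teacher_logits = torch.randn(b, seq_len, vocab_teacher)
student_logits = torch.randn(b, seq_len, vocab_student)

# A direct F.kl_div(student, teacher) fails: the last dims differ.
# Workaround sketch: restrict both to the shared prefix [0, V) and
# compare distributions only over those tokens.
V = min(vocab_teacher, vocab_student)
loss = F.kl_div(
    F.log_softmax(student_logits[..., :V], dim=-1),
    F.softmax(teacher_logits[..., :V], dim=-1),
    reduction="batchmean",
)
```

Is a prefix truncation like this what you used, or does the pipeline instead rely on some explicit mapping between the two vocabularies?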