Description
For my use case it would be extremely useful to perform distillation with a KL objective, using a larger, gradient-free teacher model and a smaller student model that carries the loss/gradients.
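For reference, the objective itself is simple once both sets of logits are available; the hard part is getting two engines set up. A minimal sketch of the KL loss I have in mind (standard temperature-scaled distillation, not anything ArcticTraining-specific; the helper name is mine):

```python
import torch.nn.functional as F


def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 so the gradient magnitude is preserved.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```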
I found this feature implemented in Accelerate's DeepSpeed integration last year (a rough sketch of the flow follows the links):
huggingface/accelerate#2496
huggingface/accelerate#3097
https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model
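If I read the guide correctly, the flow looks roughly like the sketch below: one DeepSpeedPlugin per model, then prepare each model with its plugin selected. The model names, config paths, and ZeRO stages are placeholders, and the exact API should be checked against the linked docs:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
from torch.optim import AdamW
from transformers import AutoModelForCausalLM

# Placeholder model names and config paths -- not from the ArcticTraining codebase.
student = AutoModelForCausalLM.from_pretrained("small-student-model")
teacher = AutoModelForCausalLM.from_pretrained("large-teacher-model")
optimizer = AdamW(student.parameters(), lr=1e-5)

# One DeepSpeedPlugin (and therefore one DeepSpeed config) per model.
deepspeed_plugins = {
    "student": DeepSpeedPlugin(hf_ds_config="student_zero2_config.json"),
    "teacher": DeepSpeedPlugin(hf_ds_config="teacher_zero3_config.json"),
}
accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)

# The first plugin in the dict is active by default, so the student (the model that
# actually trains) is prepared first, together with its optimizer.
student, optimizer = accelerator.prepare(student, optimizer)

# Switch plugins before preparing the gradient-free teacher.
accelerator.state.select_deepspeed_plugin("teacher")
teacher = accelerator.prepare(teacher)
```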
So it's possible that supporting multiple models is just a matter of following the tutorial above and applying it to model loading in the trainer's `__init__`:
`def __init__(self, config: TrainerConfig, mode: str = "train") -> None:`
If this is the case, that would be awesome, and it probably means I can accomplish this solely by creating a custom Trainer class, which I understand is expected of users anyway.
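Purely as an illustration of what I have in mind, a hypothetical custom trainer might only need to override the loss computation, assuming the teacher engine were made available as an attribute during init. None of the attribute/method names or import paths below are confirmed against the ArcticTraining API:

```python
import torch
import torch.nn.functional as F

from arctic_training import SFTTrainer  # import path is a guess


class KDTrainer(SFTTrainer):
    name = "knowledge-distillation"

    def loss(self, batch):
        # Student forward pass goes through the usual DeepSpeed engine.
        student_logits = self.model(**batch).logits

        # Teacher forward pass is gradient-free; `self.teacher` is assumed to be a
        # second, frozen engine created during trainer init -- it is not an existing
        # ArcticTraining attribute.
        with torch.no_grad():
            teacher_logits = self.teacher(**batch).logits

        t = 2.0  # distillation temperature
        return F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean",
        ) * t**2
```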
Reading through the tutorial and issues above plus arctic_training.trainer, I think what is needed is to rewrite model loading and engine initialization using accelerate.utils.DeepSpeedPlugin. My main confusion is that I don't know where the Accelerator init and prepare calls happen in ArcticTraining, or whether they happen at all. With that in mind, how can the DeepSpeed/Accelerate setup be mapped onto the multiple-models tutorial? The Accelerator calls seem necessary to follow the exact flow in the tutorial.
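Alternatively, if ArcticTraining drives DeepSpeed directly and there is no Accelerator in the loop, maybe the teacher could simply get its own optimizer-free engine next to the student's, similar to how reference models are often handled. A hedged sketch, with a placeholder model name and config values that would need to match the student's settings:

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Minimal ZeRO-3 config for a forward-only teacher; values are placeholders.
teacher_ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

teacher = AutoModelForCausalLM.from_pretrained("large-teacher-model")  # placeholder
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Calling deepspeed.initialize without an optimizer yields an engine that can shard the
# teacher's parameters (ZeRO-3) for memory but never allocates optimizer state.
teacher_engine, *_ = deepspeed.initialize(model=teacher, config=teacher_ds_config)
```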
tl;dr: this feature might be something users are expected to write themselves, or it may require some rejigging of the trainer; I am just not sure.