Hi, thank you for the amazing work with NBoost.
My question is regarding TinyBERT. According to the accompanying blog post, TinyBERT is obtained via knowledge distillation from a larger BERT model that is pre-trained on MS-MARCO.
How does this approach compare with training a TinyBERT architecture from scratch on the MS-MARCO dataset?
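For context, here is a minimal sketch of the distillation objective I understand the blog post to describe, contrasted with the plain cross-entropy loss that training from scratch would use. This is purely illustrative, not NBoost's actual code; the temperature `T` and mixing weight `alpha` are assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target loss: KL divergence between the temperature-scaled
    # teacher and student distributions (the distillation signal the
    # student gets from the larger BERT teacher).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target loss: ordinary cross-entropy against the MS-MARCO
    # labels, i.e. the only signal available when training from scratch.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In other words, my question is whether the soft-target term above buys TinyBERT anything measurable over using the hard-target term alone.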