I noticed that in Table 12 of your paper, the hyperparameter $\beta$ is set to a very low value ($10^{-8}$), which suggests that the proposed code-based distillation process plays an almost negligible role during training. This is quite puzzling.
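To make the concern concrete, here is a minimal sketch of the weighting I am assuming (the objective form and the loss magnitudes below are my assumptions, not taken from your paper): if $\beta$ simply scales the distillation term in a combined objective $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \beta \, \mathcal{L}_{\text{distill}}$, then at $\beta = 10^{-8}$ its contribution to both the loss value and the gradients is roughly eight orders of magnitude smaller than the task term.

```python
# Hypothetical illustration -- objective form and loss values are assumed,
# not taken from the paper.
beta = 1e-8

# Illustrative loss magnitudes, assuming both terms are O(1) in scale.
task_loss = 2.3      # e.g., a cross-entropy value
distill_loss = 1.7   # e.g., the code-based distillation term

total_loss = task_loss + beta * distill_loss
print(total_loss)            # ~2.300000017: distillation shifts the loss by ~1e-8
print(beta * distill_loss)   # 1.7e-08: its gradients are scaled down the same way
```

Of course, if $\mathcal{L}_{\text{distill}}$ is many orders of magnitude larger than the task loss, or if $\beta$ enters the objective in some other way, this reasoning would not apply; that is part of what I hope you can clarify.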
Could you explain the rationale behind this setting?