Several questions on reproducing the training steps #125
We use the same teachers across stages, and interpolate the output features to match. The current algorithm resamples the student and teacher features to the minimum size of the two.
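For concreteness, here is a minimal sketch of that resampling step, assuming the features are laid out as (B, C, H, W) spatial maps; the function name and the choice of bilinear interpolation are illustrative, not necessarily what the repo does internally:

```python
import torch
import torch.nn.functional as F

def match_feature_sizes(student_feats: torch.Tensor, teacher_feats: torch.Tensor):
    """Resample both (B, C, H, W) feature maps to the smaller of the two spatial sizes."""
    target_h = min(student_feats.shape[-2], teacher_feats.shape[-2])
    target_w = min(student_feats.shape[-1], teacher_feats.shape[-1])

    def resample(x: torch.Tensor) -> torch.Tensor:
        if x.shape[-2:] == (target_h, target_w):
            return x
        return F.interpolate(x, size=(target_h, target_w), mode='bilinear', align_corners=False)

    return resample(student_feats), resample(teacher_feats)
```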
One stage totally makes sense, but out of sheer laziness/time constraints, between stages we actually just start over with the initial 1e-3 learning rate and a new schedule. Totally possible that we're leaving something on the table in that regard.
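As a rough sketch of what "start over" means in practice at a stage boundary (the AdamW optimizer and cosine schedule here are assumptions for illustration, not necessarily the repo's exact setup):

```python
import torch

def start_new_stage(model: torch.nn.Module, stage_steps: int, base_lr: float = 1e-3):
    """Begin a new training stage: reset the LR to its initial value and build a fresh schedule."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=stage_steps)
    return optimizer, scheduler
```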
We actually use bilinear resampling for images. Even true for the images we feed SAM. It totally makes sense to hand higher-res images to the hi-res partition during training, because you're right, most of the images in DataComp-1B actually aren't very hi-res.
Tables 2 and 3 do not include SAM. We re-ran the models in both of those settings. The key difference is that Table 3 used [OpenAI CLIP 336px, DINOv2] as the teachers, whereas Table 2 used [MetaCLIP, DINOv2] as the teachers. This came down to timing as, over the course of writing the paper, both MetaCLIP and DFN CLIP were released to the public. Admittedly, it makes the paper a bit harder to track, but the most important part is just that the ablations are internally consistent.
Your understanding is correct. If multiple teachers are sharing a partition, then they're receiving the same images (although the sizes may be different, based on the teacher input resolution). Teachers on different partitions are operating on different GPUs, and are receiving different data. Teacher overhead changes because in the multi-partition scheme, each teacher's effective batch size is reduced per step. Take this example:

- Batch size: 1024; Teachers: CLIP, DINOv2
- 1 partition: both teachers get 1024 images
- 2 partitions: CLIP gets 512 images, DINOv2 gets 512 images

For simplicity, if we assume that both teachers are equally expensive to run inference on, then in the 2-partition scheme, we cut the teacher overhead in half. The student still sees all 1024 images, but the training signal is not coming from all teachers for all images.
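A toy sketch of the partitioning idea (the helper name and data layout are hypothetical, not the actual implementation):

```python
import torch

def partition_batch_for_teachers(images: torch.Tensor, teacher_groups):
    """Split a batch of images across teacher partitions.

    `teacher_groups` is a list of lists, e.g. [['clip', 'dinov2']] for one partition
    (every teacher sees every image) or [['clip'], ['dinov2']] for two partitions
    (each teacher only sees its slice). The student still consumes the full batch.
    """
    chunks = torch.chunk(images, len(teacher_groups), dim=0)
    assignments = {}
    for group, chunk in zip(teacher_groups, chunks):
        for teacher_name in group:
            assignments[teacher_name] = chunk
    return assignments

# With a batch of 1024 and two partitions, CLIP and DINOv2 each run inference on
# 512 images, roughly halving teacher overhead when their costs are similar.
```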
That would help a lot. Thanks again for your answer!
Hi @mranzinger, I have several questions on the PHI-S paper. In reproducing Figure 1 (visualization), should I use the teacher summaries or the teacher features? Should I do PCA together (by stacking these representations together) or do it separately for each teacher? If I should do PCA together, then how should the representations be aligned to the same dimension (I mean, different teacher features or summaries have different shapes)? Thanks for your answer!
Figure 1 uses the teacher features. To get an estimate of the distributions, we ran about 1k images through a teacher, computed the PCA from C->2 on all features from all images for that teacher, and that is what you see in the plot.
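A minimal sketch of that procedure, assuming you have collected (B, N, C) patch-token features per teacher; the helper name and the use of `torch.pca_lowrank` are illustrative choices, not the authors' code:

```python
import torch

@torch.no_grad()
def pca_project_teacher_features(feature_batches, out_dim: int = 2) -> torch.Tensor:
    """Project one teacher's spatial features from C dims down to `out_dim` via PCA.

    `feature_batches` is an iterable of (B, N, C) feature tensors gathered from ~1k
    images, all produced by the same teacher; each teacher gets its own PCA.
    """
    feats = torch.cat([f.reshape(-1, f.shape[-1]) for f in feature_batches], dim=0)  # (B*N, C)
    feats = feats - feats.mean(dim=0, keepdim=True)
    # Randomized low-rank PCA; columns of V are the principal directions.
    _, _, v = torch.pca_lowrank(feats, q=out_dim, center=False)
    return feats @ v  # (B*N, out_dim) points for the scatter plot
```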
@mranzinger Thanks so much for your answer. I have another question: How should I get the PHI-S matrix that can be used during training, since the whole training dataset is too big to calculate the data covariance matrix? Should I sample a small portion of the data, run it through the teacher, compute the matrix from these features, and use it for the whole training/inference process? Or should I dynamically update the matrix as each minibatch comes in?
I also wonder why there isn't an inversion module in the RADIOModel structure. As you mentioned in the PHI-S paper, if the student learns the normalized distribution, the outputs should be "de-normalized", but it seems the pretrained weights can be used as plug-in models. Thanks for your answer!
The way we do it is by dynamically updating the matrix every 100 minibatches, up to the first 3000 minibatches. We use standard running mean and covariance tracking to do this. Once we hit 3k minibatches, we just freeze the matrix and use that for the rest of training.
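For reference, standard streaming mean-and-covariance tracking (Chan's parallel update) looks roughly like the sketch below; the class and the handling of the 100/3000-minibatch policy are illustrative, not the repo's actual code:

```python
import torch

class RunningMoments:
    """Streaming mean/covariance tracker (one per teacher).

    The caller would recompute the PHI-S matrix from `cov` every 100 minibatches
    and freeze it after the first 3000 minibatches.
    """
    def __init__(self, dim: int):
        self.mean = torch.zeros(dim)
        self.m2 = torch.zeros(dim, dim)  # accumulated outer-product deviations
        self.count = 0

    @torch.no_grad()
    def update(self, feats: torch.Tensor):
        """Merge a batch of (N, C) teacher feature vectors into the running stats."""
        n = feats.shape[0]
        batch_mean = feats.mean(dim=0)
        centered = feats - batch_mean
        batch_m2 = centered.T @ centered
        delta = batch_mean - self.mean
        total = self.count + n
        self.m2 += batch_m2 + torch.outer(delta, delta) * (self.count * n / total)
        self.mean += delta * (n / total)
        self.count = total

    @property
    def cov(self) -> torch.Tensor:
        return self.m2 / max(self.count - 1, 1)
```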
We have a post-training export script that will modify the final linear layer of each adaptor to perform the inversion. We detail the process in section A.6 of the PHI-S paper. Namely, we replace the weights of the final linear layer using equation 44, with the Θ coming from equation 50.
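Conceptually, folding the de-normalization into the final linear layer is just composing two affine maps. A hedged sketch follows; the actual matrices come from equations 44 and 50 in the paper, and are called `inv_matrix` and `mean` here purely for illustration:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_denormalization(final_linear: nn.Linear, inv_matrix: torch.Tensor, mean: torch.Tensor):
    """Fold a de-normalization step into an adaptor's final linear layer.

    If the layer produces normalized features y = W x + b and the teacher-space
    features are z = inv_matrix @ y + mean, then the composed map is
    z = (inv_matrix @ W) x + (inv_matrix @ b + mean), so no extra inversion
    module is needed at inference time.
    """
    final_linear.weight.copy_(inv_matrix @ final_linear.weight)
    final_linear.bias.copy_(inv_matrix @ final_linear.bias + mean)
```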
Appreciate it!
Hi,
I appreciate your work so much. However, when I tried to reproduce some of the training results, I ran into several questions about the implementation details.
I'm an undergraduate and don't have much experience with model training, so some of these questions might sound trivial.
Thanks so much for your answers!