Several questions on reproducing the training steps #125
We use the same teachers across stages, and interpolate the output features to match. The current algorithm resamples the student and teacher features to the minimum size of the two.
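For concreteness, here is a minimal sketch of that resampling step, assuming the features are laid out as (B, C, H, W) spatial maps; the function name and the choice of bilinear interpolation are illustrative, not necessarily what the repo does internally:

```python
import torch
import torch.nn.functional as F

def match_feature_sizes(student_feats: torch.Tensor, teacher_feats: torch.Tensor):
    """Resample both (B, C, H, W) feature maps to the smaller of the two spatial sizes."""
    target_h = min(student_feats.shape[-2], teacher_feats.shape[-2])
    target_w = min(student_feats.shape[-1], teacher_feats.shape[-1])

    def resample(x: torch.Tensor) -> torch.Tensor:
        if x.shape[-2:] == (target_h, target_w):
            return x
        return F.interpolate(x, size=(target_h, target_w), mode='bilinear', align_corners=False)

    return resample(student_feats), resample(teacher_feats)
```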
One stage totally makes sense, but out of sheer laziness/time constraints, between stages we actually just start over with the initial 1e-3 learning rate and a new schedule. Totally possible that we're leaving something on the table in that regard.
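As a rough sketch of what "start over" means in practice at a stage boundary (the AdamW optimizer and cosine schedule here are assumptions for illustration, not necessarily the repo's exact setup):

```python
import torch

def start_new_stage(model: torch.nn.Module, stage_steps: int, base_lr: float = 1e-3):
    """Begin a new training stage: reset the LR to its initial value and build a fresh schedule."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=stage_steps)
    return optimizer, scheduler
```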
We actually use bilinear resampling for images. Even true for the images we feed SAM. It totally makes sense to hand higher-res images to the hi-res partition during training, because you're right, most of the images in DataComp-1B actually aren't very hi-res.
Tables 2 and 3 do not include SAM. We re-ran the models in both of those settings. The key difference is that Table 3 used [OpenAI CLIP 336px, DINOv2] as the teachers, whereas Table 2 used [MetaCLIP, DINOv2] as the teachers. This came down to timing as, over the course of writing the paper, both MetaCLIP and DFN CLIP were released to the public. Admittedly, it makes the paper a bit harder to track, but the most important part is just that the ablations are internally consistent.
Your understanding is correct. If multiple teachers are sharing a partition, then they're receiving the same images (although the sizes may be different, based on the teacher input resolution). Teachers on different partitions are operating on different GPUs, and are receiving different data. Teacher overhead changes because in the multi-partition scheme, each teacher's effective batch size is reduced per step. Take this example:

- Batch size: 1024; Teachers: CLIP, DINOv2
- 1 partition: both teachers get 1024 images
- 2 partitions: CLIP gets 512 images, DINOv2 gets 512 images

For simplicity, if we assume that both teachers are equally expensive to run inference on, then in the 2-partition scheme, we cut the teacher overhead in half. The student still sees all 1024 images, but the training signal is not coming from all teachers for all images.
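A toy sketch of the partitioning idea (the helper name and data layout are hypothetical, not the actual implementation):

```python
import torch

def partition_batch_for_teachers(images: torch.Tensor, teacher_groups):
    """Split a batch of images across teacher partitions.

    `teacher_groups` is a list of lists, e.g. [['clip', 'dinov2']] for one partition
    (every teacher sees every image) or [['clip'], ['dinov2']] for two partitions
    (each teacher only sees its slice). The student still consumes the full batch.
    """
    chunks = torch.chunk(images, len(teacher_groups), dim=0)
    assignments = {}
    for group, chunk in zip(teacher_groups, chunks):
        for teacher_name in group:
            assignments[teacher_name] = chunk
    return assignments

# With a batch of 1024 and two partitions, CLIP and DINOv2 each run inference on
# 512 images, roughly halving teacher overhead when their costs are similar.
```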
That would help a lot. Thanks again for your answer!
Hi @mranzinger, I have several questions on the PHI-S paper. In reproducing Figure 1 (visualization), should I use the teacher summaries or the teacher features? Should I do PCA together (by stacking these representations together) or do it separately for each teacher? If I should do PCA together, then how should the representations be aligned to the same dimension (I mean, different teacher features or summaries have different shapes)? Thanks for your answer!
Figure 1 uses the teacher features. To get an estimate of the distributions, we ran about 1k images through a teacher, computed the PCA from C->2 on all features from all images for that teacher, and that is what you see in the plot.
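A minimal sketch of that procedure, assuming you have collected (B, N, C) patch-token features per teacher; the helper name and the use of `torch.pca_lowrank` are illustrative choices, not the authors' code:

```python
import torch

@torch.no_grad()
def pca_project_teacher_features(feature_batches, out_dim: int = 2) -> torch.Tensor:
    """Project one teacher's spatial features from C dims down to `out_dim` via PCA.

    `feature_batches` is an iterable of (B, N, C) feature tensors gathered from ~1k
    images, all produced by the same teacher; each teacher gets its own PCA.
    """
    feats = torch.cat([f.reshape(-1, f.shape[-1]) for f in feature_batches], dim=0)  # (B*N, C)
    feats = feats - feats.mean(dim=0, keepdim=True)
    # Randomized low-rank PCA; columns of V are the principal directions.
    _, _, v = torch.pca_lowrank(feats, q=out_dim, center=False)
    return feats @ v  # (B*N, out_dim) points for the scatter plot
```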
@mranzinger Thanks so much for your answer. I have another question: How should I get the PHI-S matrix that can be used during training, since the whole training dataset is too big to calculate the data covariance matrix? Should I sample a small portion of the data, run it through the teacher, compute the matrix from these features, and use it for the whole training/inference process? Or should I dynamically update the matrix as each minibatch comes in?
I also wonder why there isn't an inversion module in the RADIOModel structure. As you mentioned in the PHI-S paper, if the student learns the normalized distribution, the outputs should be "de-normalized", but it seems the pretrained weights can be used as plug-in models. Thanks for your answer!
The way we do it is by dynamically updating the matrix every 100 minibatches, up to the first 3000 minibatches. We use standard running mean and covariance tracking to do this. Once we hit 3k minibatches, we just freeze the matrix and use that for the rest of training.
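For reference, standard streaming mean-and-covariance tracking (Chan's parallel update) looks roughly like the sketch below; the class and the handling of the 100/3000-minibatch policy are illustrative, not the repo's actual code:

```python
import torch

class RunningMoments:
    """Streaming mean/covariance tracker (one per teacher).

    The caller would recompute the PHI-S matrix from `cov` every 100 minibatches
    and freeze it after the first 3000 minibatches.
    """
    def __init__(self, dim: int):
        self.mean = torch.zeros(dim)
        self.m2 = torch.zeros(dim, dim)  # accumulated outer-product deviations
        self.count = 0

    @torch.no_grad()
    def update(self, feats: torch.Tensor):
        """Merge a batch of (N, C) teacher feature vectors into the running stats."""
        n = feats.shape[0]
        batch_mean = feats.mean(dim=0)
        centered = feats - batch_mean
        batch_m2 = centered.T @ centered
        delta = batch_mean - self.mean
        total = self.count + n
        self.m2 += batch_m2 + torch.outer(delta, delta) * (self.count * n / total)
        self.mean += delta * (n / total)
        self.count = total

    @property
    def cov(self) -> torch.Tensor:
        return self.m2 / max(self.count - 1, 1)
```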
We have a post-training export script that will modify the final linear layer of each adaptor to perform the inversion. We detail the process in section A.6 of the PHI-S paper. Namely, we replace the weights of the final linear layer using equation 44, with the Θ coming from equation 50.
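Conceptually, folding the de-normalization into the final linear layer is just composing two affine maps. A hedged sketch follows; the actual matrices come from equations 44 and 50 in the paper, and are called `inv_matrix` and `mean` here purely for illustration:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_denormalization(final_linear: nn.Linear, inv_matrix: torch.Tensor, mean: torch.Tensor):
    """Fold a de-normalization step into an adaptor's final linear layer.

    If the layer produces normalized features y = W x + b and the teacher-space
    features are z = inv_matrix @ y + mean, then the composed map is
    z = (inv_matrix @ W) x + (inv_matrix @ b + mean), so no extra inversion
    module is needed at inference time.
    """
    final_linear.weight.copy_(inv_matrix @ final_linear.weight)
    final_linear.bias.copy_(inv_matrix @ final_linear.bias + mean)
```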
Appreciate it!
Hi,
I appreciate your work so much. However, when I tried to reproduce some of the training results, I ran into several questions about the implementation details.
I'm an undergraduate and don't have much experience with model training, so some of these questions might sound trivial.
Thanks so much for your answers!