Hi, this is really exciting work, and thank you for providing such a comprehensive GitHub repo!

I'm currently trying to use RADIO to extract dense (patch-level) CLIP features from an image for open-vocabulary tasks. My code loads RADIO with the CLIP adaptor and reads out clip_summary, clip_features, and clip_text_embeddings (see the sketch further below). I have the following questions:

Why is the feature dimension of clip_summary 1024 (the same as clip_text_embeddings), while the feature dimension of clip_features is 1280?

A ViT-H model has an internal dimension of 1280, and DFN CLIP applies a 1280->1024 linear projection only to the ViT's summary token; our summary head directly learns that 1024-d feature space. The projection is not applied to the spatial features, which is why the dimensions differ.

Is the dense clip_features well aligned with the CLIP space, and can I use it for open-vocabulary tasks?

I am aware of one work under review at CVPR that uses this semantic alignment for 3D open-vocabulary segmentation. A different model to consider is the recently proposed dino.txt, as it explicitly aligns the text and spatial features.
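For concreteness, a minimal sketch of this kind of extraction might look like the following. It assumes the torch.hub entry point, the 'clip' adaptor name, the get_nearest_supported_resolution helper, and the (summary, features) output layout from the RADIO README; the checkpoint version string and the image path are placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

# Load RADIO with the CLIP adaptor attached. The version string is a placeholder;
# any released checkpoint with a ViT-H backbone (1280-d internal width) applies here.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-h', adaptor_names='clip', progress=True)
model.cuda().eval()

# RADIO expects RGB inputs scaled to [0, 1].
img = Image.open('example.jpg').convert('RGB')  # placeholder image path
x = pil_to_tensor(img).float().div_(255.0).unsqueeze(0).cuda()

# Snap to a resolution the model supports (helper described in the RADIO README).
nearest_res = model.get_nearest_supported_resolution(*x.shape[-2:])
x = F.interpolate(x, nearest_res, mode='bilinear', align_corners=False)

with torch.no_grad():
    output = model(x)

# Each adaptor entry unpacks into (summary, spatial features).
clip_summary, clip_features = output['clip']

print(clip_summary.shape)   # (1, 1024): projected into the DFN CLIP embedding space
print(clip_features.shape)  # (1, num_patches, 1280): raw ViT-H width, no projection
```

The two printed shapes are exactly the mismatch discussed above: only the summary token passes through the 1280->1024 projection.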
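On the open-vocabulary side, one consequence of the answer above is that only the 1024-d summary can be compared directly against CLIP text embeddings; the 1280-d spatial features would need an additional projection or alignment step first. A hedged continuation of the sketch above, assuming the CLIP adaptor exposes its tokenizer and encode_text the way the RADIO examples do:

```python
# Continuing from the previous snippet; tokenizer / encode_text are assumed to
# follow the CLIP adaptor interface shown in the RADIO examples.
clip_adaptor = model.adaptors['clip']

prompts = ['a photo of a cat', 'a photo of a dog']
tokens = clip_adaptor.tokenizer(prompts).cuda()

with torch.no_grad():
    clip_text_embeddings = clip_adaptor.encode_text(tokens)  # (2, 1024)

# Summary vs. text: both are 1024-d, so cosine similarity is well defined.
img_emb = F.normalize(clip_summary, dim=-1)
txt_emb = F.normalize(clip_text_embeddings, dim=-1)
similarity = img_emb @ txt_emb.T  # (1, 2)

# The dense clip_features are 1280-d, so the analogous product
# (clip_features @ txt_emb.T) would fail with a shape mismatch (1280 vs 1024).
```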