
Questions About Dense CLIP Features Alignment and Dimensionality in RADIO for Open-Vocabulary Tasks #128


Open
Chuan-10 opened this issue Feb 26, 2025 · 2 comments

Comments

@Chuan-10

Hi, this is really exciting work, and thank you for providing such a comprehensive GitHub repo!

I'm currently trying to use RADIO to extract dense (patch-level) CLIP features from an image for open-vocabulary tasks. My code looks like this:

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

model_version = "radio_v2.5-g"
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version=model_version, adaptor_names='clip', progress=True, skip_validation=True)

model.eval()

image_path = 'DSCF5857.JPG'
image = Image.open(image_path).convert('RGB')

image = pil_to_tensor(image).to(dtype=torch.float32)
image.div_(255.0)  # RADIO expects input values to be between 0 and 1
image = image.unsqueeze(0)  # Add batch dimension
nearest_res = model.get_nearest_supported_resolution(*image.shape[-2:])
image = F.interpolate(image, nearest_res, mode='bilinear', align_corners=False)

with torch.no_grad():
    output = model(image, feature_fmt='NCHW')
    bb_summary, bb_features = output['backbone']
    clip_summary, clip_features = output['clip']
    clip_adaptor = model.adaptors['clip']
    text_inputs = ['metal', 'wood']
    tokens = clip_adaptor.tokenizer(text_inputs)
    clip_text_embeddings = clip_adaptor.encode_text(tokens)

I have the following questions:

  1. Why is the feature dimension of clip_summary 1024 (the same as clip_text_embeddings), but the feature dimension of clip_features is 1280?
  2. Is the dense clip_features well-aligned with the CLIP space, and can I use it for open-vocabulary tasks?
@mranzinger
Collaborator

Why is the feature dimension of clip_summary 1024 (the same as clip_text_embeddings), but the feature dimension of clip_features is 1280?

A ViT-H model has an internal dimension of 1280. DFN CLIP applies a 1280->1024 linear projection only to the ViT's summary token, and our summary head directly learns that 1024-d feature space. The projection isn't applied to the spatial features, which is why you see the mismatch.
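
As a minimal sketch (continuing the snippet above and reusing its variable names), the 1024-d clip_summary can be compared directly with clip_text_embeddings, while the 1280-d clip_features cannot without an extra projection. The normalization and dot product below are standard CLIP practice rather than anything RADIO-specific:

print(clip_summary.shape[-1])   # 1024, same dimension as clip_text_embeddings
print(clip_features.shape[1])   # 1280 (NCHW), the ViT-H internal dimension

# Image-level zero-shot scoring with the summary token:
img_emb = F.normalize(clip_summary, dim=-1)
txt_emb = F.normalize(clip_text_embeddings, dim=-1)
scores = img_emb @ txt_emb.T    # cosine similarities against ['metal', 'wood']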

Is the dense clip_features well-aligned with the CLIP space, and can I use it for open-vocabulary tasks?

I am aware of one work under review at CVPR that uses the semantic alignment for 3D open-vocabulary segmentation. A different model to consider would be the recently proposed dino.txt, as it explicitly aligns the text and spatial features.
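
For completeness, here is a conceptual sketch of per-patch open-vocabulary labeling once you have patch features that live in the same 1024-d space as the text embeddings (for example via the approach linked in the comment below). aligned_patch_features is a hypothetical (1, 1024, H, W) tensor, not something the model returns directly:

# Hypothetical: aligned_patch_features is (1, 1024, H, W) and already projected
# into the CLIP text space; clip_text_embeddings is (num_prompts, 1024).
B, D, H, W = aligned_patch_features.shape
patches = aligned_patch_features.flatten(2).transpose(1, 2)   # (1, H*W, 1024)
patches = F.normalize(patches, dim=-1)
texts = F.normalize(clip_text_embeddings, dim=-1)
sims = patches @ texts.T                                      # (1, H*W, num_prompts)
seg_map = sims.argmax(dim=-1).reshape(B, H, W)                # per-patch label indices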

@OasisArtisan

OasisArtisan commented Apr 10, 2025

This #81 (comment) might also be what you want. It's a method to get dense language-aligned features out of RADIO at no additional computational cost.
