
Questions About Dense CLIP Features Alignment and Dimensionality in RADIO for Open-Vocabulary Tasks #128


Description

@Chuan-10

Hi, this is really exciting work, and thank you for providing such a comprehensive GitHub repo!

I'm currently trying to use RADIO to extract dense (patch-level) CLIP features from an image for open-vocabulary tasks. My code looks like this:

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

model_version = "radio_v2.5-g"
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version=model_version, adaptor_names='clip', progress=True, skip_validation=True)

model.eval()

image_path = 'DSCF5857.JPG'
image = Image.open(image_path).convert('RGB')

image = pil_to_tensor(image).to(dtype=torch.float32)
image.div_(255.0)  # RADIO expects input values to be between 0 and 1
image = image.unsqueeze(0)  # Add batch dimension
nearest_res = model.get_nearest_supported_resolution(*image.shape[-2:])
image = F.interpolate(image, nearest_res, mode='bilinear', align_corners=False)

with torch.no_grad():
    output = model(image, feature_fmt='NCHW')
    bb_summary, bb_features = output['backbone']
    clip_summary, clip_features = output['clip']
    clip_adaptor = model.adaptors['clip']
    text_inputs = ['metal', 'wood']
    tokens = clip_adaptor.tokenizer(text_inputs)
    clip_text_embeddings = clip_adaptor.encode_text(tokens)
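
For reference, here is the quick shape check that prompted my questions below. The spatial size depends on the resolution returned by get_nearest_supported_resolution, so I only note the channel dimensions I observe with radio_v2.5-g:

print(clip_text_embeddings.shape)  # (2, 1024)  -- one embedding per text prompt
print(clip_summary.shape)          # (1, 1024)  -- matches the text embedding dimension
print(clip_features.shape)         # (1, 1280, H_p, W_p)  -- channel dim is 1280, not 1024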

I have the following questions:

  1. Why is the feature dimension of clip_summary 1024 (the same as clip_text_embeddings), while the feature dimension of clip_features is 1280?
  2. Are the dense clip_features well aligned with the CLIP text-embedding space, and can I use them directly for open-vocabulary tasks? (See the sketch below for what I have in mind.)
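
For context, this is roughly what I would like to do with the dense features, assuming they could be brought into the same 1024-dim space as the text embeddings (that mapping is exactly what I'm unsure about). patch_text_similarity below is just an illustrative helper of mine, not part of the RADIO API:

import torch
import torch.nn.functional as F

def patch_text_similarity(patch_feats, text_embeds):
    # patch_feats: (B, C, H_p, W_p) dense features, hypothetically already in the
    #              same C-dim space as the text embeddings.
    # text_embeds: (T, C) text embeddings from clip_adaptor.encode_text.
    # Returns:     (B, T, H_p, W_p) cosine similarity of every patch with every prompt.
    patch_feats = F.normalize(patch_feats, dim=1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return torch.einsum('bchw,tc->bthw', patch_feats, text_embeds)

# sim = patch_text_similarity(clip_features_in_text_space, clip_text_embeddings)
# labels = sim.argmax(dim=1)  # best-matching prompt per patch

If the answer to question 2 is yes, I would apply this directly to clip_features; otherwise I assume I would first need to project the 1280-dim dense features into the 1024-dim CLIP space, and I'd appreciate a pointer to the right way to do that.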
