Hi, this is really exciting work, and thank you for providing such a comprehensive GitHub repo!
I'm currently trying to use RADIO to extract dense (patch-level) CLIP features from an image for open-vocabulary tasks. My code looks like this:
```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

model_version = "radio_v2.5-g"
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version=model_version,
                       adaptor_names='clip', progress=True, skip_validation=True)
model.eval()

image_path = 'DSCF5857.JPG'
image = Image.open(image_path).convert('RGB')
image = pil_to_tensor(image).to(dtype=torch.float32)
image.div_(255.0)  # RADIO expects input values between 0 and 1
image = image.unsqueeze(0)  # Add batch dimension

nearest_res = model.get_nearest_supported_resolution(*image.shape[-2:])
image = F.interpolate(image, nearest_res, mode='bilinear', align_corners=False)

with torch.no_grad():
    output = model(image, feature_fmt='NCHW')
    bb_summary, bb_features = output['backbone']
    clip_summary, clip_features = output['clip']

clip_adaptor = model.adaptors['clip']
text_inputs = ['metal', 'wood']
tokens = clip_adaptor.tokenizer(text_inputs)
clip_text_embeddings = clip_adaptor.encode_text(tokens)
```
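For reference, the shapes I observe are roughly as follows (H and W stand for the patch grid of my input resolution):

```python
print(clip_summary.shape)          # torch.Size([1, 1024])
print(clip_features.shape)         # torch.Size([1, 1280, H, W])
print(clip_text_embeddings.shape)  # torch.Size([2, 1024])
```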
I have the following questions:
- Why is the feature dimension of clip_summary 1024 (the same as clip_text_embeddings), but the feature dimension of clip_features is 1280?
- Are the dense clip_features well aligned with the CLIP text-embedding space, i.e. can I use them for open-vocabulary tasks? (A sketch of my intended usage is below.)
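For context, here is a minimal sketch of what I would like to do with the dense features, assuming they live in (or can be projected into) the same 1024-d space as the text embeddings; the helper below is my own illustration, not a RADIO API:

```python
# My own illustration of the intended open-vocabulary use, not a RADIO API.
# It assumes the dense features share the text embeddings' dimensionality,
# which is exactly what I'm unsure about (1280 vs. 1024).
def text_similarity_map(dense_features, text_embeddings):
    # dense_features: (B, D, H, W); text_embeddings: (T, D)
    B, D, H, W = dense_features.shape
    T = text_embeddings.shape[0]
    patches = dense_features.flatten(2).permute(0, 2, 1)  # (B, H*W, D)
    patches = F.normalize(patches, dim=-1)
    texts = F.normalize(text_embeddings, dim=-1)
    sim = patches @ texts.T                               # (B, H*W, T) cosine similarities
    return sim.permute(0, 2, 1).reshape(B, T, H, W)       # (B, T, H, W) similarity maps

# e.g. sim_maps = text_similarity_map(clip_features, clip_text_embeddings)
# (only valid if the feature and text-embedding dimensions actually match)
```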