Not able to reproduce the stage two models



During model loading, the checkpoint weights and the vocab size seem to be wrong.
Below is the code I used to generate this result; others have mentioned the same problem. I also tried CLIP and other models, and the results seem to be pretty bad when transferring to other datasets.

text: A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run. ~ prob: 0.6796
text: A man in a gray hat and coat walks through the snowy yard, carefully navigating around the trees. ~ prob: 0.0944
text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0754
text: A person stands on the snowy floor, pushing a sled loaded with blankets, preparing for a fun-filled ride. ~ prob: 0.0375
text: A playful dog slides down a snowy hill, wagging its tail with delight. ~ prob: 0.0288

Looking for your guidance.


```python
import numpy as np
import os
import io
import cv2
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # synchronous CUDA launches, easier to debug
import torch

from demo_config import (Config,
                    eval_dict_leaf)

from demo.utils import (retrieve_text,
                  _frame_from_video,
                  setup_internvideo2)

# Fix all random seeds so the run is repeatable
seed = 4491734
print("Seed:", seed)

np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# Read every frame of the demo video
video = cv2.VideoCapture('demo/example1.mp4')
frames = [x for x in _frame_from_video(video)]
text_candidates = ["A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
                   "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
                   "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
                   "A pet dog excitedly runs through the snowy yard, chasing a toy thrown by its owner.",
                   "A person stands on the snowy floor, pushing a sled loaded with blankets, preparing for a fun-filled ride.",
                   "A man in a gray hat and coat walks through the snowy yard, carefully navigating around the trees.",
                   "A playful dog slides down a snowy hill, wagging its tail with delight.",
                   "A person in a blue jacket walks their pet on a leash, enjoying a peaceful winter walk among the trees.",
                   "A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.",
                   "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery."]
#%%
config = Config.from_file('demo/internvideo2_stage2_config.py')
config = eval_dict_leaf(config)
#%%

# Left commented out; the trailing comma from my original line is dropped here,
# since it would have turned the path into a one-element tuple.
# config['pretrained_path'] = '/InternVideo/InternVideo2/multi_modality/weights/InternVideo2-stage2_1b-224p-f4.pt'
intern_model, tokenizer = setup_internvideo2(config)
#%%
# Rank the candidate captions against the video and keep the top 5
texts, probs = retrieve_text(frames, text_candidates, model=intern_model.eval(), topk=5, config=config)

for t, p in zip(texts, probs):
    print(f'text: {t} ~ prob: {p:.4f}')
```
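
To narrow down the weight and vocab size mismatch mentioned above, a check along these lines might help. This is only a minimal sketch, not something from the InternVideo2 repo: `diff_shapes` is a hypothetical helper, and the parameter names and shapes in the example are made up for illustration. It compares the tensor shapes stored in a checkpoint with the shapes the model expects and lists what is missing, unexpected, or differently sized.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_shapes(ckpt_shapes: Dict[str, Shape],
                model_shapes: Dict[str, Shape]) -> Dict[str, List[str]]:
    """Compare parameter names and shapes of a checkpoint against a model.

    Hypothetical helper for illustration; both arguments map a parameter
    name to its shape as a plain tuple.
    """
    report: Dict[str, List[str]] = {
        "missing_in_ckpt": [],     # model expects it, checkpoint lacks it
        "unexpected_in_ckpt": [],  # checkpoint has it, model does not
        "shape_mismatch": [],      # present in both, but sizes differ
    }
    for name in sorted(set(ckpt_shapes) | set(model_shapes)):
        if name not in ckpt_shapes:
            report["missing_in_ckpt"].append(name)
        elif name not in model_shapes:
            report["unexpected_in_ckpt"].append(name)
        elif ckpt_shapes[name] != model_shapes[name]:
            report["shape_mismatch"].append(
                f"{name}: ckpt {ckpt_shapes[name]} vs model {model_shapes[name]}"
            )
    return report


# Made-up names and shapes, only to show what a vocab size mismatch looks like.
ckpt = {"embed.weight": (30522, 768), "proj.weight": (512, 768)}
model = {"embed.weight": (21128, 768), "proj.weight": (512, 768), "temp": ()}

report = diff_shapes(ckpt, model)
for kind, entries in report.items():
    print(kind, entries)
```

With PyTorch, the shape dictionaries could be built with `{k: tuple(v.shape) for k, v in intern_model.state_dict().items()}` and the same comprehension over the dictionary returned by `torch.load(path, map_location="cpu")`. I am assuming the checkpoint is a flat state dict; if the weights are nested under another key, that level would need to be unwrapped first. A differing first dimension on the text embedding weight would point at a tokenizer or vocab size mismatch rather than a broken checkpoint.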

Not able to reproduce the stage two models #304
