Open
Description
In video_inference.py, video and audio are handled inconsistently for videos longer than 150 seconds: the two paths compute the duration differently.
With a video of length t seconds (where t is a fractional value slightly above 150), the following happens:
- The video sampling logic starts with 150 frames.
- The audio duration is computed as 151.
- Then, at line 129, int() rounds num_frames down to 149.
- This mismatch ultimately causes a size error when padding the audio_embeds tensor.
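The steps above come down to two code paths rounding the same fractional duration in different directions. Here is a minimal sketch of that effect. The values of vlen and input_fps are made up for illustration, and this is not the actual code from video_inference.py:

```python
import math

# Hypothetical values: 4512 frames at 30 fps is a duration just above 150 s.
vlen = 4512
input_fps = 30

duration = vlen / input_fps        # fractional duration in seconds

# One path truncates, the other rounds up, so they disagree by one.
rounded_down = int(duration)       # truncates toward zero
rounded_up = math.ceil(duration)   # rounds up to the next integer

print(duration, rounded_down, rounded_up)
```

Any tensor sized from one of these values and padded to the other would be off by one element, which matches the kind of size error described above.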
My current workaround is to modify line 47 from:
duration = vlen / input_fps
to
duration = math.ceil(vlen / input_fps)
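As a rough illustration of why this workaround may help: once the duration is rounded up a single time, any later rounding of it, up or down, gives the same integer. The helper name rounded_duration and the input values below are hypothetical, and the sketch does not show that every downstream computation in video_inference.py stays consistent.

```python
import math


def rounded_duration(vlen: int, input_fps: float) -> int:
    """Hypothetical helper mirroring the workaround: round the duration up once."""
    return math.ceil(vlen / input_fps)


# Made-up inputs: 4512 frames at 30 fps, a duration slightly above 150 s.
duration = rounded_duration(4512, 30)

# With an integer duration, truncating and rounding up can no longer disagree.
same_after_rounding = int(duration) == math.ceil(duration)
print(duration, same_after_rounding)
```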
Maybe there's a better way to fix the bug? Looking forward to your reply.