A bug to fix in video_inference.py #19

In [`video_inference.py`](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/blob/master/video_inference.py), video and audio processing become inconsistent for videos longer than 150 seconds: the video path and the audio path compute the duration differently.

With a video of length `t` (where `t` is a fractional value slightly above 150), the following issue occurs:

- The video sampling logic starts with 150 frames.

- The audio duration is computed as 151.

- Then, at [line 129](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/blob/master/video_inference.py#L129), `int()` rounds down `num_frames` to 149.

- This mismatch ultimately causes a size error when padding the `audio_embeds` tensor.
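
The root cause is that the same fractional duration gets rounded in two different directions. Below is a minimal, hypothetical sketch, not the repository's code. `video_frame_count`, `audio_segment_count` and `pad_audio_to_frames` are illustrative names, and the sketch assumes one video frame and one audio segment per second. It does not reproduce the exact 150/151/149 values above. It only shows the floor-versus-ceil disagreement behind them.

```python
import math


def video_frame_count(duration: float) -> int:
    # Hypothetical video path: int() truncates the fractional duration.
    return int(duration)


def audio_segment_count(duration: float) -> int:
    # Hypothetical audio path: the trailing partial second gets its own segment (rounds up).
    return math.ceil(duration)


def pad_audio_to_frames(audio_embeds: list, num_frames: int) -> list:
    # Hypothetical stand-in for padding audio_embeds to one entry per video frame.
    pad = num_frames - len(audio_embeds)
    if pad < 0:
        raise ValueError(
            f"audio has {len(audio_embeds)} segments but video has only {num_frames} frames"
        )
    return audio_embeds + [0.0] * pad


duration = 150.4  # fractional length slightly above 150 s
frames = video_frame_count(duration)      # 150
segments = audio_segment_count(duration)  # 151
try:
    pad_audio_to_frames([0.0] * segments, frames)
except ValueError as err:
    print("size error:", err)
```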

My current workaround is to change line 47 from

`duration = vlen / input_fps`

to

`duration = math.ceil(vlen / input_fps)`
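
Under the same simplified assumptions, the workaround helps because it rounds `duration` up once, before either path sees it. Once `duration` is an integer, later calls to `int()` or `math.ceil()` leave it unchanged. The sketch below is hypothetical. `vlen` and `input_fps` mirror the names on line 47, and the example values (3761 frames at 25 fps) are made up.

```python
import math


def video_frame_count(duration: float) -> int:
    # Hypothetical video path: truncates with int().
    return int(duration)


def audio_segment_count(duration: float) -> int:
    # Hypothetical audio path: rounds up.
    return math.ceil(duration)


vlen, input_fps = 3761, 25.0  # made-up example: about 150.44 s of video

raw = vlen / input_fps               # original line 47: fractional seconds
fixed = math.ceil(vlen / input_fps)  # workaround: whole seconds, rounded up

print(video_frame_count(raw), audio_segment_count(raw))      # 150 151 (mismatch)
print(video_frame_count(fixed), audio_segment_count(fixed))  # 151 151 (consistent)
```

There is one trade-off worth checking. Rounding up can make `duration` overstate the clip by up to one second, so any code that derives frame timestamps from it may need to clamp them to the last real frame.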

Maybe there's a better way to fix the bug? Looking forward to your reply. 
