a better solution for mismatch of speech feat len and speech token len when training #1232

Open
boji123 wants to merge 2 commits into main from bj_dev_feat_len_pad

Conversation

Contributor

@boji123 boji123 commented Apr 24, 2025

refer to #1051
@aluminumbox

@boji123 boji123 changed the title from "a better solution for mismatch of speech feat len and speech token len" to "a better solution for mismatch of speech feat len and speech token len when training" Apr 24, 2025
@aluminumbox
Collaborator

Has this been validated? Does it train better than dropping the last frame?

@boji123
Contributor Author

boji123 commented Apr 24, 2025

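Not compared yet. The main issue is with deyi's version: it force-trims the whole batch, so some of the longer samples get trimmed by more than one token, which is very inelegant.
Also, the original flow adds an interpolate just to align the lengths, which artificially introduces frame-level error; that is inelegant too.
Mine works directly on each individual sample, so there are only two options, pad to longest or trim to shortest. I'll compare the two on my side.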
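To make the two per-sample options concrete, here is a minimal sketch in plain Python. It assumes a fixed ratio of 2 mel frames per speech token (the 2x relationship mentioned later in this thread). The function `align_sample`, its `mode` argument, and the use of lists in place of tensors are illustrative assumptions, not code from this PR. Padding by repeating the last frame or token is only one possible choice.

```python
def align_sample(feat, token, ratio=2, mode="trim"):
    """Align ONE sample so that len(feat) == ratio * len(token).

    feat  : list of mel frames (hypothetical stand-in for a [T, n_mels] tensor)
    token : list of speech tokens; assumed non-empty
    mode  : "trim" -> trim to the shorter side, "pad" -> pad to the longer side
    """
    if mode == "trim":
        # Largest token count that both sequences can fully cover.
        n = min(len(token), len(feat) // ratio)
    elif mode == "pad":
        # Smallest token count that covers both sequences (ceil division).
        n = max(len(token), -(-len(feat) // ratio))
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Slice first, then pad with the last element if still too short.
    # A negative repeat count yields an empty list, so trimming adds nothing.
    token = list(token[:n]) + [token[-1]] * (n - len(token))
    feat = list(feat[:n * ratio]) + [feat[-1]] * (n * ratio - len(feat))
    return feat, token


# 7 mel frames against 3 tokens: one frame too many for a 2x ratio.
frames, tokens = list(range(7)), [10, 11, 12]
trim_feat, trim_tok = align_sample(frames, tokens, mode="trim")
pad_feat, pad_tok = align_sample(frames, tokens, mode="pad")
print(len(trim_feat), len(trim_tok))  # 6 3
print(len(pad_feat), len(pad_tok))    # 8 4
```

Because the alignment is computed per sample, no sample loses more than the few frames or tokens needed to restore its own ratio.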

@aluminumbox
Collaborator

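Not compared yet. The main issue is with deyi's version: it force-trims the whole batch, so some of the longer samples get trimmed by more than one token, which is very inelegant. Also, the original flow adds an interpolate just to align the lengths, which artificially introduces frame-level error; that is inelegant too. Mine works directly on each individual sample, so there are only two options, pad to longest or trim to shortest. I'll compare the two on my side.

In training we have always dropped one frame by rounding down, and inference does the same. So far we have not seen this affect quality much. Duplicating a frame probably has little effect either, so my feeling is that the two approaches hardly differ.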

@boji123
Contributor Author

boji123 commented Apr 25, 2025

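During flow training, an interpolate is used to force-resample the features into alignment, and I don't see any strategy that drops a frame? I only see the prompt being aligned at inference time.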

@aluminumbox
Collaborator

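During flow training, an interpolate is used to force-resample the features into alignment, and I don't see any strategy that drops a frame?

Actually, this interpolate can be removed, because the mel extraction step already guarantees the 2x relationship.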

@aluminumbox
Collaborator

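During flow training, an interpolate is used to force-resample the features into alignment, and I don't see any strategy that drops a frame?

Actually, this interpolate can be removed, because the mel extraction step already guarantees the 2x relationship.

Oh, I forgot to add that. I had planned to put the truncation code in compute_fbank and then forgot. Could you change this PR to truncate one frame?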

@aluminumbox
Collaborator

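During flow training, an interpolate is used to force-resample the features into alignment, and I don't see any strategy that drops a frame?

Actually, this interpolate can be removed, because the mel extraction step already guarantees the 2x relationship.

Oh, I forgot to add that. I had planned to put the truncation code in compute_fbank and then forgot. Could you change this PR to truncate one frame?

The earlier design was to pass in a token2mel ratio parameter and force truncation based on it. I must have lost track of it later and never added it.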
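The design described above, a ratio parameter that drives a forced truncation inside the feature pipeline, could look roughly like the sketch below. This is an assumption-laden illustration, not the code in this PR: the generator name `truncate_to_ratio`, the parameter name `token_mel_ratio`, the sample keys `speech_feat` and `speech_token`, and the convention that a ratio of 0 disables truncation are all hypothetical, and plain lists stand in for tensors.

```python
def truncate_to_ratio(samples, token_mel_ratio=0):
    """Yield samples whose features satisfy
    len(speech_feat) == token_mel_ratio * len(speech_token).

    samples         : iterable of dicts with "speech_feat" and "speech_token"
                      (hypothetical keys; lists stand in for tensors)
    token_mel_ratio : mel frames per speech token; 0 means "do not truncate"
    """
    for sample in samples:
        if token_mel_ratio != 0:
            feat = sample["speech_feat"]
            token = sample["speech_token"]
            # Number of tokens fully covered by the available mel frames.
            token_len = min(len(feat) // token_mel_ratio, len(token))
            # Build a new dict so the caller's sample is not modified.
            sample = dict(
                sample,
                speech_feat=feat[:token_mel_ratio * token_len],
                speech_token=token[:token_len],
            )
        yield sample


samples = [
    {"speech_feat": list(range(7)), "speech_token": [1, 2, 3]},  # one extra frame
    {"speech_feat": list(range(4)), "speech_token": [1, 2, 3]},  # one extra token
]
for s in truncate_to_ratio(samples, token_mel_ratio=2):
    print(len(s["speech_feat"]), len(s["speech_token"]))
```

Doing the truncation per sample at feature-extraction time means the exact ratio already holds when the flow model sees the data, so no later interpolation is needed to reconcile the lengths.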

@boji123
Contributor Author

boji123 commented Apr 25, 2025

OK, changed.

@boji123 boji123 force-pushed the bj_dev_feat_len_pad branch from f6a8c80 to 74fea6a on April 25, 2025 02:36
@boji123 boji123 force-pushed the bj_dev_feat_len_pad branch from 74fea6a to 038ff9f on April 25, 2025 02:39
2 participants