a better solution for mismatch of speech feat len and speech token len when training #1232
base: main
Conversation
Has this been validated? Does it train better than just dropping the last frame?
Not compared yet. The main motivation is deyi's version: he force-trims the whole batch, which causes some of the longer samples to be trimmed by more than one token, and that is very inelegant.
In training we have always dropped a frame by flooring, and we do the same at inference; so far we haven't seen this noticeably affect quality. Of course, duplicating a frame probably has no impact either. My feeling is there is little difference between the two approaches.
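For reference, a minimal sketch of the floor-based alignment described above, assuming per-utterance tensors; the helper name `align_feat_to_token` is illustrative, not the repo's actual API:

```python
import torch

def align_feat_to_token(speech_feat: torch.Tensor,
                        speech_token: torch.Tensor,
                        token_mel_ratio: int = 2):
    # speech_feat: (T_mel, n_mels); speech_token: (T_token,)
    # Floor to the largest token count both sequences can cover,
    # so at most token_mel_ratio - 1 trailing mel frames are dropped.
    token_len = min(speech_token.shape[0],
                    speech_feat.shape[0] // token_mel_ratio)
    return (speech_feat[:token_len * token_mel_ratio],
            speech_token[:token_len])
```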
During flow training, interpolate is used to force-resample for alignment; I didn't see a drop-one-frame strategy? I only saw the strategy of aligning the prompt at inference.
Actually that interpolate can be removed, since the 2x relationship is already guaranteed when the mel is extracted.
Oh, I forgot to add this. I had planned to add the truncation code in compute_fbank but forgot. Please change this PR to truncate one frame instead.
The earlier design was to pass a token2mel ratio parameter and force-truncate according to it; I must have forgotten to add it later. A sketch of that design follows below.
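A minimal sketch of that design, assuming the truncation sits at the end of a compute_fbank-style feature extractor; the `truncate_feat` helper and `token_mel_ratio` parameter names are assumptions, not the repo's actual code:

```python
import torch

def truncate_feat(feat: torch.Tensor, token_mel_ratio: int = 2) -> torch.Tensor:
    # feat: (T_mel, n_mels) log-mel features produced by compute_fbank.
    # Force-truncate T_mel to an exact multiple of token_mel_ratio,
    # dropping at most token_mel_ratio - 1 trailing frames.
    usable = (feat.shape[0] // token_mel_ratio) * token_mel_ratio
    return feat[:usable]
```

With a 2x token-to-mel ratio this drops at most one frame per utterance, avoiding the batch-wide trim that could cut more than one token from longer samples.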
OK, updated.
Force-pushed from f6a8c80 to 74fea6a
Force-pushed from 74fea6a to 038ff9f
Refer to #1051
@aluminumbox