GPT‐SoVITS‐features (Features of Each Version)
| Version | Language Support (cross-language synthesis) | GPT Training Data Duration | SoVITS Training Data Duration | Inference Speed | Parameters | Features |
|---|---|---|---|---|---|---|
| v1 | Chinese, Japanese, English | ~2k hours | ~2k hours | baseline | 90M+77M | baseline |
| v2 | Chinese, Japanese, English, Korean, Cantonese | ~2.5k hours | vq encoder ~2k hours (frozen from v1), 5k hours in total | doubled (vs v1) | 90M+77M | Adds speed control, a reference-free mode, and better mixed-language text splitting |
| v3 | same as v2 | ~7k hours | vq encoder ~2k hours (frozen from v1), 7k hours in total | ~same as v2 | 330M+77M | Large gain in zero-shot similarity; improved emotional expression and fine-tuning performance |
| v4 | same as v3 | same as v3 | same as v3 | same as v3 | same as v3 | Fixes the metallic artifacts that v3's non-integer upsampling could cause; natively outputs 48 kHz audio to avoid muffled sound |
| v2Pro | same as v2 | same as v3 | same as v3 | ~same as v2 | 133M+77M | Large gain in zero-shot similarity; improved emotional expression and fine-tuning performance; slightly higher VRAM usage than v2; exceeds v4's quality at v2's hardware cost and speed |
| v2ProPlus | same as v2 | same as v3 | same as v3 | ~same as v2 | 152M+77M | Large gain in zero-shot similarity; improved emotional expression and fine-tuning performance; slightly higher VRAM usage than v2Pro; exceeds v4's quality at v2's hardware cost and speed |
(1) Judging by the benchmark scores, there is no longer any reason to use v3/v4: the v2Pro series has the same hardware requirements as v2 while reaching the same level of zero-shot similarity as v3/v4.
(2) v1/v2 and the v2Pro series share one set of characteristics, and v3/v4 share another. On training sets with mediocre audio quality, v1/v2/v2Pro can still produce decent results, but v3/v4 cannot; and the tone and timbre synthesized by v3/v4 follow the reference audio more than the overall training set.
(1) Timbre similarity is higher, so less training data is needed to approximate the target speaker (the similarity gain is even larger when using the base model directly, without fine-tuning).
(2) GPT synthesis is more stable, with fewer repetitions and dropped words (per test-set metrics), and it is easier to obtain rich emotional expression.
(3) More faithful to the reference audio than v2. In fine-tuning scenarios, v2 is influenced more by the overall average of the training set, with some guidance from the reference audio.
If your training set is of rather poor quality, the v2 (VITS) version, being "more influenced by the overall average of the training set", may suit you better.
(4) v4 fixes the metallic artifacts that v3's non-integer upsampling could cause, and natively outputs 48 kHz audio to avoid muffled sound (v3 natively outputs only 24 kHz). The author considers v4 a drop-in replacement for v3, though further testing is still needed.
| | WER | SIM |
|---|---|---|
| v1 | 0.025 | 0.526 |
| v2 | 0.017 | 0.549 |
| v3 (8 steps) | 0.014 | 0.702 |
| v4 (8 steps) | 0.013 | 0.735 |
| v2Pro | 0.016 | 0.709 |
| v2ProPlus | 0.016 | 0.737 |
| GT | 0.013 | 0.750 |
What is this?
The Chinese test-set benchmark from the SeedTTS paper published by ByteDance's Seed (Doubao) team.
How did I test it?
Using the official similarity model and ASR model provided in https://github.com/BytedanceSpeech/seed-tts-eval.
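To make the two numbers in the table concrete, here is a minimal, hypothetical sketch of how such a WER/SIM evaluation loop is typically structured. It is not the seed-tts-eval code: the `asr.transcribe` / `sv.embed` calls and the word-level tokenization (Chinese is usually scored per character) are illustrative assumptions.

```python
# Illustrative only: WER = normalized edit distance over ASR transcripts,
# SIM = cosine similarity between speaker-verification embeddings.
import numpy as np
import editdistance  # pip install editdistance

def wer(ref_text: str, hyp_text: str) -> float:
    # Word-level here for readability; Chinese is typically scored per character.
    ref, hyp = ref_text.split(), hyp_text.split()
    return editdistance.eval(ref, hyp) / max(len(ref), 1)

def cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate(pairs, asr, sv):
    """pairs: iterable of (target_text, synthesized_wav, reference_wav).
    `asr` and `sv` stand in for the ASR and speaker-verification models."""
    wers, sims = [], []
    for text, syn_wav, ref_wav in pairs:
        wers.append(wer(text, asr.transcribe(syn_wav)))
        sims.append(cosine(sv.embed(syn_wav), sv.embed(ref_wav)))
    return sum(wers) / len(wers), sum(sims) / len(sims)
```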
What is this benchmark good for?
It measures how well the synthesized pronunciation matches the target pronunciation (WER) and the timbre similarity (SIM). It cannot measure naturalness, emotional richness, or audio quality; the first two require human scoring. The timbre coverage of the test set is also limited. The scores are for reference only: if you need reliable conclusions, fine-tune on a training set from your own real scenario and evaluate on your own test set. Scores in papers, industry intel, and media hype are all fluff; always trust your own test results as the only real truth.
The GT row and the term "WER" are taken from Table 10 of the SeedTTS paper (https://arxiv.org/pdf/2406.02430v1). GT refers to the real recorded speech of the original speakers in the test set.
What is GT for?
GT reflects the measurement precision of the benchmark. WER is based on automatic speech recognition (ASR) output, and the ASR model has errors of its own; since the source speaker does not stutter, GT's WER is the optimum, and a WER below GT's says nothing about TTS stability. SIM is based on speaker-verification (SV) model output; the source speaker talks normally and continuously and is, of course, the voice most similar to themselves, so a SIM above GT's is likewise meaningless for judging TTS timbre similarity.
Therefore, when reading this benchmark for comparison, a SIM above 0.750 can simply be truncated to 0.750, and a WER below 0.013 can be truncated to 0.013. If someone claims that a model reaching a WER of 0.007–0.008 is therefore more stable than a model at 0.013, they are misleading you; the same applies to similarity.
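A small sketch of applying this reading rule before comparing scores (the 0.013 / 0.750 bounds are the GT values from the table above):

```python
def clamp_to_gt(wer: float, sim: float,
                gt_wer: float = 0.013, gt_sim: float = 0.750):
    """Truncate scores at the GT bounds: beating GT carries no extra
    information about TTS stability or timbre similarity."""
    return max(wer, gt_wer), min(sim, gt_sim)

# A claimed WER of 0.008 reads as 0.013 for comparison purposes:
print(clamp_to_gt(0.008, 0.76))  # -> (0.013, 0.75)
```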
Why can the WER of synthesized speech be lower than GT's WER? It means the TTS model's training set contains a relatively high proportion of data similar to the training set of the ASR model used to measure WER (ModelScope Paraformer-large for Chinese, Whisper-large for English). Two situations can lead to this:
(1) The TTS output has particularly flat emotion and intonation, even flatter than GT, so the ASR model has an easier time recognizing it;
(2) The TTS model's training set is extremely large, so it is naturally more likely to overlap with data similar to subsets of the ASR model's training data.
But such a WER "improvement" (pushing WER further down when it is already below GT's) says nothing about whether the TTS has actually become more stable.
Why can the SIM of synthesized speech be higher than GT's SIM? SIM itself is not exactly the same thing as timbre similarity. When the same speaker says two sentences back to back, the human ear immediately hears the same person, yet the model can measure tiny differences between the two sentences; those tiny differences do not indicate which sentence is better.
The SeedTTS Chinese test set as synthesized by the different GPT-SoVITS versions is available on Baidu Netdisk:
https://pan.baidu.com/s/1Fd5xjVzVa2LhI-b-FSxo8w?pwd=yp6g (extraction code: yp6g)
1. Technical changes from v1 to v2 (v2 vs v1)
(1) Slightly more training data
(2) Two additional languages in the text frontend
(3) The timbre encoder discards the highest frequency band
All later versions inherit these changes.
2. Technical changes from v1/v2 to v3/v4 (v3v4 vs v1v2)
(1) The training set is expanded to 7k hours (MOS-based audio-quality filtering, punctuation/pause verification).
Only the 7k-hour curated training set is used; the larger possibilities are left to the audience's imagination~
(2) The s2 architecture is changed to shortcut Conditional Flow Matching Diffusion Transformers (shortcut-CFM-DiT).
Since s2 accounts for only a small share of the overall latency, making s2 more complex has little effect on total inference time.
Best audio quality: 32 sampling steps.
Fast: 4/8 steps (this setting shows no noticeable flaws in zero-shot use; fine-tuning on a small number of samples may require more steps). A toy sampler sketch at the end of this item illustrates why the step count is the speed/quality knob.
The change in the s2 principle (diffusion-based completion conditioned on the reference audio) is what drives the large jump in timbre similarity.
Since synthesis is not end-to-end, v3 uses the open-source 24 kHz BigVGANv2 weights to turn mel spectrograms into waveforms.
The cost is that using an open-source vocoder means following its hop parameters, which forces non-integer upsampling when adapting to the SSL hop size; with low sampling-step counts and heavy fine-tuning on small datasets (under 100 hours), this can produce metallic artifacts. v4 therefore ships a vocoder trained by the author, which also raises the native output sample rate from v3's 24 kHz to 48 kHz, so no post-hoc super-resolution network is needed to avoid muffled sound.
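To show why the sampling-step count mentioned above is the main speed/quality knob of a flow-matching s2, here is a toy Euler sampler over a hypothetical velocity network `v_theta`. It is only a sketch of the generic CFM sampling pattern, not the shortcut-CFM-DiT code in this repository.

```python
import torch

def cfm_sample(v_theta, cond, shape, steps: int = 8):
    """Toy Euler sampler for conditional flow matching.

    v_theta(x, t, cond) -> predicted velocity; `cond` would carry the
    reference-audio / semantic conditioning. Each extra step is one more
    forward pass through the transformer, which is why 32 steps sounds a
    bit cleaner while 4/8 steps is several times faster.
    """
    x = torch.randn(shape)                 # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full(shape[:1], i * dt)  # current time in [0, 1)
        x = x + v_theta(x, t, cond) * dt   # Euler update along the learned flow
    return x                               # e.g. predicted mel frames
```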
(3) The s1 architecture is unchanged; its parameters were updated.
3. Technical changes from v1/v2 to the v2Pro series (v2Pro/v2ProPlus vs v2)
(1) s2 adds SV speaker-embedding guidance (a toy sketch of this conditioning pattern follows the list);
(2) The speaker embedding fed into the s2 model is increased to 1024 channels;
(3) v2ProPlus increases the width of the convolutional layers in the s2 decoder.
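A minimal sketch of what the speaker-embedding conditioning in (1)-(2) can look like, assuming a generic convolutional decoder block; the class and layer names are hypothetical and are not the actual v2Pro code.

```python
import torch
import torch.nn as nn

class SpeakerConditionedBlock(nn.Module):
    """Injects a 1024-dim speaker-verification (SV) embedding into a decoder
    block by projecting it to the hidden width and adding it to every frame."""
    def __init__(self, hidden: int = 512, spk_dim: int = 1024):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, hidden)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden, frames); spk_emb: (batch, 1024)
        x = x + self.spk_proj(spk_emb).unsqueeze(-1)  # broadcast over frames
        return torch.relu(self.conv(x))
```

In this picture, v2ProPlus's change in (3) would correspond to a larger `hidden` width in the decoder.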