
How exactly do I use V4 in api.py or api_v2.py? #2306

Open

yangyuke001 opened this issue Apr 21, 2025 · 15 comments

Comments

@yangyuke001

Thanks for open-sourcing such a great project [flowers]! How exactly do you use the v4 version through the API?


dignome commented Apr 22, 2025

The information commented at the top of api_v2.py is valid. GPT_SoVITS/configs/tts_infer.yaml contains the last configuration used when running the webui inference (from webui.py or from inference_webui_fast.py). So if you last ran webui inference with a v4 model it should be ready to go in tts_infer.yaml.

WebAPI documentation

python api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml

Launch arguments:

-a - bind address, default "127.0.0.1"
-p - bind port, default 9880
-c - path to the TTS config file, default "GPT_SoVITS/configs/tts_infer.yaml"

Endpoints:

Inference

endpoint: /tts
GET:

http://127.0.0.1:9880/tts?text=先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。&text_lang=zh&ref_audio_path=archive_jingyuan_1.wav&prompt_lang=zh&prompt_text=我是「罗浮」云骑将军景元。不必拘谨,「将军」只是一时的身份,你称呼我景元便可&text_split_method=cut5&batch_size=1&media_type=wav&streaming_mode=true

POST:
```json
{
    "text": "",                   # str. (required) text to be synthesized
    "text_lang": "",              # str. (required) language of the text to be synthesized
    "ref_audio_path": "",         # str. (required) reference audio path
    "aux_ref_audio_paths": [],    # list. (optional) auxiliary reference audio paths for multi-speaker tone fusion
    "prompt_text": "",            # str. (optional) prompt text for the reference audio
    "prompt_lang": "",            # str. (required) language of the prompt text for the reference audio
    "top_k": 5,                   # int. top-k sampling
    "top_p": 1,                   # float. top-p sampling
    "temperature": 1,             # float. temperature for sampling
    "text_split_method": "cut0",  # str. text split method, see text_segmentation_method.py for details
    "batch_size": 1,              # int. batch size for inference
    "batch_threshold": 0.75,      # float. threshold for batch splitting
    "split_bucket": True,         # bool. whether to split the batch into multiple buckets
    "speed_factor": 1.0,          # float. controls the speed of the synthesized audio
    "streaming_mode": False,      # bool. whether to return a streaming response
    "seed": -1,                   # int. random seed for reproducibility
    "parallel_infer": True,       # bool. whether to use parallel inference
    "repetition_penalty": 1.35,   # float. repetition penalty for the T2S model
    "sample_steps": 32,           # int. number of sampling steps for VITS model V3
    "super_sampling": False       # bool. whether to use audio super-sampling for VITS model V3
}
```

RESP:
Success: returns the wav audio stream directly, HTTP code 200
Failure: returns JSON containing the error message, HTTP code 400

Command control

endpoint: /control

command:
"restart": restart the service
"exit": shut down

GET:

http://127.0.0.1:9880/control?command=restart

POST:
```json
{
    "command": "restart"
}
```

RESP: none

Switch GPT model

endpoint: /set_gpt_weights

GET:

http://127.0.0.1:9880/set_gpt_weights?weights_path=GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt

RESP:
Success: returns "success", HTTP code 200
Failure: returns JSON containing the error message, HTTP code 400

Switch SoVITS model

endpoint: /set_sovits_weights

GET:

http://127.0.0.1:9880/set_sovits_weights?weights_path=GPT_SoVITS/pretrained_models/s2G488k.pth

RESP:
Success: returns "success", HTTP code 200
Failure: returns JSON containing the error message, HTTP code 400
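For reference, a minimal Python client sketch for the POST form of /tts documented above; the reference audio path and prompt text are placeholders, substitute files and text known to the server:

```python
import requests

# Minimal sketch of a POST call to the /tts endpoint documented above.
# ref_audio_path and prompt_text are placeholders; substitute files and
# text that exist on the server side.
payload = {
    "text": "先帝创业未半而中道崩殂。",
    "text_lang": "zh",
    "ref_audio_path": "archive_jingyuan_1.wav",
    "prompt_text": "我是「罗浮」云骑将军景元。",
    "prompt_lang": "zh",
    "text_split_method": "cut5",
    "batch_size": 1,
    "media_type": "wav",
    "streaming_mode": False,
}

resp = requests.post("http://127.0.0.1:9880/tts", json=payload)
if resp.status_code == 200:
    # success: the endpoint returns the wav bytes directly
    with open("output.wav", "wb") as f:
        f.write(resp.content)
else:
    # failure: error details come back as JSON with HTTP 400
    print(resp.status_code, resp.json())
```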

@yangyuke001
Copy link
Author

@dignome Thanks for the reply. I can see that tts_infer.yaml has already been updated to the v4 model, but the audio synthesized by calling api_v2.py directly sounds very strange, as if the models are mismatched. My tts_infer.yaml is as follows:
```yaml
custom:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cuda
  is_half: true
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
v1:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
  version: v1
  vits_weights_path: GPT_SoVITS/pretrained_models/s2G488k.pth
v2:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s1bert25hz-5kh-longer-epoch=12-step=369668.ckpt
  version: v2
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s2G2333k.pth
v3:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v3
  vits_weights_path: GPT_SoVITS/pretrained_models/s2Gv3.pth
v4:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
```


dignome commented Apr 23, 2025

If there really is a difference, you could most likely show it by setting a fixed/static seed value and matching the other parameters across both api_v2.py and inference_webui_fast.py; they should produce similar results.

For best speaker reproduction you should fine-tune a v4 model on a dataset containing at least 10 minutes of audio from that speaker using webui.py, then make sure those models are referenced in the config you pass via api_v2.py -c <path/to/your/config.yaml>.
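If you'd rather not restart the server after fine-tuning, the switch endpoints documented above can point a running instance at new weights. A sketch; both weight paths are placeholders for your own fine-tuned files:

```python
import requests

# Point a running api_v2.py instance at fine-tuned weights via the
# /set_gpt_weights and /set_sovits_weights endpoints documented above.
# Both weight paths are placeholders; substitute your own files.
base = "http://127.0.0.1:9880"
r1 = requests.get(f"{base}/set_gpt_weights",
                  params={"weights_path": "GPT_weights_v4/my_speaker.ckpt"})
r2 = requests.get(f"{base}/set_sovits_weights",
                  params={"weights_path": "SoVITS_weights_v4/my_speaker.pth"})
print(r1.status_code, r2.status_code)  # 200 with body "success" on success
```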


inktree commented Apr 23, 2025

It's normal that it sounds strange: although v4 shares v3's architecture, the sampling rates differ, so calling the API with the same parameters as v3 is guaranteed to cause problems. You can modify the relevant parts yourself.

The handling logic I've patched in locally is as follows:

```python
# --- V3 mel function definition ---
mel_fn = lambda x: mel_spectrogram_torch(
    x,
    **{
        "n_fft": 1024,
        "win_size": 1024,
        "hop_size": 256,
        "num_mels": 100,
        "sampling_rate": 24000,
        "fmin": 0,
        "fmax": None,
        "center": False,
    },
)

# --- Added: V4 mel function definition ---
mel_fn_v4 = lambda x: mel_spectrogram_torch(
    x,
    **{
        "n_fft": 1280,
        "win_size": 1280,
        "hop_size": 320,
        "num_mels": 100,
        "sampling_rate": 32000,  # V4 uses a 32 kHz mel
        "fmin": 0,
        "fmax": None,
        "center": False,
    },
)
```

And in the decode path:

```python
    elif version in {"v3", "v4"}:  # v3 or v4
        logger.info("Using the V3/V4 decode path (vq_model.decode_encp + CFM/vocoder)...")
        # --- V3/V4 decode logic ---
        # 1. Pick the target sampling rate and mel function
        if model_version == "v4":
            tgt_sr = 32000
            current_mel_fn = mel_fn_v4
            logger.info(f"V4 model: using {tgt_sr} Hz sampling rate and the V4 mel function.")
        else:  # V3
            tgt_sr = 24000
            current_mel_fn = mel_fn
            logger.info(f"V3 model: using {tgt_sr} Hz sampling rate and the V3 mel function.")
```

The remaining problem is that pyopenjtalk is broken, so Japanese requests still error out in the end. It's a headache.


xy3xy3 commented Apr 23, 2025

Is there a good solution for this yet?

@wangzai23333

It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio's sampling rate is probably wrong, so it blows up.


yangyuke001 commented Apr 23, 2025

@wangzai23333 @inktree Yes. I also tried modifying the relevant parts of tts_infer.yaml and api_v2.py myself, but never got good output, so I'd like to ask 花儿大佬 (the author) to ship a polished version of api_v2.py. :)


dignome commented Apr 24, 2025

So is your issue resolved? api_v2.py worked for you?


YunZLu commented Apr 24, 2025

> It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio's sampling rate is probably wrong, so it blows up.

The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py:

`version = configs.get("version", "v2").lower()`

In the current tts_infer.yaml, version is not at the root level, so this get misses it and falls back to the default "v2". The simplest workaround is to change the default here to v4, then file a bug.
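A minimal sketch of that workaround, assuming configs is the dict loaded from tts_infer.yaml and the custom section is the one actually in use:

```python
# Sketch of a local workaround in GPT_SoVITS/TTS_infer_pack/TTS.py
# (around line 290). "version" lives inside the per-model sections of
# tts_infer.yaml (e.g. "custom"), not at the root, so the root-level
# lookup always falls back to "v2". Prefer the section actually loaded:
custom_cfg = configs.get("custom", {})
version = custom_cfg.get("version", configs.get("version", "v2")).lower()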


lucasmen9527 commented Apr 25, 2025

Just run python api_v2.py; api_v2 contains a detailed description of the endpoints. Note: a model trained with v4 will start up under api.py, but calls fail with an error at `audio, _ = librosa.load(filename, int(hps.data.sampling_rate))`.


Run it with api_v2 instead: I edited GPT_SoVITS/configs/tts_infer.yaml to point at my custom model and it just works (see the yaml sketch after this comment).


Once the model loads successfully, check the request docs and send requests as usual; you can test with API tools such as Apipost, Apifox, or Postman.
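For illustration, a sketch of what the custom section of tts_infer.yaml might look like after fine-tuning a v4 model; the two weight paths are placeholders for your own fine-tuned files:

```yaml
custom:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cuda
  is_half: true
  version: v4
  # placeholders; point these at your fine-tuned v4 weights
  t2s_weights_path: GPT_weights_v4/my_speaker-e15.ckpt
  vits_weights_path: SoVITS_weights_v4/my_speaker_e8_s800.pth
```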

@yangyuke001
Author

> So is your issue resolved? api_v2.py worked for you?

not yet

@yangyuke001
Author

> It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio's sampling rate is probably wrong, so it blows up.

> The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py: `version = configs.get("version", "v2").lower()`. In the current tts_infer.yaml, version is not at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to change the default here to v4, then file a bug.

api_v2.py doesn't seem to let you change the sampling_rate; the audio still comes out very strange.


YunZLu commented Apr 25, 2025

> It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio's sampling rate is probably wrong, so it blows up.

> The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py: `version = configs.get("version", "v2").lower()`. In the current tts_infer.yaml, version is not at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to change the default here to v4, then file a bug.

> api_v2.py doesn't seem to let you change the sampling_rate; the audio still comes out very strange.

From what I can see, the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py already sets the v4 sampling rate to 32000, so if version is v4 the sampling rate should be correct, shouldn't it?

@yangyuke001
Author

> It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio's sampling rate is probably wrong, so it blows up.

> The problem is at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py: `version = configs.get("version", "v2").lower()`. In the current tts_infer.yaml, version is not at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to change the default here to v4, then file a bug.

> api_v2.py doesn't seem to let you change the sampling_rate; the audio still comes out very strange.

> From what I can see, the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py already sets the v4 sampling rate to 32000, so if version is v4 the sampling rate should be correct, shouldn't it?

But the author says the v4 sampling rate is 48k:

"(4) v4 fixes the electric-sounding artifacts that v3's non-integer-ratio upsampling could cause, and natively outputs 48k audio to avoid a muffled sound (whereas v3's native output is only 24k). The author considers v4 a drop-in replacement for v3; more testing is still needed."
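For what it's worth, the two figures need not conflict: in inktree's snippet above, 32000 is the sampling rate of the mel analysis (hop 320, i.e. 100 frames per second), while the 48k the author describes is the vocoder's output rate. A back-of-envelope sketch, assuming (not verified against the repo) that the v4 vocoder emits 480 output samples per mel frame:

```python
# Back-of-envelope check; the 480 samples-per-frame figure is an
# assumption, not taken from the repo.
mel_frames_per_sec = 32000 / 320   # v4 mel grid: 100 frames per second

# With an integer upsampling ratio of 480 output samples per mel frame
# (unlike v3's non-integer case), the native output rate is 48 kHz,
# matching the author's "natively outputs 48k" description.
samples_per_frame = 480
print(mel_frames_per_sec * samples_per_frame)  # 48000.0
```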
