How to run Bert-VITS2 on Intel CPU, iGPU, dGPU #396

feng-intel · 2024-08-28T10:34:24Z


Tested platforms:

Intel MTL Ultra 7 155H, CPU
Intel MTL Ultra 7 155H, iGPU
Intel MTL Ultra 5 125, CPU
Intel MTL Ultra 5 125, iGPU
Discrete GPUs such as Arc A770, Flex xxx, and Max xxx should also work.

This guide focuses on running webui.py.

1. Steps

Get the code.

$ git clone https://github.com/fishaudio/Bert-VITS2
$ cd Bert-VITS2
$ git checkout Extra-v2 -b Extra-v2

Create a conda environment and install IPEX (Intel Extension for PyTorch), or use the Intel IPEX XPU Docker image.
Refer to https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html
Edit requirements.txt, then install:
- onnxruntime-gpu --> onnxruntime
- opencc==1.1.6 --> opencc==1.1.7
Then: $ pip install -r requirements.txt
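
If you prefer to script the two requirements.txt edits, the following is a minimal sketch. It assumes each package sits on its own line exactly as named above (no extra version specifiers or markers); `patch_requirements` is a hypothetical helper, not part of Bert-VITS2.

```python
# Replacements from the step above: CPU onnxruntime wheel and a newer opencc pin.
REPLACEMENTS = {
    "onnxruntime-gpu": "onnxruntime",
    "opencc==1.1.6": "opencc==1.1.7",
}


def patch_requirements(text: str) -> str:
    """Return requirements text with the two replacements applied.

    Matching is case-insensitive and only on whole lines; any other
    line is passed through unchanged.
    """
    patched = []
    for line in text.splitlines():
        key = line.strip().lower()
        patched.append(REPLACEMENTS.get(key, line))
    return "\n".join(patched) + "\n"
```

To apply it, read requirements.txt, pass the text through `patch_requirements`, and write the result back before running pip.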

Download models

$ pip install openi
$ openi model download Stardust_minus/Bert-VITS2 Erlangshen-MegatronBert-1.3B --save_path ./Erlangshen-MegatronBert-1.3B
$ openi model download Stardust_minus/Bert-VITS2 Bert-VITS2_中文特化底模 --save_path ./Bert-VITS2_中文特化底模
$ openi model download Stardust_minus/Bert-VITS2 clap-htsat-fused --save_path ./clap-htsat-fused

# And move config.json
$ mv configs/config.json ./Data/
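
Before moving on, it can help to confirm that the downloads and the moved config landed where the commands above put them. This is only a sketch: the path list is copied from the commands in this step, and `missing_paths` is a hypothetical helper.

```python
from pathlib import Path

# Paths taken from the download and mv commands above, relative to the repo root.
EXPECTED = [
    "Erlangshen-MegatronBert-1.3B",
    "Bert-VITS2_中文特化底模",
    "clap-htsat-fused",
    "Data/config.json",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_paths(".")` from the repository root should return an empty list once everything is in place.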

Set the webui section of default_config.yml as shown below.

    webui:
      # Inference device
      device: "xpu"
      # Model path
      model: "Bert-VITS2_中文特化底模/G_0.pth"
      # Config file path
      config_path: "./config.json"
      # Port number
      port: 7860
      # Whether to deploy publicly (expose to the internet)
      share: false
      # Whether to enable debug mode
      debug: false
      # Whether to enable fp16 inference; reduces GPU memory usage by ~45%
      fp16_run: true
      # Language identification library: langid or fastlid
      language_identification_library: "langid"

Notes:

device: "xpu" runs inference on an Intel dGPU or iGPU, chosen according to their startup order. device: "cpu" runs on the Intel CPU.
fp16_run: true affects two models: "Erlangshen-MegatronBert-1.3B-Chinese" and "emotional/clap-htsat-fused/". Grep for "webui_config.fp16_run" to see where it is applied.
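
As a rough illustration of the device choice, the sketch below falls back to "cpu" when "xpu" is requested but no Intel GPU is usable. Both function names are hypothetical and not part of Bert-VITS2; the probe assumes IPEX exposes `torch.xpu.is_available()` once it is imported.

```python
def resolve_device(requested: str, xpu_available: bool) -> str:
    """Fall back to "cpu" when "xpu" is requested but no XPU is usable."""
    if requested == "xpu" and not xpu_available:
        return "cpu"
    return requested


def probe_xpu() -> bool:
    """Best-effort XPU check; returns False if torch or IPEX cannot be imported."""
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401  (registers torch.xpu)
    except Exception:
        return False
    xpu = getattr(torch, "xpu", None)
    return bool(xpu is not None and xpu.is_available())
```

For example, `resolve_device("xpu", probe_xpu())` gives the value you could write into the `device` field on a given machine.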

Fix code

$ git diff  g2pW/pypinyin_G2pW_bv2/g2pw.py
diff --git a/g2pW/pypinyin_G2pW_bv2/g2pw.py b/g2pW/pypinyin_G2pW_bv2/g2pw.py
index f236aae..3f78a5d 100644
--- a/g2pW/pypinyin_G2pW_bv2/g2pw.py
+++ b/g2pW/pypinyin_G2pW_bv2/g2pw.py
@@ -5,8 +5,8 @@ from pypinyin.core import Pinyin, Style
 from pypinyin.seg.simpleseg import simple_seg
 from pypinyin.converter import UltimateConverter
 from pypinyin.contrib.tone_convert import to_tone
-from .g2pw1 import G2PWOnnxConverter
-
+#from .g2pw1 import G2PWOnnxConverter
+from .g2pw1.onnx_api import G2PWOnnxConverter

 class G2PWPinyin(Pinyin):
     def __init__(

diff --git a/infer.py b/infer.py
index 5795b8b..3a173d6 100644
--- a/infer.py
+++ b/infer.py
@@ -11,6 +11,8 @@ import torch
 import commons
 from text import cleaned_text_to_sequence, get_bert

+import intel_extension_for_pytorch as ipex

Run

$ python webui.py -y default_config.yml

2. Performance improvement

Set a manual seed.
Here torch.randn affects the output of the duration predictor, so the y_lengths tensor can differ between runs for the same text input. That gives self.flow (TransformerCouplingBlock) and self.dec (Generator) dynamically shaped inputs, and dynamic shapes reduce performance. (We are continuously optimizing this.)
Calling torch.manual_seed(seed=seed) before randn fixes the duration predictor output for text inputs of the same length. The reference code is below.
Enable fp16 for part of the model "Bert-VITS2_中文特化底模/G_0.pth". The reference code is below.

$ git diff ./models.py
diff --git a/models.py b/models.py
index 301cc3c..7bd56b3 100644
--- a/models.py
+++ b/models.py
@@ -247,6 +247,7 @@ class StochasticDurationPredictor(nn.Module):
         else:
             flows = list(reversed(self.flows))
             flows = flows[:-2] + [flows[-1]]  # remove a useless vflow
+            torch.manual_seed(seed=42)
             z = (
                 torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype)
                 * noise_scale
@@ -1097,6 +1098,15 @@ class SynthesizerTrn(nn.Module):
         )  # [b, t', t], [b, t, d] -> [b, d, t']

         z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
-        z = self.flow(z_p, y_mask, g=g, reverse=True)
-        o = self.dec((z * y_mask)[:, :, :max_len], g=g)
+        self.flow = self.flow.to(torch.float16)
+        z_p_fp16 = z_p.to(torch.float16)
+        y_mask_fp16 = y_mask.to(torch.float16)
+        g_fp16 = g.to(torch.float16)
+        #z = self.flow(z_p, y_mask, g=g, reverse=True)
+        z_fp16 = self.flow(z_p_fp16, y_mask_fp16, g=g_fp16, reverse=True)
+        z = z_fp16.to(torch.float32)
+        self.dec = self.dec.to(torch.float16)
+        o = self.dec((z_fp16 * y_mask_fp16)[:, :, :max_len], g=g_fp16)
+        o = o.to(torch.float32)
         return o, attn, y_mask, (z, z_p, m_p, logs_p)
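
To see why the seed matters for shapes, here is a toy, torch-free stand-in for the duration predictor. It is not the Bert-VITS2 code: the function name and the constants 5.0 and 3.0 are arbitrary assumptions, and Python's `random` module replaces torch.randn. It only shows that unseeded noise changes the summed length, while a fixed seed keeps it constant for the same input length.

```python
import random


def predicted_length(num_phonemes: int, noise_scale: float = 0.8, seed=None) -> int:
    """Toy duration predictor: sum of noisy per-phoneme durations (in frames)."""
    rng = random.Random(seed)  # seed=None draws fresh entropy on every call
    total = 0
    for _ in range(num_phonemes):
        duration = 5.0 + rng.gauss(0.0, 1.0) * noise_scale * 3.0
        total += max(1, round(duration))  # every phoneme lasts at least one frame
    return total
```

With `seed=42` the result is identical on every call for a given `num_phonemes`, so downstream modules would see one fixed shape; with `seed=None` the length usually varies from call to call.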

3. Models

emotional/clap-htsat-fused

Docs:
Input: an emotion text prompt such as "happy". Output: an emotion embedding (CLAP text features in the joint audio-text space).
CLAP paper https://arxiv.org/abs/2211.06687
CLIP https://github.com/openai/CLIP
./emotional/clap-htsat-fused/README.md

model load and inference

## model load
infer.py
     emo = get_clap_text_feature(emotion, device)
       --> clap_wrapper.py    def get_clap_text_feature(text, device=config.bert_gen_config.device):
                    models[device] = ClapModel.from_pretrained(LOCAL_PATH)
                    processor = ClapProcessor.from_pretrained(LOCAL_PATH)

## model inference
clap_wrapper.py
        inputs = processor(text=text, return_tensors="pt").to(device)
        emb = models[device].get_text_features(**inputs).float()

g2pw

Improves the accuracy of pypinyin. Supports multiple pinyin styles.

Example

>>> from pypinyin import lazy_pinyin, Style
>>> style = Style.TONE3
>>> print(lazy_pinyin('聪明的小兔子', style=style))
['cong1', 'ming2', 'de', 'xiao3', 'tu4', 'zi']

model load and inference

## model load
onnx_api.py    class G2PWOnnxConverter
    self.session_g2pw = onnxruntime.InferenceSession(
        os.path.join(model_dir, "g2pW.onnx"),
        sess_options=sess_options,
        providers=["CUDAExecutionProvider"],
    )

## model inference
infer.py get_text()   --> clean_text()
        cleaner.py   clean_text() -> language_module.g2p(norm_text)
            chinese.py   _g2p(sentences)  --> pinyinPlus.lazy_pinyin()
                  site-packages/pypinyin  core.py   lazy_pinyin()   pinyin()   self._converter.convert()
                     g2pw.py   convert()  pys = self._to_pinyin()  ->  self._g2pw(han)
                        onnx_api.py   __call__()  --> predict()

In chinese.py the whole model is called twice, which is time-consuming:
      orig_initials = pinyinPlus.lazy_pinyin(
            allWords, neutral_tone_with_five=True, style=Style.INITIALS
        )
      orig_finals = pinyinPlus.lazy_pinyin(
            allWords, neutral_tone_with_five=True, style=Style.FINALS_TONE3
        )
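
One possible mitigation, which is not part of the original code, is to cache the expensive prediction so that the second call for the same text reuses the first result. The sketch below uses a tiny hypothetical lookup table in place of the g2pW ONNX model; all names in it are illustrative.

```python
from functools import lru_cache

# Hypothetical lookup standing in for the g2pW ONNX model: (initial, final+tone).
_FAKE_MODEL = {"你": ("n", "i3"), "好": ("h", "ao3")}


@lru_cache(maxsize=1024)
def predict(sentence: str):
    """Stand-in for the expensive model call; cached per input sentence."""
    return tuple(_FAKE_MODEL.get(ch, ("", ch)) for ch in sentence)


def initials(sentence: str):
    """Initials view, derived from the cached prediction."""
    return [initial for initial, _ in predict(sentence)]


def finals(sentence: str):
    """Finals-with-tone view, derived from the same cached prediction."""
    return [final for _, final in predict(sentence)]
```

After calling `initials("你好")` and then `finals("你好")`, `predict.cache_info()` reports one miss and one hit, meaning the stand-in model ran only once.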

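Erlangshen-MegatronBert
A Chinese BERT model. It is primarily an encoder architecture focused on natural language understanding tasks. Given the difficulty of Chinese grammar and of large-scale training, it uses four pre-training strategies to improve on BERT. Erlangshen-MegatronBert suits a range of natural language understanding tasks, including text generation, text classification, and question answering. Its weights and code are open source and can be found on platforms such as Hugging Face and CSDN blogs. It can be applied in areas such as AI voice simulation and digital-human virtual hosts. Three sizes are available: 710M, 1.3B, and 3.9B. The mid-sized 1.3B model is used here.

Input: text. Output: res, where len(res["hidden_states"]) = 25 and res["hidden_states"][0].shape = [1, 6, 2048].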

model load and inference

## model load
infer.py
    get_text()
         clean_text()
         get_bert()
       text/__init__.py    get_bert() -> get_bert_feature()
           text/chinese.py
            text/chinese_bert.py  
              get_bert_feature()
                  LOCAL_PATH = "./bert/Erlangshen-MegatronBert-1.3B-Chinese"
                  models[device] = MegatronBertModel.from_pretrained(LOCAL_PATH).to(device)

## model inference
                          res = models[device](**inputs, output_hidden_states=True)
                           Input: inputs is the Chinese text
                           Output res: len(res) = 3, res[0].shape = [1, 6, 2048], res[1].shape = [1, 2048]
                           len(res["hidden_states"]) = 25
                                   res["hidden_states"][0].shape = [1, 6, 2048]

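Bert-VITS2 中文特化底模 (Chinese-specialized base model)
It combines BERT's pre-training capability with VITS2 fine-tuning, aiming at high-quality, personalized speech synthesis. The model handles natural language processing tasks such as text-to-speech (TTS) and supports speech synthesis in different languages, especially Chinese and Japanese. By building on the Transformer architecture, Bert-VITS2 can generate highly natural speech with individual character.
Given the nature of the text-to-speech task, solutions can be complex. Earlier work addressed this by splitting waveform generation from input text into two cascaded stages. A popular approach generates an intermediate speech representation, such as a mel-spectrogram or linguistic features, from the input text in the first stage, and then generates the raw waveform conditioned on that representation in the second stage. Two-stage systems simplify each model and ease training, but they have the following limitations: 1) errors propagate from the first stage to the second; 2) instead of using representations learned inside the model, they are mediated by human-defined features such as mel-spectrograms or linguistic features; 3) generating the intermediate features costs extra computation. To address these limitations, single-stage models that generate waveforms directly from input text have recently been actively studied. Single-stage models not only outperform two-stage pipelines but have also shown the ability to generate high-quality speech that is nearly indistinguishable from human speech.
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a high-performance speech synthesis model that combines variational inference, normalizing flows, and adversarial training. VITS connects the acoustic model and the vocoder through latent variables rather than spectrograms, and applies stochastic modeling on the latent variables together with a stochastic duration predictor to increase the diversity of synthesized speech. During training, VITS computes mel-spectrograms to guide training (the reconstruction loss) and feeds linear spectrograms to the posterior encoder; at inference no spectrogram is needed, and the waveform is generated directly from text. The normalizing-flow-based variational inference and the adversarial learning both improve the expressiveness of the generative model.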
- https://readmedium.com/vits-text-to-speech-synthesis-935fdd778d82
- https://arxiv.org/abs/2106.06103
- https://blog.csdn.net/qq_39247879/article/details/132168384

model load and inference
```
## model load
webui.py
     net_g = get_net_g(model_path=config.webui_config.model, version=version, device=device, hps=hps)
                         _ = utils.load_checkpoint(model_path, net_g, None, skip_optimizer=True)
              models.py    class SynthesizerTrn(nn.Module):
 
## model inference                       
     infer.py     net_g.infer()  
              models.py    infer(phonemes, tone, language, bert, emo, ...)
```

4. Pipeline

graph TB
    Input_m(Input emotion)--> emo[get_clap_text_feature -> emotional/clap-htsat-fused]
    Input_t(Input Text)--> gettext_clean[get_text -> clean_text-> g2pW]
    Input_t(Input Text)--> gettext_bert[get_text -> get_bert -> Erlangshen-MegatronBert]
    emo -- emotion embedding --> bert[Bert-vits2 中文特化底模]
    gettext_clean-- phonemes, tone, word2ph --> bert[Bert-vits2 中文特化底模]
    gettext_bert -- bert --> bert[Bert-vits2 中文特化底模]
    bert --> audio[audio]

     infer.py --> infer()
          get_clap_text_feature(emotion) -> ./emotional/clap-htsat-fused/
          get_text()
                 clean_text() --> g2pW.onnx 
                 get_bert() -->  ./bert/Erlangshen-MegatronBert-1.3B-Chinese/
          models.py --> SynthesizerTrn() Bert-vits2 中文特化底模-> infer(phonemes, tone, language, bert, emo, ...)
                 self.enc_p() -->TextEncoder()    the encoder takes the sum of phonemes + tone + emo + bert + ... as its input
                 ......
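
The "sum everything" step in the TextEncoder can be pictured as an element-wise sum of feature streams that share the same [time][channels] shape. The sketch below uses plain Python lists instead of tensors and leaves out any projection layers the real encoder applies first; `sum_embeddings` is a hypothetical name.

```python
def sum_embeddings(*streams):
    """Element-wise sum of equally shaped [time][channels] feature streams."""
    first = streams[0]
    for stream in streams[1:]:
        same_time = len(stream) == len(first)
        if not same_time or any(len(a) != len(b) for a, b in zip(stream, first)):
            raise ValueError("all streams must share the same [time][channels] shape")
    # For each time step, add the matching channel of every stream together.
    return [
        [sum(values) for values in zip(*frames)]
        for frames in zip(*streams)
    ]
```

For instance, summing a two-step phoneme stream `[[1, 2], [3, 4]]` with a tone stream `[[10, 20], [30, 40]]` yields `[[11, 22], [33, 44]]`.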

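phonemes are the syllables (phonemes). For example, for "你好夏天", get_text() returns phone = [0, 62, 40, 37, 16, 98, 43, 78, 44, 0]. Every two elements represent one syllable, so the four characters are represented by the middle 8 elements.
tone is the tone of each phoneme (first, second, third, or fourth tone). For "你好夏天", get_text() returns tone = [0, 2, 2, 3, 3, 4, 4, 1, 1, 0].
word2ph is built in the code with word2ph.append(len(phone)). For "你好夏天", get_text() returns word2ph = [1, 2, 2, 2, 2, 1], which gives the number of phone elements used for each word (character).
The input of the VITS model, that is the TextEncoder input, is the sum of phonemes + tone + emo + bert + ...; see the code for details.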
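
The relationship between phone, tone, and word2ph can be checked with the example values above. The sketch also shows how a word-level feature sequence (such as the 6-token BERT output) could be repeated according to word2ph to line up with the 10 phone entries. The "[CLS]" and "[SEP]" labels for the two boundary entries and the helper name are assumptions made for illustration.

```python
# Example values for "你好夏天" quoted from the text above.
phone = [0, 62, 40, 37, 16, 98, 43, 78, 44, 0]
tone = [0, 2, 2, 3, 3, 4, 4, 1, 1, 0]
word2ph = [1, 2, 2, 2, 2, 1]


def expand_to_phone_level(word_features, word2ph):
    """Repeat each word-level feature word2ph[i] times to align with phones."""
    if len(word_features) != len(word2ph):
        raise ValueError("need exactly one feature per word2ph entry")
    expanded = []
    for feature, count in zip(word_features, word2ph):
        expanded.extend([feature] * count)
    return expanded


# Labels stand in for the per-token feature vectors (boundary labels are assumed).
word_features = ["[CLS]", "你", "好", "夏", "天", "[SEP]"]
phone_level = expand_to_phone_level(word_features, word2ph)

# All phone-level sequences must have the same length.
assert sum(word2ph) == len(phone) == len(tone) == len(phone_level)
```

Each character's feature appears twice in `phone_level`, matching the two phone elements per syllable, while the two boundary entries appear once.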
