Support replacing homophonic phrases #2153

csukuangfj · 2025-04-27T07:25:48Z

Usage

Generate a replace.fst. You can find an example at
https://colab.research.google.com/drive/1jEaS3s8FbRJIcVQJv2EQx19EM_mnuARi?usp=sharing
Use it with a Chinese ASR model. Any ASR model from sherpa-onnx works, as long as it outputs Chinese text.
Example with SenseVoice:

# Download and unpack the SenseVoice model
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
tar xvf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
rm sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2

# Download and unpack the dictionary directory passed to --hr-dict-dir
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/dict.tar.bz2
tar xf dict.tar.bz2
rm dict.tar.bz2

# Download the replacement rule, a test wave file, and the lexicon
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/replace.fst
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/test-hr.wav
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/lexicon.txt

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  --num-threads=2 \
  --sense-voice-use-itn=1 \
  --debug=0 \
  --hr-lexicon=./lexicon.txt \
  --hr-dict-dir=./dict \
  --hr-rule-fsts=./replace.fst \
  ./test-hr.wav

The output is given below:

./test-hr.wav
{"lang": "<|zh|>", "emotion": "<|NEUTRAL|>", "event": "<|Speech|>", "text": "现代测试名字识别，丹尼尔·波维林美丽、峤峤、球球、豆豆、橙橙、果果苗苗。", "timestamps": [0.66, 0.90, 1.20, 1.38, 1.80, 1.92, 2.16, 2.28, 2.58, 3.48, 3.66, 3.84, 4.08, 4.26, 5.58, 5.94, 6.12, 6.42, 7.44, 7.74, 8.10, 9.00, 9.36, 9.72, 10.68, 11.04, 11.40, 12.42, 12.66, 13.02, 13.86, 14.10, 15.30, 15.60, 16.74], "tokens":["现", "代", "测", "试", "名", "字", "识", "别", "，", "丹", "尼", "尔", "波", "为", "林", "美", "丽", "、", "乔", "乔", "、", "球", "球", "、", "豆", "豆", "、", "晨", "晨", "、", "果", "果", "苗", "苗", "。"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.931 s
Real time factor (RTF): 0.931 / 16.861 = 0.055

For comparison, the same command without the three --hr-* options added by this PR

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  --num-threads=2 \
  --sense-voice-use-itn=1 \
  --debug=0 \
  ./test-hr.wav

has the following output:

./test-hr.wav
{"lang": "<|zh|>", "emotion": "<|NEUTRAL|>", "event": "<|Speech|>", "text": "现代测试名字识别，丹尼尔波为林美丽、乔乔、球球、豆豆、晨晨、果果苗苗。", "timestamps": [0.66, 0.90, 1.20, 1.38, 1.80, 1.92, 2.16, 2.28, 2.58, 3.48, 3.66, 3.84, 4.08, 4.26, 5.58, 5.94, 6.12, 6.42, 7.44, 7.74, 8.10, 9.00, 9.36, 9.72, 10.68, 11.04, 11.40, 12.42, 12.66, 13.02, 13.86, 14.10, 15.30, 15.60, 16.74], "tokens":["现", "代", "测", "试", "名", "字", "识", "别", "，", "丹", "尼", "尔", "波", "为", "林", "美", "丽", "、", "乔", "乔", "、", "球", "球", "、", "豆", "豆", "、", "晨", "晨", "、", "果", "果", "苗", "苗", "。"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.874 s
Real time factor (RTF): 0.874 / 16.861 = 0.052
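To put the two timing reports side by side, the following minimal sketch recomputes the real time factor (elapsed seconds divided by audio duration) from the numbers printed above and estimates the extra time spent when the --hr-* options are enabled. The variable names are illustrative, and the figures come from a single run each, so this is a rough indication rather than a benchmark.

```python
# Numbers copied from the two runs above.
audio_seconds = 16.861
elapsed_with_hr = 0.931     # run with the --hr-* options
elapsed_without_hr = 0.874  # run without them

# Real time factor = processing time / audio duration.
rtf_with_hr = elapsed_with_hr / audio_seconds
rtf_without_hr = elapsed_without_hr / audio_seconds

# Extra time attributable to the replacement step in this single run.
overhead_seconds = elapsed_with_hr - elapsed_without_hr
overhead_percent = 100.0 * overhead_seconds / elapsed_without_hr

print(f"RTF with replacement:    {rtf_with_hr:.3f}")
print(f"RTF without replacement: {rtf_without_hr:.3f}")
print(f"Overhead: {overhead_seconds:.3f} s ({overhead_percent:.1f}%)")
```

In this run the difference is about 0.057 s on roughly 17 seconds of audio.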

Compare the results below:

With this PR: 现代测试名字识别，丹尼尔·波维林美丽、峤峤、球球、豆豆、橙橙、果果苗苗。
Without this PR: 现代测试名字识别，丹尼尔波为林美丽、乔乔、球球、豆豆、晨晨、果果苗苗。
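The differences between the two lines are easy to miss by eye. As a small sketch (not part of this PR), Python's standard difflib module can list the spans that changed; the helper name changed_spans is made up for this illustration.

```python
import difflib

without_pr = "现代测试名字识别，丹尼尔波为林美丽、乔乔、球球、豆豆、晨晨、果果苗苗。"
with_pr = "现代测试名字识别，丹尼尔·波维林美丽、峤峤、球球、豆豆、橙橙、果果苗苗。"


def changed_spans(old, new):
    """Return (old_piece, new_piece) pairs for every span that differs."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = changed_spans(without_pr, with_pr)
for old_piece, new_piece in changes:
    print(f"{old_piece!r} -> {new_piece!r}")
```

The reported spans should correspond to the phrases covered by the replacement rules, for example 乔乔 becoming 峤峤 and 晨晨 becoming 橙橙.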

If you don't have access to the Colab notebook, here is the code for generating replace.fst:

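import pynini
from pynini import cdrewrite
from pynini.lib import utf8

# Any sequence of valid UTF-8 characters; the rewrite rule operates over this alphabet.
sigma = utf8.VALID_UTF8_CHAR.star

# Each rule maps a pinyin string (with tone digits) to the phrase that replaces it.
# Several pinyin variants can map to the same phrase (wei2/wei4, cheng2/chen2).
rule1 = pynini.cross("dan1ni2er3bo1wei2", "丹尼尔·波维")
rule10 = pynini.cross("dan1ni2er3bo1wei4", "丹尼尔·波维")
rule2 = pynini.cross("dou4dou4", "豆豆")
rule3 = pynini.cross("cheng2cheng2", "橙橙")
rule30 = pynini.cross("chen2chen2", "橙橙")
rule4 = pynini.cross("qiao2qiao2", "峤峤")
rule5 = pynini.cross("qiu2qiu2", "球球")
rule6 = pynini.cross("lin2mei3li4", "林美丽")
rule7 = pynini.cross("guo3guo3", "果果")
rule8 = pynini.cross("miao2miao2", "苗苗")

# Take the union of all rules, then apply it in any context (empty left and right contexts).
rule = (rule1 | rule10 | rule2 | rule3 | rule30 | rule4 | rule5 | rule6 | rule7 | rule8).optimize()
rule = cdrewrite(rule, "", "", sigma)

# Save the compiled rule; pass this file to --hr-rule-fsts.
rule.write("replace.fst")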

Note that you need to use

pip install --only-binary :all: pynini

to install pynini
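To make the idea behind the rules concrete, here is a toy, pure-Python sketch of homophone replacement. It is not how sherpa-onnx implements the feature: it assumes the recognized text has already been split into words, TOY_LEXICON is a hypothetical stand-in for lexicon.txt, and a plain dictionary lookup stands in for the compiled FST. It only illustrates why mapping a word to its pinyin first lets 乔乔 and 峤峤, which sound the same, end up as the same output.

```python
# Toy illustration only; NOT the sherpa-onnx implementation.

# Hypothetical character -> pinyin table (a stand-in for lexicon.txt).
TOY_LEXICON = {
    "乔": "qiao2",
    "峤": "qiao2",
    "晨": "chen2",
    "橙": "cheng2",
    "豆": "dou4",
}

# Pinyin -> phrase pairs, copied from four of the rules in the script above.
RULES = {
    "qiao2qiao2": "峤峤",
    "chen2chen2": "橙橙",
    "cheng2cheng2": "橙橙",
    "dou4dou4": "豆豆",
}


def replace_word(word):
    """Replace a word if its pinyin matches a rule; otherwise keep it unchanged."""
    if any(ch not in TOY_LEXICON for ch in word):
        return word  # no pronunciation available, so leave the word as-is
    pinyin = "".join(TOY_LEXICON[ch] for ch in word)
    return RULES.get(pinyin, word)


def replace_words(words):
    """Apply replace_word to each word of an already segmented sentence."""
    return [replace_word(w) for w in words]


print("".join(replace_words(["乔乔", "、", "晨晨", "、", "豆豆"])))
```

Because the lookup key is the pronunciation rather than the characters, a rule such as chen2chen2 can also correct a near-homophone that the recognizer tends to output.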

csukuangfj added 2 commits April 25, 2025 16:46

Update kaldifst to v1.7.13

3c0f7b8

Support replacing homonphonic phrases

7291174

csukuangfj merged commit f64c583 into k2-fsa:master Apr 27, 2025
144 of 213 checks passed

csukuangfj deleted the replace-words branch April 27, 2025 07:31

csukuangfj mentioned this pull request May 7, 2025

Which ASR model has better accuracy? / Which ASR model is recommended? #1906

Closed

This was referenced May 14, 2025

sense_voice add hotwords thewh1teagle/sherpa-rs#99

Closed

Support replacing homonphonic phrases thewh1teagle/sherpa-rs#105

Open

csukuangfj changed the title ~~Support replacing homonphonic phrases~~ Support replacing homophonic phrases May 20, 2025

This was referenced May 21, 2025

A primer on using hotwords with offline models #2232

Closed

[bug] Memory leak when transcribing a 1-hour meeting recording with the latest Python version #2239

Closed

Hotwords approach for CTC decoding #1797

Closed
