Support replacing homophonic phrases #2153

Merged
merged 2 commits into from
Apr 27, 2025

csukuangfj
Collaborator

@csukuangfj csukuangfj commented Apr 27, 2025

Usage

  1. Generate a replace.fst. An example is available at
    https://colab.research.google.com/drive/1jEaS3s8FbRJIcVQJv2EQx19EM_mnuARi?usp=sharing

  2. Use it with a Chinese ASR model. Any ASR model from sherpa-onnx works, as long as it outputs Chinese text.

  3. Example with SenseVoice:

curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
tar xvf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
rm sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2


curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/dict.tar.bz2
tar xf dict.tar.bz2
rm dict.tar.bz2

curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/replace.fst
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/test-hr.wav
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/hr-files/lexicon.txt

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  --num-threads=2 \
  --sense-voice-use-itn=1 \
  --debug=0 \
  --hr-lexicon=./lexicon.txt \
  --hr-dict-dir=./dict \
  --hr-rule-fsts=./replace.fst \
  ./test-hr.wav

The output is given below:

./test-hr.wav
{"lang": "<|zh|>", "emotion": "<|NEUTRAL|>", "event": "<|Speech|>", "text": "现代测试名字识别,丹尼尔·波维林美丽、峤峤、球球、豆豆、橙橙、果果苗苗。", "timestamps": [0.66, 0.90, 1.20, 1.38, 1.80, 1.92, 2.16, 2.28, 2.58, 3.48, 3.66, 3.84, 4.08, 4.26, 5.58, 5.94, 6.12, 6.42, 7.44, 7.74, 8.10, 9.00, 9.36, 9.72, 10.68, 11.04, 11.40, 12.42, 12.66, 13.02, 13.86, 14.10, 15.30, 15.60, 16.74], "tokens":["现", "代", "测", "试", "名", "字", "识", "别", ",", "丹", "尼", "尔", "波", "为", "林", "美", "丽", "、", "乔", "乔", "、", "球", "球", "、", "豆", "豆", "、", "晨", "晨", "、", "果", "果", "苗", "苗", "。"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.931 s
Real time factor (RTF): 0.931 / 16.861 = 0.055

Without this PR, that is, without the three --hr-* options, the following command

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  --num-threads=2 \
  --sense-voice-use-itn=1 \
  --debug=0 \
  ./test-hr.wav

produces the following output:

./test-hr.wav
{"lang": "<|zh|>", "emotion": "<|NEUTRAL|>", "event": "<|Speech|>", "text": "现代测试名字识别,丹尼尔波为林美丽、乔乔、球球、豆豆、晨晨、果果苗苗。", "timestamps": [0.66, 0.90, 1.20, 1.38, 1.80, 1.92, 2.16, 2.28, 2.58, 3.48, 3.66, 3.84, 4.08, 4.26, 5.58, 5.94, 6.12, 6.42, 7.44, 7.74, 8.10, 9.00, 9.36, 9.72, 10.68, 11.04, 11.40, 12.42, 12.66, 13.02, 13.86, 14.10, 15.30, 15.60, 16.74], "tokens":["现", "代", "测", "试", "名", "字", "识", "别", ",", "丹", "尼", "尔", "波", "为", "林", "美", "丽", "、", "乔", "乔", "、", "球", "球", "、", "豆", "豆", "、", "晨", "晨", "、", "果", "果", "苗", "苗", "。"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.874 s
Real time factor (RTF): 0.874 / 16.861 = 0.052
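
The real time factor printed in both runs is the elapsed decoding time divided by the audio duration. As a rough check of the cost of the replacement step, the small sketch below recomputes it from the numbers in the two logs above. The helper name `real_time_factor` is made up for illustration, and since each figure comes from a single run, the difference is only indicative.

```python
def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; below 1 is faster than real time."""
    return elapsed_s / audio_s


AUDIO_SECONDS = 16.861  # duration of test-hr.wav as reported in the logs

with_hr = real_time_factor(0.931, AUDIO_SECONDS)     # run with the --hr-* options
without_hr = real_time_factor(0.874, AUDIO_SECONDS)  # run without them

print(f"with --hr-* options:    {with_hr:.3f}")
print(f"without --hr-* options: {without_hr:.3f}")
print(f"extra elapsed time:     {0.931 - 0.874:.3f} s")
```

In these two runs the replacement adds about 0.06 s on roughly 17 s of audio.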

Compare the two "text" fields below (the "tokens" and "timestamps" fields are identical in both runs):

  • With this PR: 现代测试名字识别,丹尼尔·波维林美丽、峤峤、球球、豆豆、橙橙、果果苗苗。
  • Without this PR: 现代测试名字识别,丹尼尔波为林美丽、乔乔、球球、豆豆、晨晨、果果苗苗。
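
To see exactly which characters changed, the two strings can be compared with Python's standard difflib module. This is only a convenience sketch for inspecting the results; the helper name `changed_spans` is made up for illustration and is not part of sherpa-onnx.

```python
import difflib

without_pr = "现代测试名字识别,丹尼尔波为林美丽、乔乔、球球、豆豆、晨晨、果果苗苗。"
with_pr = "现代测试名字识别,丹尼尔·波维林美丽、峤峤、球球、豆豆、橙橙、果果苗苗。"


def changed_spans(before: str, after: str) -> list[tuple[str, str]]:
    """Return (old, new) pairs for every span that differs between the strings."""
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    return [
        (before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for old, new in changed_spans(without_pr, with_pr):
    print(repr(old), "->", repr(new))
```

The differences are the inserted "·", 为 replaced by 维, 乔乔 replaced by 峤峤, and 晨晨 replaced by 橙橙, which correspond to rule1/rule10, rule4, and rule30 in the script that generates replace.fst.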

If you don't have access to the Colab notebook, here is the code for generating replace.fst:

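import pynini
from pynini import cdrewrite
from pynini.lib import utf8

# Any sequence of valid UTF-8 characters; the alphabet over which the rule is applied.
sigma = utf8.VALID_UTF8_CHAR.star

# Each rule maps a pinyin string (with tone digits) to the phrase that should be output.
# rule10 and rule30 map an alternative-tone or near-homophone pinyin
# to the same phrase as rule1 and rule3.
rule1 = pynini.cross("dan1ni2er3bo1wei2", "丹尼尔·波维")
rule10 = pynini.cross("dan1ni2er3bo1wei4", "丹尼尔·波维")
rule2 = pynini.cross("dou4dou4", "豆豆")
rule3 = pynini.cross("cheng2cheng2", "橙橙")
rule30 = pynini.cross("chen2chen2", "橙橙")
rule4 = pynini.cross("qiao2qiao2", "峤峤")
rule5 = pynini.cross("qiu2qiu2", "球球")
rule6 = pynini.cross("lin2mei3li4", "林美丽")
rule7 = pynini.cross("guo3guo3", "果果")
rule8 = pynini.cross("miao2miao2", "苗苗")

# Take the union of all rules and apply it everywhere (empty left and right context).
rule = (rule1 | rule10 | rule2 | rule3 | rule30 | rule4 | rule5 | rule6 | rule7 | rule8).optimize()
rule = cdrewrite(rule, "", "", sigma)

rule.write("replace.fst")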

Note that you need to install pynini with

pip install --only-binary :all: pynini

which makes pip use a prebuilt wheel instead of building pynini from source.
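
For readers who want an intuition for what the rules do without installing pynini, here is a minimal pure-Python sketch of the same idea: convert characters to pinyin with tone digits, then replace any pinyin sequence that matches a rule with the target phrase. It is not how sherpa-onnx implements the feature. The table `CHAR_TO_PINYIN` is a hypothetical stand-in for the role that lexicon.txt presumably plays, and the function name `replace_homophones` is made up for illustration.

```python
# Hypothetical character-to-pinyin table (assumption: lexicon.txt provides
# this kind of mapping in the real setup).
CHAR_TO_PINYIN = {
    "乔": "qiao2",
    "晨": "chen2",
    "豆": "dou4",
    "球": "qiu2",
}

# A few of the pinyin -> phrase pairs from the script that generates replace.fst.
RULES = {
    ("qiao2", "qiao2"): "峤峤",
    ("chen2", "chen2"): "橙橙",
    ("dou4", "dou4"): "豆豆",
    ("qiu2", "qiu2"): "球球",
}


def replace_homophones(text: str) -> str:
    """Replace character runs whose pinyin matches a rule; keep everything else."""
    chars = list(text)
    # Characters without a pinyin entry map to None and never match a rule.
    syllables = [CHAR_TO_PINYIN.get(c) for c in chars]
    max_len = max(len(key) for key in RULES)
    out = []
    i = 0
    while i < len(chars):
        # Try the longest candidate first so longer phrases win over shorter ones.
        for n in range(min(max_len, len(chars) - i), 0, -1):
            key = tuple(syllables[i:i + n])
            if key in RULES:
                out.append(RULES[key])
                i += n
                break
        else:
            # No rule matched at this position: copy the character unchanged.
            out.append(chars[i])
            i += 1
    return "".join(out)


print(replace_homophones("乔乔、球球、豆豆、晨晨"))
```

Because matching happens on pinyin rather than on characters, 晨晨 (chen2chen2) is rewritten to 橙橙 even though the characters differ, which is the behavior rule30 encodes.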

@csukuangfj csukuangfj merged commit f64c583 into k2-fsa:master Apr 27, 2025
144 of 213 checks passed
@csukuangfj csukuangfj deleted the replace-words branch April 27, 2025 07:31
@csukuangfj csukuangfj changed the title Support replacing homonphonic phrases Support replacing homophonic phrases May 20, 2025