---
language: vi
datasets:
- vivos
- common_voice
- FOSD
- VLSP
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech
- Transformer
- wav2vec2
- automatic-speech-recognition
- vietnamese
license: cc-by-nc-4.0
widget:
- example_title: common_voice_vi_30519758.mp3
  src: https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h/raw/main/examples/common_voice_vi_30519758.mp3
- example_title: VIVOSDEV15_020.wav
  src: https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h/raw/main/examples/VIVOSDEV15_020.wav
model-index:
- name: Wav2vec2 Base Vietnamese 160h
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 10.78
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 15.05
---
[Papers with Code: Common Voice vi leaderboard](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi?p=wav2vec2-base-vietnamese-160h)
[Papers with Code: VIVOS leaderboard](https://paperswithcode.com/sota/speech-recognition-on-vivos?p=wav2vec2-base-vietnamese-160h)
# Vietnamese Speech Recognition using Wav2vec 2.0
### Table of contents
1. [Model Description](#description)
2. [Implementation](#implementation)
3. [Benchmark Result](#benchmark)
4. [Example Usage](#example)
5. [Evaluation](#evaluation)
6. [Citation](#citation)
7. [Contact](#contact)
<a name = "description" ></a>
### Model Description
We fine-tuned the wav2vec2 base model on about 160 hours of Vietnamese speech collected from several sources, including [VIVOS](https://huggingface.co/datasets/vivos), [COMMON VOICE](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0), [FOSD](https://data.mendeley.com/datasets/k9sxg2twv4/4) and [VLSP 100h](https://drive.google.com/file/d/1vUSxdORDxk-ePUt-bUVDahpoXiqKchMx/view). We have not yet incorporated a language model into the ASR system, but it still achieves promising results.
<a name = "implementation" ></a>
### Implementation
We also provide code for pre-training and fine-tuning the wav2vec2 model. If you wish to train on your own dataset, check it out here:
- [Pre-train code](https://github.com/khanld/ASR-Wav2vec-Pretrain) (not available yet; will be released soon)
- [Fine-tune code](https://github.com/khanld/ASR-Wa2vec-Finetune)

<a name = "benchmark" ></a>
### Benchmark WER Result
| Decoding | [VIVOS](https://huggingface.co/datasets/vivos) | [COMMON VOICE 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) |
|---|---|---|
| without LM | 15.05 | 10.78 |
| with LM | in progress | in progress |

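While the LM-boosted numbers are still in progress, the sketch below shows one way an external n-gram language model could be plugged in at decoding time using `pyctcdecode` with a KenLM model. This is not the decoder used for the reported results; `vi_lm.arpa` is a placeholder for a language model you would have to provide yourself.
```python
import torch
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")

# Labels must be listed in vocabulary-index order; wav2vec2 marks word boundaries
# with "|", which pyctcdecode expects to see as a space.
vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [token.replace("|", " ") for token, _ in vocab]
decoder = build_ctcdecoder(labels, kenlm_model_path="vi_lm.arpa")  # placeholder LM path

def transcribe_with_lm(wav):
    # wav: 16 kHz mono waveform as a numpy array
    inputs = processor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0].cpu().numpy()
    return decoder.decode(logits)
```
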
<a name = "example" ></a>
### Example Usage [Open In Colab](https://colab.research.google.com/drive/1blz1KclnIfbOp8o2fW3WJgObOQ9SMGBo?usp=sharing)
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the pre-trained processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)

def transcribe(wav):
    # the model expects 16 kHz mono audio
    input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    # greedy CTC decoding: pick the most likely token at each frame
    pred_ids = torch.argmax(logits, dim=-1)
    pred_transcript = processor.batch_decode(pred_ids)[0]
    return pred_transcript

wav, _ = librosa.load('path/to/your/audio/file', sr=16000)
print(f"transcript: {transcribe(wav)}")
```
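
For quick experiments, the same checkpoint can also be run through the high-level `pipeline` API, which handles audio loading, resampling, and decoding internally. A minimal sketch, not from the original card:
```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="khanhld/wav2vec2-base-vietnamese-160h")
print(asr("path/to/your/audio/file")["text"])
```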

<a name = "evaluation"></a>
### Evaluation [Open In Colab](https://colab.research.google.com/drive/1XQCq4YGLnl23tcKmYeSwaksro4IgC_Yi?usp=sharing)

```python
import re
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio

wer = load_metric("wer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

# load the Common Voice 8.0 Vietnamese test split and resample to 16 kHz
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token="your_huggingface_auth_token")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
chars_to_ignore = r'[,?.!\-;:"“%\'�]'  # special characters stripped from references

# preprocess data: keep the raw waveform and a normalized transcript
def preprocess(batch):
    audio = batch["audio"]
    batch["input_values"] = audio["array"]
    batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
    return batch

# run inference with greedy CTC decoding
def inference(batch):
    input_values = processor(batch["input_values"],
                             sampling_rate=16000,
                             return_tensors="pt").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_transcript"] = processor.batch_decode(pred_ids)
    return batch

test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))
```
**Test Result**: 10.78%

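The VIVOS number in the benchmark table (15.05) should be reproducible with the same script by swapping only the dataset lines, assuming the `vivos` dataset script exposes the transcript in a `sentence` column the way Common Voice does:
```python
# hypothetical swap, untested: evaluate on the VIVOS test split instead
test_dataset = load_dataset("vivos", split="test")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
```
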
<a name = "citation" ></a>
### Citation
[Source code](https://github.com/khanld/ASR-Wa2vec-Finetune)
```text
@misc{Khanhld_Vietnamese_Wav2vec_Asr_2022,
  author = {Duy Khanh Le},
  doi = {10.5281/zenodo.6540979},
  month = {May},
  title = {Finetune Wav2vec 2.0 For Vietnamese Speech Recognition},
  url = {https://github.com/khanld/ASR-Wa2vec-Finetune},
  year = {2022}
}
```

<a name = "contact"></a>
### Contact
- khanhld218@uef.edu.vn
- [GitHub](https://github.com/)
- [LinkedIn](https://www.linkedin.com/in/khanhld257/)