- Overview
- Model Key Features
- Dataset and Model available on Hugging Face
- Model Test Example Results
- Evaluation Results
- The Fine-Tuning Process
- Running the Model via Gradio and FastAPI Interfaces
- Setup for Faster Whisper with Timestamps using pr0mila-gh0sh/MediBeng-Whisper-Tiny
- Limitations
- Known Issue in Current Release
- Ethical Considerations
- Blog Post
- License
- Citation for Research Use
Doctor-patient transcription is a key task for many generative AI solutions in the clinical domain. It becomes especially difficult when speech is code-switched, meaning multiple languages are mixed during a conversation, which is common in multilingual healthcare environments.
To address this challenge, we fine-tuned the Whisper Tiny model to translate code-switched Bengali and English speech into one language, making it easier for tasks like analysis, record-keeping, or integrating with other AI models. After fine-tuning, we obtained the MediBeng Whisper Tiny model.
The dataset, MediBeng, has been created using synthetic data generation to simulate code-switched conversations, allowing for more robust training of the model. This dataset was used in the fine-tuning process.
This solution is designed to transcribe and translate code-switched Bengali-English conversations into English in clinical settings, helping practitioners process the information more effectively and use it for patient records or decision-making. Fine-tuning reduced the evaluation Word Error Rate from over 100 to around 30 (see Evaluation Results below).
- Base model: openai/whisper-tiny
- Fine-tuned for: Translation task (code-mixed Bengali-English → English)
- Domain: Clinical/Medical
- Language support: Code-mixed Bengali-English (input), English (output)
- Fine-tuning approach: A simple, easy-to-use fine-tuning script is provided to streamline model adaptation for the translation task.
- Open-source: Both the model and the dataset (MediBeng) are open-source, and the entire process, including fine-tuning, is available for public use and contribution.
📂 Dataset: Check out the MediBeng (20% subset) dataset used to fine-tune this model! This dataset includes synthetic code-switched clinical conversations in Bengali and English. It is designed to help train models for tasks like speech recognition (ASR), text-to-speech (TTS), and machine translation, focusing on bilingual code-switching in healthcare settings.
🔗 Full Dataset Link: MediBeng Dataset
🔧 Dataset Parquet File Creation: Here's how I loaded the dataset to Hugging Face!
🔗 Repo Link for Parquet-to-HuggingFace Process: Parquet-to-HuggingFace Process
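If you just want to explore the data before fine-tuning, a minimal sketch for loading it from the Hub with the `datasets` library could look like the following (the split name and printed fields are assumptions; check the dataset card for the exact schema):

```python
# Minimal sketch: load the MediBeng dataset from the Hugging Face Hub.
from datasets import load_dataset

# The "train" split name is an assumption; see the dataset card for the available splits.
medibeng = load_dataset("pr0mila-gh0sh/MediBeng", split="train")

print(medibeng)      # number of rows and column names
print(medibeng[0])   # one code-switched utterance with its audio and English translation
```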
You can access the fine-tuned model on Hugging Face using the link below:
🔗 Model Link: MediBeng-Whisper-Tiny
Below are example results comparing the reference (actual) translations with the outputs of the base Whisper Tiny model and the fine-tuned MediBeng Whisper Tiny model.
Audio Name | Code-Switched Bengali-English Clinical Conversation (Actual) | Actual Translation | Whisper Tiny Translation | MediBeng Whisper-Tiny Translation 🚀 |
---|---|---|---|---|
Female-Bengali-English-2045 | আপনার শারীরিক পরীক্ষা করতে হবে, and we will schedule that shortly. | You need a physical check-up, and we will schedule that shortly. | You can cut the hair and we will schedule that shortly. | You need a physical check-up, and we will schedule that shortly. |
Female-Bengali-English-2065 | আপনার রক্তচাপ খুব বেশি, but আমরা monitor করতে থাকব। | Your blood pressure is very high, but we will keep monitoring. | You can also find out about the national rock club in the city of Baitamra. | Your blood pressure is very high, but we will keep monitoring. |
Female-Bengali-English-2072 | আপনার শরীরের ব্যথা অনেক, please let me know if it’s severe. | You have a lot of body pain, please let me know if it’s severe. | Please let me know if it's a way out. | You have a lot of body pain, please let me know if it’s severe. |
Male-Bengali-English-1959 | আপনার শরীরের তাপমাত্রা 103°F, which indicates a fever. | Your body temperature is 103°F, which indicates a fever. | You should read it, the mantra actually, the Indigree Fahrenheit which indicates a fever. | Your body temperature is 103°F, which indicates a fever. |
Male-Bengali-English-2372 | আপনার হাতের আঙুলে কিছু সমস্যা হয়েছে, Let me take a closer look. | You have some issues with your fingers, Let me take a closer look. | You were the one who was the one who was the famous man. Let me take a closer look. | You have some issues with your fingers, Let me take a closer look. |
Male-Bengali-English-2338 | আপনি কি নিয়মিত ব্যায়াম করেন? It’s essential for overall health. | Do you exercise regularly? It’s essential for overall health. | You need a new me to BAM KORIN, it's essential for overall health. | Do you exercise regularly? It’s essential for overall health. |
The audio files used for these examples are stored in the `tests/data` directory of the repository, for example: `tests/data/Female-Bengali-English-2045.wav`
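To reproduce a row of this table yourself, a minimal sketch using the `transformers` ASR pipeline could look like this (it assumes `ffmpeg` is available for audio decoding; if the saved generation config does not already request translation, pass the task explicitly as shown in the comment):

```python
# Minimal sketch: translate one of the bundled test clips with the fine-tuned model.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
asr = pipeline(
    "automatic-speech-recognition",
    model="pr0mila-gh0sh/MediBeng-Whisper-Tiny",
    device=device,
)

result = asr("tests/data/Female-Bengali-English-2045.wav")
# If needed, force the translation task explicitly:
# result = asr("tests/data/Female-Bengali-English-2045.wav", generate_kwargs={"task": "translate"})

print(result["text"])  # expected: an English translation of the code-switched utterance
```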
The model's performance improved as the training progressed, showing consistent reduction in training loss and Word Error Rate (WER) on the evaluation set.
Epoch | Training Loss | Training Grad Norm | Learning Rate | Eval Loss | Eval WER |
---|---|---|---|---|---|
0.03 | 2.6213 | 61.56 | 4.80E-06 | - | - |
0.07 | 1.609 | 44.09 | 9.80E-06 | 1.13 | 107.72 |
0.1 | 0.7685 | 52.27 | 9.47E-06 | - | - |
0.13 | 0.4145 | 32.27 | 8.91E-06 | 0.37 | 47.53 |
0.16 | 0.3177 | 17.98 | 8.36E-06 | - | - |
0.2 | 0.222 | 7.7 | 7.80E-06 | 0.1 | 45.19 |
0.23 | 0.0915 | 1.62 | 7.24E-06 | - | - |
0.26 | 0.081 | 0.4 | 6.69E-06 | 0.04 | 38.35 |
0.33 | 0.0246 | 1.01 | 5.58E-06 | - | - |
0.36 | 0.0212 | 2.2 | 5.02E-06 | 0.01 | 41.88 |
0.42 | 0.0052 | 0.13 | 3.91E-06 | - | - |
0.46 | 0.0023 | 0.45 | 3.36E-06 | 0.01 | 34.07 |
0.52 | 0.0013 | 0.05 | 1.69E-06 | - | - |
0.55 | 0.0032 | 0.11 | 1.13E-06 | 0.01 | 29.52 |
0.62 | 0.001 | 0.09 | 5.78E-07 | - | - |
0.65 | 0.0012 | 0.08 | 2.22E-08 | 0 | 30.49 |
- Training Loss: The training loss decreases consistently, indicating the model is learning well.
- Eval Loss: The evaluation loss decreases significantly, showing that the model is generalizing well to unseen data.
- Eval WER: The Word Error Rate (WER) decreases over the course of training, indicating the model is getting better at translating code-switched Bengali-English speech into English (a minimal WER computation sketch follows below).
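For reference, this is how a WER figure like those in the table can be computed with the `evaluate` package (already listed in `requirements.txt`); the two sentences here are purely illustrative:

```python
# Minimal sketch: compute Word Error Rate (WER) with the `evaluate` package (backed by jiwer).
import evaluate

wer_metric = evaluate.load("wer")

references = ["You need a physical check-up, and we will schedule that shortly."]
predictions = ["You need a physical check up and we will schedule that shortly."]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}")  # reported as a percentage, as in the table above
```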
Open a terminal on your machine and use the following command to clone the repository:
git clone https://github.com/pr0mila/MediBeng-Whisper-Tiny.git
cd MediBeng-Whisper-Tiny
To set up the environment, create and activate a conda environment:
conda create --name med-whisper-tiny python=3.9
conda activate med-whisper-tiny
Run the following command to install the packages listed in the requirements.txt file:
pip install -r requirements.txt
Or install the packages directly:
pip install torch transformers datasets librosa evaluate soundfile tensorboard jiwer accelerate
pip install fastapi uvicorn transformers librosa pydantic
pip install gradio==3.0.22
The configuration parameters for the model, dataset, and repository are defined in the `config/config.py` file. For translated transcription, make sure to update the `LANGUAGE` and `TASK` variables as follows:
MODEL_NAME = "openai/whisper-tiny"
LANGUAGE = "English"
TASK = "translate"
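For context, these settings typically end up configuring the Whisper processor and the decoder prompt. The snippet below shows the standard `transformers` pattern for that wiring; it is a sketch, not necessarily the exact code used in this repository:

```python
# Minimal sketch: how LANGUAGE and TASK are typically applied to a Whisper model.
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_NAME = "openai/whisper-tiny"
LANGUAGE = "English"
TASK = "translate"

processor = WhisperProcessor.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Force the decoder to emit English translation tokens during generation.
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language=LANGUAGE, task=TASK)
```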
The dataset is loaded and stored in the `data` folder, which is created by running the data processing code in `data_loader.py`. A 20% subset of the dataset is used for training and testing; this configuration is defined and controlled in `data_loader.py`.
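The exact subsetting logic lives in `data_loader.py`; as a rough illustration, a 20% subset split with the `datasets` library can look like the sketch below (the split names and random seed are assumptions):

```python
# Minimal sketch: keep 20% of each split of the MediBeng dataset for fine-tuning and testing.
from datasets import load_dataset

full = load_dataset("pr0mila-gh0sh/MediBeng")  # assumes "train" and "test" splits exist

train_subset = full["train"].shuffle(seed=42).select(range(int(0.2 * len(full["train"]))))
test_subset = full["test"].shuffle(seed=42).select(range(int(0.2 * len(full["test"]))))

print(len(train_subset), len(test_subset))
```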
A simple script has been created to easily run the fine-tuning for the translation task. To start training the model, run the following command:
python main.py
After setting up your Hugging Face token, follow these steps to upload your model to Hugging Face:
- Set your Hugging Face token as an environment variable:
Replace "your_hugging_face_token" with the actual token value.
export HF_TOKEN="your_hugging_face_token"
- Set the `OUTPUT_DIR` and `REPO_ID` in your `config/config.py` file:
  - `OUTPUT_DIR`: Directory where your model is saved.
  - `REPO_ID`: Your Hugging Face repository ID (the name of the repository where the model will be uploaded).
Example:
OUTPUT_DIR = "path/to/your/model"
REPO_ID = "your_huggingface_repo_id"
- Run the following command to upload your model to Hugging Face:
python upload_model/upload_model.py
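For reference, a minimal sketch of what such an upload step can look like with the `huggingface_hub` library is shown below; it is illustrative only and not necessarily what `upload_model/upload_model.py` does internally:

```python
# Minimal sketch: push a saved model folder to the Hugging Face Hub.
import os
from huggingface_hub import HfApi

from config.config import OUTPUT_DIR, REPO_ID

api = HfApi(token=os.environ["HF_TOKEN"])          # token set via `export HF_TOKEN=...`
api.create_repo(repo_id=REPO_ID, exist_ok=True)    # create the target repo if it does not exist
api.upload_folder(folder_path=OUTPUT_DIR, repo_id=REPO_ID)
```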
The model testing script is available in the `tests` directory.
This repository provides two ways to interact with the fine-tuned MediBeng Whisper Tiny model:
- Gradio Interface: A user-friendly web interface to upload audio files and get transcriptions.
- FastAPI API: A programmatic way to interact with the model through a RESTful API endpoint.
The Gradio Interface allows you to quickly test the model using a web-based user interface.
- Navigate to the `app` directory where the Gradio interface is located.
- Run the Gradio interface:
python app/gradio_interface.py
- Once the script runs, Gradio will provide a local link. Access the Gradio interface in your web browser at: http://127.0.0.1:7860
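For orientation, a Gradio app wrapping the model can be as small as the sketch below (illustrative only; the actual implementation is in `app/gradio_interface.py`, and the component arguments follow the pinned gradio 3.x API):

```python
# Minimal sketch: a Gradio web UI that translates an uploaded audio file.
import gradio as gr
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="pr0mila-gh0sh/MediBeng-Whisper-Tiny")

def translate_audio(audio_path):
    # The pipeline returns a dict; the translated English text is under "text".
    return asr(audio_path)["text"]

demo = gr.Interface(
    fn=translate_audio,
    inputs=gr.Audio(source="upload", type="filepath"),
    outputs="text",
    title="MediBeng Whisper Tiny",
)
demo.launch()  # serves on http://127.0.0.1:7860 by default
```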
If you prefer to interact with the model programmatically, you can use the FastAPI API to send audio files and get transcriptions via a RESTful API.
- Run the FastAPI server using the following command:
uvicorn app.api:app --reload
- Once the server is running, you can access the API documentation for the `/transcribe` endpoint at: http://localhost:8000
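Once the server is up, the endpoint can also be called programmatically. Below is a minimal client sketch using `requests`; the multipart field name (`file`) is an assumption, so check `app/api.py` or the interactive docs for the exact request schema:

```python
# Minimal sketch: send a test clip to the /transcribe endpoint and print the response.
import requests

with open("tests/data/Female-Bengali-English-2045.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("Female-Bengali-English-2045.wav", f, "audio/wav")},
    )

print(response.status_code)
print(response.json())
```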
Faster Whisper runs Whisper models on CTranslate2 for faster, more efficient transcription with timestamps. To use Faster Whisper with the `pr0mila-gh0sh/MediBeng-Whisper-Tiny` model, follow these steps:
For more detailed steps on using Faster Whisper, refer to the official Faster Whisper GitHub repository.
- Install dependencies:
pip install "transformers[torch]>=4.23"
pip install ctranslate2
- Convert the `MediBeng-Whisper-Tiny` model to CTranslate2:
ct2-transformers-converter --model pr0mila-gh0sh/MediBeng-Whisper-Tiny --output_dir medi-beng-whisper-tiny-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
- Use the converted model: The CTranslate2 model will be automatically downloaded when loading the `pr0mila-gh0sh/MediBeng-Whisper-Tiny` model.
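After conversion, the local CTranslate2 directory can be loaded with the `faster-whisper` package (installed separately via `pip install faster-whisper`). The sketch below assumes the output directory from the conversion command above and one of the bundled test clips:

```python
# Minimal sketch: run the converted model with faster-whisper and print segment timestamps.
from faster_whisper import WhisperModel

model = WhisperModel("medi-beng-whisper-tiny-ct2", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "tests/data/Female-Bengali-English-2045.wav",
    task="translate",
)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```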
- Accents: The model may struggle with very strong regional accents or non-native speakers of Bengali and English.
- Specialized Terms: The model may not perform well with highly specialized medical terms or out-of-domain speech.
- Multilingual Support: While the model is designed for Bengali and English, other languages are not supported.
- Real-Time Processing: The current solution interfaces (FastAPI and Gradio) are implemented for batch processing. Real-time processing is not currently supported, and the system may not provide immediate transcriptions or translations during live interactions.
- Evaluation currently uses Word Error Rate (WER) during training.
- WER is not ideal for translation tasks.
- Future updates will include BLEU, METEOR, or chrF++ metrics for more accurate evaluation; a small sketch of such metrics is shown below.
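As an illustration of how translation-oriented metrics could complement WER, the sketch below computes BLEU and chrF with the `evaluate` package (this assumes `sacrebleu` is installed, e.g. `pip install sacrebleu`; the sentence pair is purely illustrative):

```python
# Minimal sketch: translation metrics (BLEU via sacrebleu, and chrF) with the `evaluate` package.
import evaluate

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

predictions = ["You need a physical check-up, and we will schedule that shortly."]
references = [["You need a physical check-up, and we will schedule that shortly."]]

print("BLEU:", bleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])
```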
- Biases: The training data may contain biases based on the demographics of the speakers, such as gender, age, and accent.
- Misuse: Like any ASR system, this model could be misused to create fake transcripts of audio recordings, potentially leading to privacy and security concerns.
- Fairness: Ensure the model is used in contexts where fairness and ethical considerations are taken into account, particularly in clinical environments.
I’ve written a detailed blog post on Medium about MediBeng Whisper-Tiny and how it translates code-switched Bengali-English speech in healthcare. In this post, I discuss the dataset creation, model fine-tuning, and how this can improve healthcare transcription.
Read the full article here: MediBeng Whisper-Tiny: Translating Code-Switched Bengali-English Speech for Healthcare
This model is based on the Whisper-Tiny model by OpenAI available on Hugging Face. The original model is licensed under the Apache-2.0 license.
This fine-tuned version, MediBeng Whisper-Tiny, was trained on a code-switched Bengali-English dataset for use in clinical settings and is also shared under the Apache-2.0 license. See the LICENSE file for more details.
- You are free to use, modify, and distribute the model, as long as you comply with the conditions of the Apache License 2.0.
- You must provide attribution, including a reference to this model card and the repository when using or distributing the model.
- You cannot use the model for unlawful purposes or in any manner that infringes on the rights of others.
For more details, please review the full Apache License 2.0.
If you use MediBeng Whisper-Tiny or the MediBeng dataset for your research or project, please cite the following:
The preprint is available on medRxiv.
@article{ghosh2025medibeng,
title={MediBeng Whisper Tiny: A fine-tuned code-switched Bengali-English translator for clinical applications},
author={Ghosh, Promila and Talukder, Sunipun},
journal={medRxiv},
year={2025},
doi={10.1101/2025.04.25.25326406},
url={https://www.medrxiv.org/content/10.1101/2025.04.25.25326406v1}
}
@misc{promila_ghosh_2025,
author = { Promila Ghosh },
title = { MediBeng (Revision b05b594) },
year = 2025,
url = { https://huggingface.co/datasets/pr0mila-gh0sh/MediBeng },
doi = { 10.57967/hf/5187 },
publisher = { Hugging Face }
}