[ English | 中文 ]
🤗 Hugging Face | 🤖 ModelScope | 🖥️ Demo | 🗂️ Data | 📃 Paper | WeChat (微信)
Taiyi (太一): A Bilingual (Chinese and English) Fine-Tuned Large Language Model for Diverse Biomedical Tasks
Project Background
With the rapid development of deep learning, large language models (LLMs) such as ChatGPT and DeepSeek have made significant progress in natural language processing. In the biomedical domain, LLMs can facilitate communication between doctors and patients, provide useful medical information, and hold great potential in areas such as clinical decision support, biomedical knowledge discovery, drug development, and personalized treatment planning. This project therefore focuses on developing a multilingual, multi-task large language model tailored to diverse biomedical scenarios, aiming for high performance with low resource consumption. In October 2023, we released the initial version of our bilingual (Chinese and English) biomedical large language model, Taiyi. Research has continued since then, and development of Taiyi 2 is now complete; the model is open-sourced here.
Compared to Taiyi 1, Taiyi 2 introduces improvements to the model backbone, the instruction data, and the task-specific instruction design. The main updates are as follows:
- Updated Backbone: Taiyi 2 replaces the original Qwen-7B backbone with GLM4-9B.
- High-Quality Data Filtering: Based on dataset annotation guidelines, data quality has been further refined by removing low-quality samples. Additionally, the data distribution across different tasks has been rebalanced to address extreme imbalances.
- Refined Task Instructions: Tasks are grouped by type, and different instruction construction methods were evaluated experimentally, leading to a refined, task-optimized instruction design strategy (an illustrative sample format is sketched after this list).
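For illustration only, a single training sample under such a task-specific instruction scheme might look like the following. This is a hypothetical sketch: the field names (`instruction`, `input`, `output`) follow a common supervised fine-tuning convention and are an assumption, not the project's confirmed schema.

```python
# Hypothetical instruction-tuning sample for a biomedical NER task.
# The schema and prompt wording are illustrative assumptions,
# not Taiyi 2's actual released format.
sample = {
    "instruction": (
        "Extract all chemical and disease entities from the input text "
        "and return them as (entity, type) pairs."
    ),
    "input": "Aspirin-induced asthma can be triggered by other NSAIDs.",
    "output": '[("Aspirin", "Chemical"), ("asthma", "Disease")]',
}
```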
Taiyi 2 was evaluated on 13 biomedical benchmark datasets.
On these biomedical datasets, the experimental results show that:
- Taiyi 2 achieves an average performance improvement of approximately 9% over Taiyi 1.
- Compared to general-domain models such as GPT-3.5 and the distilled version of DeepSeek-14B, Taiyi 2 shows an average improvement of around 25%.
- Taiyi 2 achieves competitive results comparable to the current state-of-the-art domain-specific models.
Detailed metrics are presented in the table below:
| Task Type | Dataset | Taiyi 1 | Taiyi 2 | GPT-3.5 | DeepSeek-14B | SOTA |
|---|---|---|---|---|---|---|
| NER (Micro-F1) | BC5CDR-Chem | 80.2 | 90.2 | 60.3 | 42.3 | 93.3 (PubMedBERT) |
| | BC5CDR-Dise | 69.1 | 78.3 | 51.8 | 41.1 | 85.6 (PubMedBERT) |
| | CHEMDNER | 79.9 | 90.5 | 36.5 | 43.3 | 92.4 (BioBERT) |
| | NCBIdisease | 73.1 | 82.6 | 50.5 | 32.8 | 87.8 (PubMedBERT) |
| | CMeEE-dev | 65.7 | 74.1 | 47.0 | 42.4 | 74.0 (CBLUE) |
| RE (Micro-F1) | BC5CDR | 37.5 | 42.4 | 14.2 | 28.6 | 45.0 (BioGPT) |
| | CMeIE-dev | 43.2 | 50.3 | 30.6 | 4.5 | 54.9 (CBLUE) |
| TC (Micro-F1) | BC7LitCovid | 84.0 | 90.2 | 63.9 | 32.9 | 91.8 (Bioformer) |
| | HOC | 80.0 | 84.6 | 51.2 | 41.9 | 82.3 (PubMedBERT) |
| | KUAKE_QIC-dev | 77.4 | 80.4 | 48.5 | 47.5 | 85.9 (CBLUE) |
| QA (Accuracy) | PubMedQA | 54.4 | 58.8 | 76.5 | 46.4 | 73.4 |
| | MedQA-USMLE | 37.1 | 58.4 | 51.3 | 66.9 | 42.0 |
| | MedQA-MCMLE | 64.8 | 88.1 | 58.2 | 53.2 | 70.1 (RoBERTa-large) |
| All | Average | 65.1 | 74.5 | 49.3 | 40.3 | 75.3 |
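For reference, the Micro-F1 reported for the NER, RE, and TC tasks pools true positives, false positives, and false negatives across the whole corpus before computing precision, recall, and F1. The sketch below is a generic illustration over sets of gold and predicted items; the official scorers of these benchmarks may use different matching rules (e.g., for span boundaries or entity normalization).

```python
def micro_f1(gold: list[set], pred: list[set]) -> float:
    """Micro-averaged F1: pool TP/FP/FN across all documents, then score.

    Each element of `gold`/`pred` is the set of (entity, type) tuples for
    one document. Exact matching is assumed for illustration.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy check: two gold entities, one correct prediction -> F1 = 2/3.
print(micro_f1([{("Aspirin", "Chemical"), ("asthma", "Disease")}],
               [{("Aspirin", "Chemical")}]))
```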
The environment configuration we used for training and testing is as follows:
```
torch==2.4.0
ms_swift==2.6.1
transformers==4.44.0
transformers-stream-generator==0.0.5
vllm==0.6.0
vllm-flash-attn==2.6.1
```
To install all dependencies automatically, run:

```bash
pip install -r requirements.txt
```
For a quick start on model inference, refer to the taiyi2_chat.py file. Using a GPU is recommended for faster inference.
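If taiyi2_chat.py is not at hand, inference can be sketched with the Hugging Face transformers API as follows. This is a minimal sketch under stated assumptions: the model ID `DUTIR-BioNLP/Taiyi2-LLM` is a placeholder (check the project's Hugging Face or ModelScope pages for the actual checkpoint name), and a GLM4-style chat template is assumed to ship with the tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID -- replace with the actual Taiyi 2 checkpoint on
# Hugging Face / ModelScope, or with a local path.
MODEL_ID = "DUTIR-BioNLP/Taiyi2-LLM"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # fall back to float16/float32 if unsupported
    trust_remote_code=True,
).to(device).eval()

# GLM4-style chat checkpoints ship a chat template with the tokenizer.
messages = [
    {"role": "user",
     "content": "Extract all chemical entities from: Aspirin inhibits COX-1."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```

The same weights can presumably also be served with vLLM (pinned in the environment above) for higher-throughput inference.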
Taiyi 2 was developed by the Dalian University of Technology Information Retrieval Research Laboratory (DUTIR).
Supervisors: Ling Luo, Jian Wang, Yuanyuan Sun, Hongfei Lin
Student Members: Zhijun Wang, Jiewei Qi, Juntao Li, Tengxiao Lv, Chao Liu, Haobin Yuan
The work of this project has been inspired and assisted by the following open-source projects and technologies. We would like to express our gratitude to the developers and contributors of these projects, including but not limited to:
- GLM: https://github.com/THUDM/GLM-4
- SWIFT: https://github.com/modelscope/ms-swift
- BigBIO: https://github.com/bigscience-workshop/biomedical
- PromptCBLUE: https://github.com/michael-wzhu/PromptCBLUE
- The Taiyi logo was generated by ERNIE Bot
The resources of this project are for academic research purposes only and are strictly prohibited from commercial use. Use of the source code in this repository is governed by the Apache 2.0 open-source license. Before using the model, please carefully read and comply with the following statements:
- Ensure that the content you input does not infringe on the rights and interests of others, does not involve harmful information, does not contain any content related to politics, violence, or pornography, and is legal and compliant.
- Be aware that all content produced with the Taiyi models is generated by artificial intelligence and may be inaccurate or incomplete. This project does not guarantee the accuracy, completeness, or fitness for purpose of the generated content, nor does it assume any legal responsibility for it.
- Any model responses that violate laws, regulations, public order, or good customs do not represent the attitude, viewpoint, or stance of this project. We will continuously improve the model's responses to make them better conform to social ethics and moral norms.
- Users bear their own risks and responsibilities for any content output by the model. This project assumes no legal responsibility and shall not be liable for any losses that may arise from the use of the related resources and output results.
- Third-party links or libraries appearing in this project are provided for convenience only; their content and viewpoints are unrelated to this project. Users should exercise their own judgment when using them, and this project assumes no joint liability.
- If you discover any significant errors in this project, please give us feedback so that we can fix them in a timely manner.

By using this project, you confirm that you have carefully read, understood, and agreed to abide by the above disclaimers. This project reserves the right to modify this statement without prior notice.
If you use the resources of this repository, please cite our paper:
```bibtex
@article{Taiyi,
  title   = {{Taiyi}: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks},
  author  = {Ling Luo and Jinzhong Ning and Yingwen Zhao and Zhijun Wang and Zeyuan Ding and Peng Chen and Weiru Fu and Qinyu Han and Guangtao Xu and Yunzhi Qiu and Dinghao Pan and Jiru Li and Hao Li and Wenduo Feng and Senbo Tu and Yuqi Liu and Zhihao Yang and Jian Wang and Yuanyuan Sun and Hongfei Lin},
  journal = {Journal of the American Medical Informatics Association},
  year    = {2024},
  doi     = {10.1093/jamia/ocae037},
  url     = {https://doi.org/10.1093/jamia/ocae037}
}
```