[ English | 中文 ]
🤗 Hugging Face | 🤖 ModelScope | 🖥️ Demo | 🗂️ Data | 📃 Paper | WeChat (微信)
Taiyi (太一): A Bilingual (Chinese and English) Fine-Tuned Large Language Model for Diverse Biomedical Tasks
Project Background
With the rapid development of deep learning, large language models (LLMs) such as ChatGPT and DeepSeek have made significant progress in natural language processing. In the biomedical domain, LLMs can facilitate communication between doctors and patients, provide useful medical information, and hold great potential in areas such as clinical decision support, biomedical knowledge discovery, drug development, and personalized treatment planning. This project therefore focuses on developing a multilingual, multi-task large language model tailored to diverse biomedical scenarios, aiming for high performance with low resource consumption. In October 2023, we released the initial version of our bilingual (Chinese and English) biomedical large language model, Taiyi. Research has continued since then, and development of Taiyi 2 is now complete; the model is open-sourced here.
Compared to Taiyi 1, Taiyi 2 introduces improvements to the model backbone, the instruction data, and the task-specific instruction design. The main updates are as follows:
- Updated Backbone: Taiyi 2 replaces the original Qwen-7B backbone with GLM4-9B.
- High-Quality Data Filtering: Based on dataset annotation guidelines, data quality has been further refined by removing low-quality samples. Additionally, the data distribution across different tasks has been rebalanced to address extreme imbalances.
- Refined Task Instructions: Tasks are grouped by type, and different instruction construction methods were evaluated experimentally, leading to a refined, task-optimized instruction design strategy (an illustrative sample format is sketched after this list).
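For illustration only, a single training sample under such a task-specific instruction scheme might look like the following. This is a hypothetical sketch: the field names (`instruction`, `input`, `output`) follow a common supervised fine-tuning convention and are an assumption, not the project's confirmed schema.

```python
# Hypothetical instruction-tuning sample for a biomedical NER task.
# The schema and prompt wording are illustrative assumptions,
# not Taiyi 2's actual released format.
sample = {
    "instruction": (
        "Extract all chemical and disease entities from the input text "
        "and return them as (entity, type) pairs."
    ),
    "input": "Aspirin-induced asthma can be triggered by other NSAIDs.",
    "output": '[("Aspirin", "Chemical"), ("asthma", "Disease")]',
}
```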
Taiyi 2 was evaluated on 13 biomedical benchmark datasets.
On these biomedical datasets, the experimental results show that:
- Taiyi 2 achieves an average performance improvement of approximately 9% over Taiyi 1.
- Compared to general-domain models such as GPT-3.5 and the distilled version of DeepSeek-14B, Taiyi 2 shows an average improvement of around 25%.
- Taiyi 2 achieves competitive results comparable to the current state-of-the-art domain-specific models.
Detailed metrics are presented in the table below:
| Task Type | Dataset | Taiyi 1 | Taiyi 2 | GPT-3.5 | DeepSeek-14B | SOTA |
|---|---|---|---|---|---|---|
| NER (Micro-F1) | BC5CDR-Chem | 80.2 | 90.2 | 60.3 | 42.3 | 93.3 (PubMedBERT) |
| | BC5CDR-Dise | 69.1 | 78.3 | 51.8 | 41.1 | 85.6 (PubMedBERT) |
| | CHEMDNER | 79.9 | 90.5 | 36.5 | 43.3 | 92.4 (BioBERT) |
| | NCBIdisease | 73.1 | 82.6 | 50.5 | 32.8 | 87.8 (PubMedBERT) |
| | CMeEE-dev | 65.7 | 74.1 | 47.0 | 42.4 | 74.0 (CBLUE) |
| RE (Micro-F1) | BC5CDR | 37.5 | 42.4 | 14.2 | 28.6 | 45.0 (BioGPT) |
| | CMeIE-dev | 43.2 | 50.3 | 30.6 | 4.5 | 54.9 (CBLUE) |
| TC (Micro-F1) | BC7LitCovid | 84.0 | 90.2 | 63.9 | 32.9 | 91.8 (Bioformer) |
| | HOC | 80.0 | 84.6 | 51.2 | 41.9 | 82.3 (PubMedBERT) |
| | KUAKE_QIC-dev | 77.4 | 80.4 | 48.5 | 47.5 | 85.9 (CBLUE) |
| QA (Accuracy) | PubMedQA | 54.4 | 58.8 | 76.5 | 46.4 | 73.4 |
| | MedQA-USMLE | 37.1 | 58.4 | 51.3 | 66.9 | 42.0 |
| | MedQA-MCMLE | 64.8 | 88.1 | 58.2 | 53.2 | 70.1 (RoBERTa-large) |
| All | Average | 65.1 | 74.5 | 49.3 | 40.3 | 75.3 |
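For reference, the Micro-F1 reported for the NER, RE, and TC tasks pools true positives, false positives, and false negatives across the whole corpus before computing precision, recall, and F1. The sketch below is a generic illustration over sets of gold and predicted items; the official scorers of these benchmarks may use different matching rules (e.g., for span boundaries or entity normalization).

```python
def micro_f1(gold: list[set], pred: list[set]) -> float:
    """Micro-averaged F1: pool TP/FP/FN across all documents, then score.

    Each element of `gold`/`pred` is the set of (entity, type) tuples for
    one document. Exact matching is assumed for illustration.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy check: two gold entities, one correct prediction -> F1 = 2/3.
print(micro_f1([{("Aspirin", "Chemical"), ("asthma", "Disease")}],
               [{("Aspirin", "Chemical")}]))
```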
The environment configuration we used for training and testing is as follows:
```
torch==2.4.0
ms_swift==2.6.1
transformers==4.44.0
transformers-stream-generator==0.0.5
vllm==0.6.0
vllm-flash-attn==2.6.1
```
To install all dependencies automatically, run:

```bash
pip install -r requirements.txt
```
For a quick start on model inference, refer to the taiyi2_chat.py file. Using a GPU is recommended for faster inference.
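If taiyi2_chat.py is not at hand, inference can be sketched with the Hugging Face transformers API as follows. This is a minimal sketch under stated assumptions: the model ID `DUTIR-BioNLP/Taiyi2-LLM` is a placeholder (check the project's Hugging Face or ModelScope pages for the actual checkpoint name), and a GLM4-style chat template is assumed to ship with the tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID -- replace with the actual Taiyi 2 checkpoint on
# Hugging Face / ModelScope, or with a local path.
MODEL_ID = "DUTIR-BioNLP/Taiyi2-LLM"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # fall back to float16/float32 if unsupported
    trust_remote_code=True,
).to(device).eval()

# GLM4-style chat checkpoints ship a chat template with the tokenizer.
messages = [
    {"role": "user",
     "content": "Extract all chemical entities from: Aspirin inhibits COX-1."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```

The same weights can presumably also be served with vLLM (pinned in the environment above) for higher-throughput inference.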
Taiyi 2 was developed by the Dalian University of Technology Information Retrieval Research Laboratory (DUTIR).
Supervisors: Ling Luo, Jian Wang, Yuanyuan Sun, Hongfei Lin
Student Members: Zhijun Wang, Jiewei Qi, Juntao Li, Tengxiao Lv, Chao Liu, Haobin Yuan
The work of this project has been inspired and assisted by the following open-source projects and technologies. We would like to express our gratitude to the developers and contributors of these projects, including but not limited to:
- GLM: https://github.com/THUDM/GLM-4
- SWIFT: https://github.com/modelscope/ms-swift
- BigBIO: https://github.com/bigscience-workshop/biomedical
- PromptCBLUE: https://github.com/michael-wzhu/PromptCBLUE
- The Taiyi logo was generated by ERNIE Bot
The resources of this project are for academic research purposes only and are strictly prohibited from commercial use. Use of the source code in this repository is governed by the Apache 2.0 open-source license. Before using the model, please carefully read and comply with the following statements:
- Ensure that the content you input does not infringe on the rights and interests of others, does not involve harmful information, does not contain any content related to politics, violence, or pornography, and is legal and compliant.
- Be aware that all content produced with the Taiyi models is generated by artificial intelligence and may be inaccurate or incomplete. This project does not guarantee the accuracy, completeness, or fitness for purpose of the generated content, nor does it assume any legal responsibility for it.
- Any model responses that violate laws, regulations, public order, or good customs do not represent the attitude, viewpoint, or stance of this project. We will continuously improve the model's responses to make them better conform to social ethics and moral norms.
- Users bear their own risks and responsibilities for any content output by the model. This project assumes no legal responsibility and shall not be liable for any losses that may arise from the use of the related resources and output results.
- Third-party links or libraries appearing in this project are provided for convenience only; their content and viewpoints are unrelated to this project. Users should exercise their own judgment when using them, and this project assumes no joint liability.
- If you discover any significant errors in this project, please give us feedback so that we can fix them in a timely manner.

By using this project, you confirm that you have carefully read, understood, and agreed to abide by the above disclaimers. This project reserves the right to modify this statement without prior notice.
If you use the resources of this repository, please cite our paper:
```bibtex
@article{Taiyi,
  title   = {{Taiyi}: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks},
  author  = {Ling Luo and Jinzhong Ning and Yingwen Zhao and Zhijun Wang and Zeyuan Ding and Peng Chen and Weiru Fu and Qinyu Han and Guangtao Xu and Yunzhi Qiu and Dinghao Pan and Jiru Li and Hao Li and Wenduo Feng and Senbo Tu and Yuqi Liu and Zhihao Yang and Jian Wang and Yuanyuan Sun and Hongfei Lin},
  journal = {Journal of the American Medical Informatics Association},
  year    = {2024},
  doi     = {10.1093/jamia/ocae037},
  url     = {https://doi.org/10.1093/jamia/ocae037}
}
```