# Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus
This repository contains the code for generating the training data, and for training and evaluating the sapBERT-style fine-tuned Dutch biomedical entity linking model presented in the paper. The model is built in three stages:
- a RoBERTa-based base model trained from scratch on Dutch hospital notes (medRoBERTa.nl),
- second-phase pretrained with self-alignment on a UMLS-derived Dutch biomedical ontology (a sketch of this step follows below),
- and finally fine-tuned on an automatically generated, weakly labelled corpus derived from Wikipedia (WALVIS).

Evaluation results on the Mantra GSC corpus can be found in the report.
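For illustration, below is a minimal sketch of what a single self-alignment pretraining step could look like. This is not the repository's actual training loop: the `CLTL/MedRoBERTa.nl` checkpoint id, the toy synonym pairs, and the multi-similarity loss hyperparameters (taken from the original sapBERT setup) are assumptions. The idea is that names sharing a UMLS CUI are pulled together in embedding space; in the real pipeline the synonym pairs come from the UMLS-derived Dutch ontology built in `1_enhance_UMLS`.

```python
# Hypothetical sketch of one self-alignment step, not the repository's code.
import torch
from transformers import AutoTokenizer, AutoModel
from pytorch_metric_learning import losses

# Assumed HuggingFace checkpoint for medRoBERTa.nl.
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModel.from_pretrained("CLTL/MedRoBERTa.nl")

# Multi-similarity loss with the hyperparameters used in the sapBERT paper.
loss_fn = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)

# Toy batch: two Dutch synonyms per concept; labels are integer CUI ids,
# so names with the same label are treated as positives.
names = ["hartinfarct", "myocardinfarct", "suikerziekte", "diabetes mellitus"]
labels = torch.tensor([0, 0, 1, 1])

batch = tokenizer(names, padding=True, return_tensors="pt")
embeddings = model(**batch).last_hidden_state[:, 0]  # [CLS] embeddings
loss = loss_fn(embeddings, labels)
loss.backward()  # an optimiser step would follow in a full training loop
```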
The code for enhancing the UMLS and creating a biomedical ontology for biomedical entity linking (`1_enhance_UMLS`) is forked from the Dutch-medical-concepts repository by the UMCU. The code for self-alignment pretraining and fine-tuning is largely re-used from the code base of the original sapBERT paper.
To enhance the UMLS, licenses for both UMLS and SNOMED NL must be requested.
The ONTOLOGY-browser is a minimal Flask-based tool for browsing and comparing UMLS entries.
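As a rough illustration of the kind of tool this is, the sketch below shows a hypothetical Flask route that serves the names registered under a given CUI. The route, the in-memory toy ontology, and the JSON response format are assumptions for the sake of the example, not the actual ONTOLOGY-browser code:

```python
# Hypothetical minimal Flask app in the spirit of the ONTOLOGY-browser.
from flask import Flask, jsonify

app = Flask(__name__)

# Toy in-memory ontology: CUI -> Dutch names. The real tool would load the
# UMLS-derived ontology produced by the 1_enhance_UMLS step instead.
ONTOLOGY = {
    "C0027051": ["hartinfarct", "myocardinfarct"],
    "C0011849": ["diabetes mellitus", "suikerziekte"],
}

@app.route("/cui/<cui>")
def show_cui(cui):
    """Return all names registered under a CUI, for side-by-side comparison."""
    names = ONTOLOGY.get(cui)
    if names is None:
        return jsonify({"error": f"unknown CUI {cui}"}), 404
    return jsonify({"cui": cui, "names": names})

if __name__ == "__main__":
    app.run(debug=True)
```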