This is the official repository for Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, accepted for CIKM 2023.
The study adapts BERT-based Entity Linking (BLINK) to identify mentions that do not have corresponding KB entities by matching them to a special NIL entity, with NIL entity representation and classification, and synonym enhancement.
The study also applies KB Pruning and Versioning strategies to automatically construct out-of-KB datasets from common in-KB Entity Linking datasets. Please see the model training and data construction scripts below.
Note: we noticed some Dependabot alerts from GitHub related to the previous versions of libraries (Transformers, PyTorch, NLTK, and Flair, as in requirements.txt), but we have limited bandwidth to resolve them for this research-based project. Please be aware of this when you are using the project.
See step_all_BLINK.sh for running BLINK models with Threshold-based and NIL-rep-based methods.
See step_all_BLINKout.sh for running BLINKout models and the dynamic feature baseline.
See step_all_BM25+cross-enc.sh for all BM25+BERT models.
For all scripts above:
- setting dataset (and mm_onto_ver_model_mark for MedMentions)
- setting bi_enc_bertmodel and cross_enc_bertmodel (and change further_model_mark accordingly)
- setting train_bi (except BM25), rep_ents, train_cross, inference to true to perform each step.
- setting use_best_top_k as true if using tuned top-k, otherwise using default
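As a rough sketch of the settings listed above, the fragment below shows how the switches might look when only the bi-encoder stages are enabled. The variable names come from this README; every value is a placeholder, and the `enabled_steps` loop is a hypothetical helper that only reports which steps are switched on. It is not part of the repository scripts.

```bash
# Placeholder values only; check the script itself for the values it accepts.
dataset="my_dataset"                 # placeholder, not a real dataset identifier
bi_enc_bertmodel="path/to/bert"      # placeholder model name or path
cross_enc_bertmodel="path/to/bert"   # placeholder model name or path

# One switch per pipeline step; set a switch to true to perform that step.
train_bi=true
rep_ents=true
train_cross=false
inference=false

# true = tuned top-k, false = default top-k
use_best_top_k=false

# Hypothetical helper: collect the names of the steps that are enabled.
enabled_steps=""
for step in train_bi rep_ents train_cross inference; do
  eval "value=\$$step"
  if [ "$value" = "true" ]; then
    enabled_steps="${enabled_steps}${step} "
  fi
done
echo "steps to run: ${enabled_steps}"
```

Listing the enabled steps before a long run is a cheap way to catch a switch left on or off by mistake.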
For step_all_BLINK.sh, further
- setting use_NIL_threshold to true when using the Threshold-based approach (and the corresponding th2 as threshold value for each dataset)
- setting use_NIL_ranking to true when using the NIL-rep-based approach (and setting NIL representation binary parameters of use_NIL_tag, use_NIL_desc, and use_NIL_desc_tag)
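The two BLINK options above could be selected along the following lines. This is a sketch under assumptions: `nil_method` is a hypothetical selector that does not exist in the repository, the `th2` value and the true/false choices for the NIL representation parameters are placeholders, and it assumes only one approach is enabled per run.

```bash
# Hypothetical selector (not in the repository): "threshold" or "nil_rep".
nil_method="threshold"

# Start with both approaches disabled.
use_NIL_threshold=false
use_NIL_ranking=false

case "$nil_method" in
  threshold)
    use_NIL_threshold=true
    th2="0.5"                # placeholder; use the threshold chosen for your dataset
    ;;
  nil_rep)
    use_NIL_ranking=true
    use_NIL_tag=true         # placeholder choice for the NIL representation
    use_NIL_desc=false       # placeholder choice
    use_NIL_desc_tag=false   # placeholder choice
    ;;
  *)
    echo "unknown nil_method: $nil_method" >&2
    exit 1
    ;;
esac

echo "use_NIL_threshold=$use_NIL_threshold use_NIL_ranking=$use_NIL_ranking"
```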
For step_all_BLINKout.sh, further
- setting NIL representation binary parameters of use_NIL_tag, use_NIL_desc, and use_NIL_desc_tag.
- setting dynamic_emb_extra_ft_baseline to true and select the corresponding line (around 273-274) to use either the NIL regulariser (gu2021) or the dynamic feature baseline (full-features-NIL-infer), also setting the value of lambda_NIL.
For step_all_BM25+cross-enc.sh
- requiring the tokenizer of the saved biencoder model, so run step_all_BLINK.sh with the same biencoder model first before running this script.
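The ordering requirement above can be sketched as a small wrapper. `run_in_order` is a hypothetical helper, not part of the repository: it checks that every script exists, then runs them one after another and stops at the first failure. It assumes the scripts are launched with `bash` from the repository root.

```bash
# Hypothetical helper: run the given scripts in order, stopping on the first failure.
run_in_order() {
  # Check that all scripts exist before running anything.
  for script in "$@"; do
    if [ ! -f "$script" ]; then
      echo "missing script: $script" >&2
      return 1
    fi
  done
  # Run each script in the order given.
  for script in "$@"; do
    bash "$script" || return 1
  done
}

# Example call (commented out so the sketch has no side effects):
# run_in_order step_all_BLINK.sh "step_all_BM25+cross-enc.sh"
```

The BM25 script name is quoted in the example call for readability, since it contains a `+` character.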
Link to out-of-KB mention discovery datasets: https://zenodo.org/record/8228371.
We acknowledge the sources below for data construction:
- ShARe/CLEF 2013 dataset is from https://physionet.org/content/shareclefehealth2013/1.0/
- MedMentions dataset is from https://github.com/chanzuckerberg/MedMentions
- UMLS (versions 2012AB, 2014AB, 2017AA) is from https://www.nlm.nih.gov/research/umls/index.html
- SNOMED CT (corresponding versions) is from https://www.nlm.nih.gov/healthit/snomedct/index.html
- NILK dataset is from https://zenodo.org/record/6607514
- WikiData 2017 dump is from https://archive.org/download/enwiki-20170220/enwiki-20170220-pages-articles.xml.bz2
See the files under the preprocessing folder; the commands for creating the datasets are in run_preprocess_ents_and_data.sh.
The repository is based on BLINK under the MIT license. Also, we acknowledge the data sources above.