Home
Unfortunately, we cannot provide the corpora because of copyright restrictions. The PubMed abstracts can be downloaded from https://www.ncbi.nlm.nih.gov/pubmed. The MIMIC-III Clinical Database can be downloaded from https://physionet.org/works/MIMICIIIClinicalDatabase/access.shtml.
BioWordVec is in the binary word2vec C format. One way to read the model is with gensim. The following example is copied from the gensim website:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(filename, binary=True)
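A model loaded from the word2vec binary format has a fixed vocabulary, so indexing it with an unseen token raises a `KeyError`. The sketch below shows one way to guard lookups. The helper name `lookup_vector` and the toy dictionary are hypothetical and are not part of BioWordVec or gensim; the helper relies only on the `in` and `[]` operators, which gensim's `KeyedVectors` supports.

```python
def lookup_vector(model, word):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    # Works with any mapping-like object that supports `in` and `[]`,
    # which includes gensim's KeyedVectors.
    if word in model:
        return model[word]
    return None


# Hypothetical stand-in for a loaded model, so the sketch runs
# without the large binary file.
toy_model = {"protein": [0.1, 0.2], "gene": [0.3, 0.4]}

print(lookup_vector(toy_model, "protein"))
print(lookup_vector(toy_model, "unseen-token"))
```

With a real model, pass the `KeyedVectors` object returned by `load_word2vec_format` in place of `toy_model`. Because the model file is large, the `limit` argument of `load_word2vec_format` can be used to load only the first N vectors while experimenting.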
BioSentVec is built upon sent2vec. To infer sentence embeddings, please see the "Directly from python" section of the sent2vec documentation. The following example is copied from their website:
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")
embs = model.embed_sentences(["first sentence .", "another sentence"])
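A common next step is to compare two sentence embeddings by cosine similarity. The following is a minimal sketch in plain Python; the function name `cosine_similarity` is an illustrative choice and is not part of sent2vec. It assumes each input is a flat sequence of floats, so if the embeddings come back as a 2-D array with one row per sentence, pass a single row such as `embs[0]`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; return 0.0 rather than divide by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for two rows of `embs`.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With the model above, a call would look like `cosine_similarity(embs[0], embs[1])`.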
The preprocessing methods can be found in the src folder. In general, the text was first tokenized using NLTK and then lowercased.
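The two steps described above can be sketched as follows. This is an approximation under stated assumptions, not the code in the src folder: the function name `preprocess` and its `tokenize` parameter are hypothetical, and the default tokenizer, NLTK's `TreebankWordTokenizer`, is an assumed choice made because it needs no extra data download. The source only says that NLTK was used.

```python
def preprocess(text, tokenize=None):
    """Tokenize `text`, lowercase each token, and rejoin with spaces."""
    if tokenize is None:
        # Assumed default; the original scripts may use a different
        # NLTK tokenizer.
        from nltk.tokenize import TreebankWordTokenizer
        tokenize = TreebankWordTokenizer().tokenize
    tokens = tokenize(text)
    # Lowercase after tokenization, matching the order described above.
    return " ".join(token.lower() for token in tokens)


# Whitespace splitting is used here only so the example runs without NLTK.
print(preprocess("Breast Cancer Risk", tokenize=str.split))
```

Calling `preprocess(text)` without a `tokenize` argument uses NLTK. The returned string is space-separated, which is the input form expected by `embed_sentence` in the example above.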
The bash scripts can be found in the src folder.
@article{chen2018biosentvec,
title={BioSentVec: creating sentence embeddings for biomedical texts},
author={Chen, Qingyu and Peng, Yifan and Lu, Zhiyong},
journal={arXiv preprint arXiv:1810.09302},
year={2018}
}