BioELMo is a biomedical version of Embeddings from Language Models (ELMo), pre-trained on PubMed abstracts. Pre-training uses 10M recent PubMed abstracts (2.46B tokens in total), and BioELMo achieves an average forward and backward perplexity of 31.37 on a held-out test set. As shown in our paper, BioELMo effectively encodes biomedical entity-type and relational information.
You can use BioELMo as a fixed-feature extractor for downstream tasks (a minimal usage sketch follows the list) with these weights:
- BioELMo weights
- options file
- vocabulary file (the 1M most frequent tokens from the pre-training corpus; for downstream tasks, you can use your own vocabulary)
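As a minimal sketch of feature extraction, assuming the `ElmoEmbedder` interface from allennlp 0.x and hypothetical local file names for the weights and options files above:

```python
from allennlp.commands.elmo import ElmoEmbedder

# Hypothetical local paths: substitute the downloaded BioELMo files.
embedder = ElmoEmbedder(
    options_file="bioelmo_options.json",
    weight_file="bioelmo_weights.hdf5",
    cuda_device=-1,  # set to a GPU id if one is available
)

tokens = ["The", "mutation", "activates", "EGFR", "signaling", "."]
# embed_sentence returns a (3, num_tokens, 1024) array: one set of
# activations per biLM layer (char-CNN token layer + two biLSTM layers).
vectors = embedder.embed_sentence(tokens)
print(vectors.shape)  # (3, 6, 1024)
```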
You can further fine-tune BioELMo on other corpora using the TensorFlow checkpoint; see https://github.com/allenai/bilm-tf for details. In general, you use BioELMo the same way you use ELMo.
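For example, here is a sketch of plugging BioELMo into a downstream PyTorch model via the allennlp `Elmo` module, which learns a task-specific weighted average of the biLM layers (file names are again hypothetical placeholders):

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Hypothetical local paths: substitute the downloaded BioELMo files.
elmo = Elmo(
    options_file="bioelmo_options.json",
    weight_file="bioelmo_weights.hdf5",
    num_output_representations=1,  # one scalar-mixed representation
    dropout=0.0,
)

sentences = [
    ["Tamoxifen", "inhibits", "ER", "signaling", "."],
    ["BRCA1", "is", "a", "tumor", "suppressor", "."],
]
character_ids = batch_to_ids(sentences)  # (batch, max_len, 50) char ids

output = elmo(character_ids)
embeddings = output["elmo_representations"][0]  # (batch, max_len, 1024)
mask = output["mask"]                           # (batch, max_len)
# Feed `embeddings` into your task model as contextual word vectors.
```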
Please visit https://github.com/Andy-jqa/probing_biomed_embeddings (currently under construction) for the code of the probing experiments described in our paper.
Please cite the following paper if you use BioELMo:
@inproceedings{jin2019probing,
  title={Probing Biomedical Embeddings from Language Models},
  author={Jin, Qiao and Dhingra, Bhuwan and Cohen, William and Lu, Xinghua},
  booktitle={Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP},
  pages={82--89},
  year={2019}
}