export BIO_ELECTRA_HOME=/full/path/to/bio_electra/repository
For everything except the transformers package used for the BERT NER experiments, you need TensorFlow 1.15 and CUDA 10.0 for GPU support. For the BERT NER experiments on GPU, you need TensorFlow 2+ and CUDA 10.1 (i.e., another (virtual) machine) due to the requirements of the transformers Python library.
Ensure you have virtual environment support (e.g., on Ubuntu):
sudo apt-get install python3-venv
python3 -m venv --system-site-packages $BIO_ELECTRA_HOME/venv
source $BIO_ELECTRA_HOME/venv/bin/activate
pip install --upgrade pip
pip install tensorflow-gpu==1.15
pip install scikit-learn
pip install hyperopt
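Optionally, you can sanity-check the TF 1.15 environment before proceeding (this check is not part of the original setup; tf.test.is_gpu_available() should print True if CUDA 10.0 is set up correctly):
python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())"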
python3 -m venv --system-site-packages $BIO_ELECTRA_HOME/tf2_venv
source $BIO_ELECTRA_HOME/tf2_venv/bin/activate
pip install -U pip
pip install tensorflow-gpu==2.1
pip install transformers
pip install fastprogress
pip install seqeval
pip install torch torchvision
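Optionally, verify the TF2 environment as well; all three libraries should import cleanly and the GPU should appear in the device list:
python -c "import tensorflow as tf, transformers, torch; print(tf.__version__, transformers.__version__, torch.__version__); print(tf.config.list_physical_devices('GPU'))"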
The pre-trained Bio-ELECTRA and Bio-ELECTRA++ small ELECTRA models are available on Zenodo.
For pretraining, you need to prepare your corpus as text files with one sentence per line and documents separated by an empty line, and put them under a single directory. The corpus comprises all PubMed abstracts with PMID >= 10,000,000 (19.2 million abstracts).
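For illustration only (these sentences are invented), a prepared corpus file containing two documents looks like this:
The immune response was measured in all samples.
Expression levels varied across tissues.

The trial enrolled 120 patients.
Follow-up lasted five years.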
An example pretraining configuration is in the pmc_config.json.example file. Copy this file to pmc_config.json and adjust the full paths according to your system's directory structure. Pretraining takes 3 weeks on an RTX 2070 8GB GPU.
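Concretely, assuming the example file lives in $BIO_ELECTRA_HOME/electra (where the pretraining scripts are run from), the copy step is:
cd $BIO_ELECTRA_HOME/electra
cp pmc_config.json.example pmc_config.json
# edit pmc_config.json and replace the example paths with your system's full paths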
Afterwards, assuming all of the preprocessed abstract files are under $BIO_ELECTRA_HOME/electra/data/electra_pretraining/pmc_abstracts, you can run the following to generate the Bio-ELECTRA language representation model:
cd $BIO_ELECTRA_HOME/electra
./build_pmc_pretrain_dataset.sh
./pretrain_pmc_model.sh
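Since pretraining runs for weeks, you may prefer to launch the pretraining script detached from your terminal, for example:
nohup ./pretrain_pmc_model.sh > pretrain.log 2>&1 &
tail -f pretrain.log # monitor progress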
The corpus consists of all open access full-text papers from PubMed Central (PMC). You need around 500GB or more of free disk space for pretraining preprocessing and data generation.
An example pretraining configuration is in the pmc_config_v2.json.example file. Copy this file to pmc_config_v2.json and adjust the full paths according to your system's directory structure. Pretraining takes 3 weeks on an RTX 2070 8GB GPU.
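The same copy step applies here, again assuming the example file is in $BIO_ELECTRA_HOME/electra:
cp pmc_config_v2.json.example pmc_config_v2.json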
cd $BIO_ELECTRA_HOME/electra
./build_pmc_oai_full_pretrain_dataset.sh
./pretrain_pmc_model_v2.sh
All of the datasets are available at $BIO_ELECTRA_HOME/electra/data/finetuning_data.
train_bioasq_qa_baseline.sh # ELECTRA-Small++
train_bioasq_qa_pmc_1_8M.sh # Bio-ELECTRA
train_bioasq_qa_pmc_v2_3_6M.sh # Bio-ELECTRA++
After training, the results are stored under $BIO_ELECTRA_HOME/electra/data/models/electra_small/results/, $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/ and $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_v2_3_6_M/results/ for ELECTRA-Small++, Bio-ELECTRA and Bio-ELECTRA++, respectively.
For Bio-ELECTRA, copy the evaluation result files $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/bioasq_results.txt and $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/bioasq_results.pkl to the $BIO_ELECTRA_HOME/electra/pmc_results/qa_factoid/pmc_1_8M directory. Similarly, copy the corresponding files for ELECTRA-Small++ and Bio-ELECTRA++ from the $BIO_ELECTRA_HOME/electra/data/models directory to $BIO_ELECTRA_HOME/electra/pmc_results/qa_factoid/baseline and $BIO_ELECTRA_HOME/electra/pmc_results/qa_factoid/pmc_v2_3_6M, respectively.
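For example, the Bio-ELECTRA copy step can be scripted as follows (mkdir -p is needed only if the destination directory does not exist yet):
mkdir -p $BIO_ELECTRA_HOME/electra/pmc_results/qa_factoid/pmc_1_8M
cp $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/bioasq_results.txt \
   $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/bioasq_results.pkl \
   $BIO_ELECTRA_HOME/electra/pmc_results/qa_factoid/pmc_1_8M/
The yes/no and NER sections below follow the same copy pattern.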
Assuming the results are stored under $BIO_ELECTRA_HOME/electra/pmc_results/qa_factoid, the following will show the corresponding test results:
python show_qa_performance.py --mode baseline # ELECTRA-Small++
python show_qa_performance.py --mode bio-electra
python show_qa_performance.py --mode bio-electra++
The yes/no question classification training/testing data is available at $BIO_ELECTRA_HOME/electra/data/finetuning_data/yesno. This dataset has no development set.
train_yesno_baseline.sh # ELECTRA-Small++
train_yesno.sh # Bio-ELECTRA
train_yesno_v2_3_6M.sh # Bio-ELECTRA++
After training, the results are stored under $BIO_ELECTRA_HOME/electra/data/models/electra_small/results/, $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/ and $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_v2_3_6_M/results/ for ELECTRA-Small++, Bio-ELECTRA and Bio-ELECTRA++, respectively.
For Bio-ELECTRA, copy the evaluation result files $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/yesno_results.txt and $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/yesno_results.pkl to the $BIO_ELECTRA_HOME/electra/pmc_results/yesno/pmc_1_8M directory. Similarly, copy the corresponding files for ELECTRA-Small++ and Bio-ELECTRA++ from the $BIO_ELECTRA_HOME/electra/data/models directory to $BIO_ELECTRA_HOME/electra/pmc_results/yesno/baseline and $BIO_ELECTRA_HOME/electra/pmc_results/yesno/pmc_v2_3_6M, respectively.
Assuming the results are stored under $BIO_ELECTRA_HOME/electra/pmc_results/yesno, the following will show the Bio-ELECTRA, ELECTRA-Small++ and Bio-ELECTRA++ test results:
python yesno_perf_stats.py
The reranker training/testing data is available at $BIO_ELECTRA_HOME/data/bioasq_reranker. This dataset is annotated by a single annotator and has no development set.
./train_reranker_baseline.sh # ELECTRA-Small++
./train_reranker.sh # Bio-ELECTRA
./train_reranker_v2_3_6M.sh # Bio-ELECTRA++
./train_weighted_reranker_baseline.sh # ELECTRA-Small++
./train_weighted_reranker.sh # Bio-ELECTRA
./train_weighted_reranker_v2_3_6M.sh # Bio-ELECTRA++
./predict_reranker_baseline.sh # ELECTRA-Small++
./predict_reranker.sh # Bio-ELECTRA
./predict_reranker_v2_3_6M.sh # Bio-ELECTRA++
./predict_weighted_reranker_baseline.sh # ELECTRA-Small++
./predict_weighted_reranker.sh # Bio-ELECTRA
./predict_weighted_reranker_v2_3_6M.sh # Bio-ELECTRA++
python show_reranker_performance.py --mode baseline
python show_reranker_performance.py --mode bio-electra
python show_reranker_performance.py --mode bio-electra++
python show_reranker_performance.py --mode weighted-baseline
python show_reranker_performance.py --mode weighted-bio-electra
python show_reranker_performance.py --mode weighted-bio-electra++
./train_re_gad_baseline.sh # ELECTRA-Small++
./train_re_gad.sh # Bio-ELECTRA
./train_re_gad_v2_3_6M.sh # Bio-ELECTRA++
./train_re_chemprot_baseline.sh # ELECTRA-Small++
./train_re_chemprot.sh # Bio-ELECTRA
./train_re_chemprot_v2_3_6M.sh # Bio-ELECTRA++
python show_re_performance.py --mode gad-baseline
python show_re_performance.py --mode gad-bio-electra
python show_re_performance.py --mode gad-bio-electra++
python show_re_performance.py --mode chemprot-baseline
python show_re_performance.py --mode chemprot-bio-electra
python show_re_performance.py --mode chemprot-bio-electra++
The datasets are located under the $BIO_ELECTRA_HOME/electra/data/finetuning_data directory.
./train_bc4chemd_ner_baseline.sh # ELECTRA-Small++
./train_bc4chemd_ner.sh # Bio-ELECTRA
./train_bc4chemd_ner_v2_3_6M.sh # Bio-ELECTRA++
./train_bc2gm_ner_baseline.sh # ELECTRA-Small++
./train_bc2gm_ner.sh # Bio-ELECTRA
./train_bc2gm_ner_v2_3_6M.sh # Bio-ELECTRA++
./train_linnaeus_ner_baseline.sh # ELECTRA-Small++
./train_linnaeus_ner.sh # Bio-ELECTRA
./train_linnaeus_ner_v2_3_6M.sh # Bio-ELECTRA++
./train_ncbi_disease_ner_baseline.sh # ELECTRA-Small++
./train_ncbi_disease_ner.sh # Bio-ELECTRA
./train_ncbi_disease_ner_v2_3_6M.sh # Bio-ELECTRA++
After training, the results are stored under $BIO_ELECTRA_HOME/electra/data/models/electra_small/results/, $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/ and $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_v2_3_6_M/results/ for ELECTRA-Small++, Bio-ELECTRA and Bio-ELECTRA++, respectively.
For the Bio-ELECTRA bc4chemd NER dataset, copy the evaluation result files $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/bc4chemd_results.txt and $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/bc4chemd_results.pkl to the $BIO_ELECTRA_HOME/electra/pmc_results/ner/pmc_1_8M/bc4chemd directory.
Similarly, copy the corresponding files for ELECTRA-Small++ and Bio-ELECTRA++ from the $BIO_ELECTRA_HOME/electra/data/models directory to $BIO_ELECTRA_HOME/electra/pmc_results/ner/baseline/bc4chemd and $BIO_ELECTRA_HOME/electra/pmc_results/ner/pmc_v2_3_6M/bc4chemd, respectively.
The other three NER datasets use the prefixes bc2gm, linnaeus and ncbi_disease.
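Assuming all four datasets produce result files named <prefix>_results.txt and <prefix>_results.pkl (stated above for bc4chemd, assumed here for the others), the Bio-ELECTRA copy step can be done in a single loop:
for ds in bc4chemd bc2gm linnaeus ncbi_disease; do
  mkdir -p $BIO_ELECTRA_HOME/electra/pmc_results/ner/pmc_1_8M/$ds
  cp $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/${ds}_results.txt \
     $BIO_ELECTRA_HOME/electra/data/models/pmc_electra_small_1_8_M/results/${ds}_results.pkl \
     $BIO_ELECTRA_HOME/electra/pmc_results/ner/pmc_1_8M/$ds/
done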
Assuming the results are stored under $BIO_ELECTRA_HOME/electra/pmc_results/ner, the following will show the Bio-ELECTRA, ELECTRA-Small++ and Bio-ELECTRA++ test results:
cd $BIO_ELECTRA_HOME/electra
python ner_perf_stats.py
./test_qa_bert_batch.sh
./qa_bert_perf_extract.sh > /tmp/bert_qa_perf.txt
python show_bert_perf_stats.py
./train_yesno_qc_bert_batch.sh
./test_yesno_qc_bert_batch.sh
python show_bert_yesno_perf_stats.py
./train_bert_reranker_batch.sh
python show_reranker_performance.py
./train_bio_re_gad_bert_batch.sh
./test_bio_re_gad_bert_batch.sh
python show_re_performance.py --mode gad
./train_bio_re_chemprot_bert_batch.sh
./test_bio_re_chemprot_bert_batch.sh
python show_re_performance.py --mode chemprot
The four NER datasets are located under the $BIO_ELECTRA_HOME/bert_ner/data directory.
The following scripts train ten randomly initialized models on the corresponding training sets and evaluate the models on their corresponding test sets.
cd $BIO_ELECTRA_HOME/bert_ner
./run_tf_BC4CHEMD_batch.sh
./run_tf_BC2GM_batch.sh
./run_tf_linnaeus_batch.sh
./run_tf_NCBI_disease_batch.sh
python perf_stats.py --mode bc4chemd
python perf_stats.py --mode bc2gm
python perf_stats.py --mode linnaeus
python perf_stats.py --mode ncbi-disease