{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
"\n",
"Instructions for setting up Colab are as follows:\n",
"1. Open a new Python 3 notebook.\n",
"2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
"3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
"4. Run this cell to set up dependencies.\n",
"\"\"\"\n",
"# If you're using Google Colab and not running locally, run this cell.\n",
"# !pip install wget\n",
"# !pip install git+https://github.com/NVIDIA/apex.git\n",
"# !pip install nemo_toolkit[nlp]\n",
"# !pip install unidecode\n",
"import os\n",
"import nemo\n",
"import nemo.collections.nlp as nemo_nlp\n",
"import numpy as np\n",
"import time\n",
"import errno\n",
"\n",
"from nemo.backends.pytorch.common.losses import CrossEntropyLossNM\n",
"from nemo.collections.nlp.nm.data_layers import BertTokenClassificationDataLayer\n",
"from nemo.collections.nlp.nm.trainables import TokenClassifier\n",
"from nemo.collections.nlp.callbacks.token_classification_callback import eval_epochs_done_callback, eval_iter_callback\n",
"from nemo.utils.lr_policies import get_lr_policy\n",
"from nemo import logging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n", | ||
"BioBERT has the same network architecture as the original BERT, but instead of Wikipedia and BookCorpus it is pretrained on PubMed, a large biomedical text corpus, which achieves better performance in biomedical downstream tasks, such as question answering(QA), named entity recognition(NER) and relationship extraction(RE). This model was trained for 1M steps. For more information please refer to the original paper https://academic.oup.com/bioinformatics/article/36/4/1234/5566506. For details about BERT please refer to https://ngc.nvidia.com/catalog/models/nvidia:bertbaseuncasedfornemo.\n", | ||
"\n", | ||
"\n", | ||
"In this notebook we're going to showcase how to train BioBERT on a biomedical named entity recognition (NER) dataset." | ||
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Download model checkpoint\n", | ||
"Download BioBert/BioMegatron checkpoints from NGC: https://ngc.nvidia.com/catalog/models and put the encoder weights \n", | ||
"at `./checkpoints/biobert/BERT.pt` or `./checkpoints/biomegatron/BERT.pt` and the model configuration file at `./checkpoints/biobert/bert_config.json` or `./checkpoints/biomegatron/bert_config.json`." | ||
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set which model to use.\n",
"model_type=\"biobert\" # \"biomegatron\"\n",
"base_checkpoint_path={'biobert': './checkpoints/biobert/', 'biomegatron': './checkpoints/biomegatron/'}\n",
"pretrained_model_name={'biobert': 'bert-base-cased', 'biomegatron': 'megatron-bert-uncased'}\n",
"do_lower_case={'biobert': False, 'biomegatron': True}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The checkpoints are available from NGC: https://ngc.nvidia.com/catalog/models\n",
"CHECKPOINT_ENCODER = os.path.join(base_checkpoint_path[model_type], 'BERT.pt') # model encoder checkpoint file\n",
"CHECKPOINT_CONFIG = os.path.join(base_checkpoint_path[model_type], 'bert_config.json') # model configuration file\n",
"\n",
"if not os.path.exists(CHECKPOINT_ENCODER):\n",
"    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), CHECKPOINT_ENCODER)\n",
"\n",
"if not os.path.exists(CHECKPOINT_CONFIG):\n",
"    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), CHECKPOINT_CONFIG)"
]
},
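{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, print the model configuration file to confirm it matches the checkpoint you downloaded. This is just a sanity check and assumes `bert_config.json` is a standard BERT-style JSON config."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: print the model configuration.\n",
"# Assumes bert_config.json is a standard BERT-style JSON config.\n",
"import json\n",
"\n",
"with open(CHECKPOINT_CONFIG) as f:\n",
"    print(json.dumps(json.load(f), indent=2))"
]
},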
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Download training data\n",
"In this example we download the NCBI-disease NER dataset to `./datasets/ncbi-disease` using `token_classification/get_medical_data.py`, then convert it from IOB format with `token_classification/import_from_iob_format.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_dir=\"./datasets\"\n",
"dataset=\"ncbi-disease\"\n",
"!mkdir -p $data_dir\n",
"!python ../token_classification/get_medical_data.py --data_dir=$data_dir --dataset=$dataset\n",
"!python ../token_classification/import_from_iob_format.py --data_file=$data_dir/$dataset/train.tsv\n",
"!python ../token_classification/import_from_iob_format.py --data_file=$data_dir/$dataset/test.tsv\n",
"!python ../token_classification/import_from_iob_format.py --data_file=$data_dir/$dataset/dev.tsv\n",
"!ls -l $data_dir/$dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the previous step, you should have a `./datasets/ncbi-disease` folder that contains the following files:\n",
"- labels_train.txt\n",
"- labels_dev.txt\n",
"- labels_test.txt\n",
"- text_train.txt\n",
"- text_dev.txt\n",
"- text_test.txt\n",
"\n",
"The format of the data is described in the NeMo documentation; the sanity check below shows what it looks like."
]
},
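{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sanity check, assuming the conversion above succeeded:\n",
"# each line of a text file is one whitespace-separated sentence, and\n",
"# the matching line of the labels file holds one IOB tag per token.\n",
"with open(f\"{data_dir}/{dataset}/text_train.txt\") as f:\n",
"    print(f.readline().strip())\n",
"with open(f\"{data_dir}/{dataset}/labels_train.txt\") as f:\n",
"    print(f.readline().strip())"
]
},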
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create Neural Modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_checkpoint=CHECKPOINT_ENCODER # language model encoder file\n", | ||
"model_config=CHECKPOINT_CONFIG # model configuration file\n", | ||
"train_data_text_file=f\"{data_dir}/{dataset}/text_train.txt\"\n", | ||
"train_data_label_file=f\"{data_dir}/{dataset}/labels_train.txt\"\n", | ||
"eval_data_text_file=f\"{data_dir}/{dataset}/text_dev.txt\"\n", | ||
"eval_data_label_file=f\"{data_dir}/{dataset}/labels_dev.txt\"\n", | ||
"none_label=\"O\" \n", | ||
"num_labels=3 # this should be the same number as number of labels in the training data\n", | ||
"fc_dropout=0.1\n", | ||
"max_seq_length=128\n", | ||
"batch_size=32\n", | ||
"\n", | ||
"nf = nemo.core.NeuralModuleFactory(\n", | ||
" backend=nemo.core.Backend.PyTorch,\n", | ||
" placement=nemo.core.DeviceType.GPU\n", | ||
")\n", | ||
"model = nemo_nlp.nm.trainables.get_pretrained_lm_model(\n", | ||
" config=model_config, pretrained_model_name=pretrained_model_name[model_type], checkpoint=model_checkpoint\n", | ||
" )\n", | ||
"tokenizer = nemo.collections.nlp.data.tokenizers.get_tokenizer(\n", | ||
" tokenizer_name='nemobert',\n", | ||
" pretrained_model_name=pretrained_model_name[model_type],\n", | ||
" do_lower_case=do_lower_case[model_type]\n", | ||
")\n", | ||
"hidden_size = model.hidden_size\n", | ||
"classifier = TokenClassifier(hidden_size=hidden_size, num_classes=num_labels, dropout=fc_dropout, num_layers=1)\n", | ||
"task_loss = CrossEntropyLossNM(logits_ndim=3)\n", | ||
"train_data_layer = BertTokenClassificationDataLayer(\n", | ||
" tokenizer=tokenizer,\n", | ||
" text_file=train_data_text_file,\n", | ||
" label_file=train_data_label_file,\n", | ||
" pad_label=none_label,\n", | ||
" label_ids=None,\n", | ||
" max_seq_length=max_seq_length,\n", | ||
" batch_size=batch_size,\n", | ||
" shuffle=True,\n", | ||
" use_cache=True\n", | ||
")\n", | ||
"eval_data_layer = BertTokenClassificationDataLayer(\n", | ||
" tokenizer=tokenizer,\n", | ||
" text_file=eval_data_text_file,\n", | ||
" label_file=eval_data_label_file,\n", | ||
" pad_label=none_label,\n", | ||
" label_ids=train_data_layer.dataset.label_ids,\n", | ||
" max_seq_length=max_seq_length,\n", | ||
" batch_size=batch_size,\n", | ||
" shuffle=False,\n", | ||
" use_cache=False,\n", | ||
")" | ||
] | ||
}, | ||
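{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training data layer infers the label-to-id mapping from `labels_train.txt`, and the classifier's `num_classes` must match its size. The quick consistency check below makes that assumption explicit."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Consistency check: the label-to-id mapping inferred from the\n",
"# training labels must have exactly num_labels entries.\n",
"label_ids = train_data_layer.dataset.label_ids\n",
"logging.info(f\"label ids: {label_ids}\")\n",
"assert len(label_ids) == num_labels"
]
},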
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create Neural Graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_data = train_data_layer()\n",
"train_hidden_states = model(input_ids=train_data.input_ids, token_type_ids=train_data.input_type_ids, attention_mask=train_data.input_mask)\n",
"train_logits = classifier(hidden_states=train_hidden_states)\n",
"loss = task_loss(logits=train_logits, labels=train_data.labels, loss_mask=train_data.loss_mask)\n",
"# If you're training on multiple GPUs, this should be\n",
"# len(train_data_layer) // (batch_size * batches_per_step * num_gpus)\n",
"train_steps_per_epoch = len(train_data_layer) // batch_size\n",
"logging.info(f\"doing {train_steps_per_epoch} steps per epoch\")\n",
"\n",
"eval_data = eval_data_layer()\n",
"eval_hidden_states = model(input_ids=eval_data.input_ids, token_type_ids=eval_data.input_type_ids, attention_mask=eval_data.input_mask)\n",
"eval_logits = classifier(hidden_states=eval_hidden_states)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create Callbacks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_callback = nemo.core.SimpleLossLoggerCallback(\n", | ||
" tensors=[loss],\n", | ||
" print_func=lambda x: logging.info(\"Loss: {:.3f}\".format(x[0].item())),\n", | ||
" get_tb_values=lambda x: [[\"loss\", x[0]]],\n", | ||
" step_freq=100,\n", | ||
" tb_writer=nf.tb_writer,\n", | ||
")\n", | ||
"\n", | ||
"# Callback to evaluate the model\n", | ||
"eval_callback = nemo.core.EvaluatorCallback(\n", | ||
" eval_tensors=[eval_logits, eval_data.labels, eval_data.subtokens_mask],\n", | ||
" user_iter_callback=lambda x, y: eval_iter_callback(x, y),\n", | ||
" user_epochs_done_callback=lambda x: eval_epochs_done_callback(x, train_data_layer.dataset.label_ids, f'{nf.work_dir}/graphs'),\n", | ||
" tb_writer=nf.tb_writer,\n", | ||
" eval_step=100\n", | ||
" )" | ||
] | ||
}, | ||
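{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you also want to save model weights during training, NeMo provides `nemo.core.CheckpointCallback`. The sketch below assumes the NeMo 0.x signature (`folder`, `step_freq`); adjust it to your version and add the callback to the `callbacks` list passed to `nf.train` below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: save checkpoints roughly once per epoch during training.\n",
"# A sketch assuming the NeMo 0.x CheckpointCallback signature\n",
"# (folder, step_freq); adjust if your version differs.\n",
"ckpt_callback = nemo.core.CheckpointCallback(\n",
"    folder=nf.checkpoint_dir,\n",
"    step_freq=train_steps_per_epoch,\n",
")\n",
"# To use it, pass callbacks=[train_callback, eval_callback, ckpt_callback]\n",
"# to nf.train in the training cell."
]
},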
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training\n",
"Training could take several minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_epochs=10\n",
"lr_warmup_proportion=0.1\n",
"lr=4e-5\n",
"weight_decay=0.01\n",
"lr_policy_fn = get_lr_policy(\n",
"    \"WarmupAnnealing\", total_steps=num_epochs * train_steps_per_epoch, warmup_ratio=lr_warmup_proportion\n",
")\n",
"nf.train(\n",
"    tensors_to_optimize=[loss],\n",
"    callbacks=[train_callback, eval_callback],\n",
"    lr_policy=lr_policy_fn,\n",
"    optimizer=\"adam_w\",\n",
"    optimization_params={\"num_epochs\": num_epochs, \"lr\": lr, \"weight_decay\": weight_decay},\n",
")"
]
},
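{
"cell_type": "markdown",
"metadata": {},
"source": [
"After training you can collect the evaluation logits with `nf.infer` and turn them into tag predictions. The cell below is a sketch, assuming `nf.infer` returns one list of per-batch tensors for each requested tensor (NeMo 0.x behavior); padded positions are not masked out here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of post-training inference on the evaluation set, assuming\n",
"# nf.infer returns one list of per-batch tensors per requested tensor.\n",
"import torch\n",
"\n",
"logits_batches, label_batches = nf.infer(tensors=[eval_logits, eval_data.labels])\n",
"preds = torch.cat([batch.argmax(dim=-1) for batch in logits_batches])\n",
"id_to_label = {v: k for k, v in train_data_layer.dataset.label_ids.items()}\n",
"# Show the predicted tags for the first ten subtokens of the first\n",
"# example (note: this includes special and padding subtokens).\n",
"logging.info([id_to_label[int(p)] for p in preds[0][:10]])"
]
},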
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result should look something like this:\n",
"```\n",
"[NeMo I 2020-05-22 17:13:48 token_classification_callback:82] Accuracy: 0.9882348032875798\n",
"[NeMo I 2020-05-22 17:13:48 token_classification_callback:86] F1 weighted: 98.82\n",
"[NeMo I 2020-05-22 17:13:48 token_classification_callback:86] F1 macro: 93.74\n",
"[NeMo I 2020-05-22 17:13:48 token_classification_callback:86] F1 micro: 98.82\n",
"[NeMo I 2020-05-22 17:13:49 token_classification_callback:89]                 precision    recall  f1-score   support\n",
"\n",
"    O (label id: 0)     0.9938    0.9957    0.9947     22092\n",
"    B (label id: 1)     0.8843    0.9034    0.8938       787\n",
"    I (label id: 2)     0.9505    0.8982    0.9236      1090\n",
"\n",
"           accuracy                         0.9882     23969\n",
"          macro avg     0.9429    0.9324    0.9374     23969\n",
"       weighted avg     0.9882    0.9882    0.9882     23969\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
} |