**Note:** I no longer maintain this repo. There are now far better repositories for keyphrase extraction than this one.
Deep Keyphrase extraction using SciBERT.
- Clone this repository and install `pytorch-pretrained-BERT`.
- From the `scibert` repo, untar the weights (rename their weight dump file to `pytorch_model.bin`) and the vocab file into a new folder `model`.
- Change the parameters accordingly in `experiments/base_model/params.json` (a hypothetical sketch of this file follows the list). We recommend a batch size of 4 and a sequence length of 512, with 6 epochs, if your GPU's VRAM is around 11 GB.
- For training, run:

  ```
  python train.py --data_dir data/task1/ --bert_model_dir model/ --model_dir experiments/base_model
  ```

- For evaluation, run:

  ```
  python evaluate.py --data_dir data/task1/ --bert_model_dir model/ --model_dir experiments/base_model --restore_file best
  ```
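The key names below are assumptions, not the repo's actual schema; check the real `experiments/base_model/params.json` before editing. The values match the recommendation above:

```python
# Hypothetical sketch of experiments/base_model/params.json -- the real key
# names may differ. Values follow the recommendation above for ~11 GB of VRAM.
params = {
    "batch_size": 4,        # 512-token SciBERT batches that fit in ~11 GB
    "max_seq_length": 512,  # BERT's maximum input length
    "num_epochs": 6,
    "learning_rate": 5e-5,  # common BERT fine-tuning default (assumption)
}
```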
We used the IO tagging format here. Unlike the original SciBERT repo, we use only a simple linear layer on top of the token embeddings.
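A minimal sketch of that idea, assuming the `pytorch-pretrained-BERT` API (class and variable names here are ours, not the repo's):

```python
import torch.nn as nn
from pytorch_pretrained_bert import BertModel

class BertTokenTagger(nn.Module):
    """Sketch: SciBERT encoder + one linear layer scoring each token's tag."""

    def __init__(self, bert_model_dir, num_tags=2):  # IO scheme: I and O
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_dir)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask):
        # Take only the last encoder layer: one hidden vector per token.
        sequence_output, _ = self.bert(
            input_ids,
            attention_mask=attention_mask,
            output_all_encoded_layers=False,
        )
        return self.classifier(sequence_output)  # (batch, seq_len, num_tags)
```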
On the test set, we got:
- F1 score: 0.6259
- Precision: 0.5986
- Recall: 0.6558
- Support: 921
We used the BIO tagging format here, classifying keyphrases into Process, Material, and Task (see the tagging example after the table). The overall F1 score was 0.4981 on the test set.
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Process | 0.4734 | 0.5207 | 0.4959 | 870 |
| Material | 0.4958 | 0.6617 | 0.5669 | 807 |
| Task | 0.2125 | 0.2537 | 0.2313 | 201 |
| Avg | 0.4551 | 0.5527 | 0.4981 | 1878 |
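To illustrate the difference between the two tagging schemes, here is a toy example; the tokens and labels are invented, not taken from the dataset:

```python
# Invented example: the same sentence under IO and BIO tagging.
tokens = ["genetic", "algorithms", "optimize", "the", "fitness", "function"]

# IO: each token is either Inside a keyphrase or Outside; there is no
# boundary marker, so two adjacent keyphrases cannot be told apart.
io_tags = ["I", "I", "O", "O", "I", "I"]

# BIO: B- marks the first token of a keyphrase, I- its continuation, and the
# suffix carries the class (Process / Material / Task in this dataset).
bio_tags = ["B-Process", "I-Process", "O", "O", "B-Material", "I-Material"]
```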
- Some tokens have more than one annotation; we did not consider multi-label classification.
- We only used a linear layer on top of the BERT embeddings. It remains to be seen whether SciBERT + BiLSTM + CRF makes a difference (a sketch of that variant follows below).
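For reference, a hypothetical sketch of that SciBERT + BiLSTM + CRF variant, using the third-party `pytorch-crf` package. This is untested and not part of this repo:

```python
import torch.nn as nn
from pytorch_pretrained_bert import BertModel
from torchcrf import CRF  # third-party `pytorch-crf` package

class BertBiLstmCrf(nn.Module):
    """Hypothetical SciBERT -> BiLSTM -> CRF tagger (untested sketch)."""

    def __init__(self, bert_model_dir, num_tags, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_dir)
        self.lstm = nn.LSTM(
            self.bert.config.hidden_size,
            lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden, _ = self.bert(
            input_ids,
            attention_mask=attention_mask,
            output_all_encoded_layers=False,
        )
        lstm_out, _ = self.lstm(hidden)         # contextualize token vectors
        emissions = self.emissions(lstm_out)    # per-token tag scores
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi-decoded best tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)
```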