[won't merge - v1 codebase] Bert #1543
base: master
Conversation
Quick review of FAQ.md
minor changes to make code simpler/more explicit
@Zenglinxiao don't you think it could be possible to embed "bert_build_model" within "build_model"?
Do you have any comparison between using BERT and not using BERT on NMT tasks?
@rajarsheem this is not the aim of BERT or this PR. There are some examples of what it's for in the
onmt/encoders/bert.py
Outdated
class BertLayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-12):
        """Layernorm module in the TF style (epsilon inside the square root)."""
Does PyTorch implement it differently?
I adapted BertLayerNorm from huggingface/pytorch-transformers because it indicates "epsilon inside the square root", and the PyTorch doc says it uses eps=1e-5, so I kept the adapted version. Thanks to your reminder, I dug into the PyTorch C++ code and found it is actually the same. The separate implementation was only due to a typo in the PyTorch doc, which has already been fixed (huggingface/transformers#1089).
I'll switch that!
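For reference, here is a minimal sketch of the comparison being discussed: a TF/BERT-style layer norm with epsilon added inside the square root, checked against torch.nn.LayerNorm. The class below (TFStyleLayerNorm) is an illustration in the style of the huggingface implementation, not necessarily the exact code in this PR; with matching eps the two produce the same output, since nn.LayerNorm also computes (x - E[x]) / sqrt(Var[x] + eps).

import torch
import torch.nn as nn

class TFStyleLayerNorm(nn.Module):
    """Layer norm with epsilon inside the square root (TF/BERT style)."""
    def __init__(self, hidden_size, eps=1e-12):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = (x - mean).pow(2).mean(-1, keepdim=True)
        x = (x - mean) / torch.sqrt(var + self.eps)  # eps inside the sqrt
        return self.weight * x + self.bias

# Sanity check: matches the built-in LayerNorm up to floating-point noise.
x = torch.randn(2, 5, 16)
assert torch.allclose(TFStyleLayerNorm(16)(x), nn.LayerNorm(16, eps=1e-12)(x), atol=1e-6)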
Please check:
BertLayerNorm switched to PyTorch's original LayerNorm.
BertAdam switched to AdamW as well, but with an added correct_bias option to not compensate for bias, as in the original BERT.
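To illustrate what correct_bias controls, here is a minimal sketch of a single Adam-style update step (this shows the standard Adam bias correction, not the exact optimizer code added in this PR; the function name and defaults are placeholders). With correct_bias=False the step size is not rescaled during the first iterations, which is what the original BERT/TensorFlow optimizer does.

import math
import torch

def adam_step(p, grad, exp_avg, exp_avg_sq, step, lr=2e-5,
              betas=(0.9, 0.999), eps=1e-6, correct_bias=True):
    """Apply one Adam(W)-style update to the parameter tensor p, in place (sketch)."""
    beta1, beta2 = betas
    # Update the (biased) first and second moment estimates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    step_size = lr
    if correct_bias:
        # Standard Adam: compensate for the zero-initialized moment estimates.
        step_size *= math.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)
    # With correct_bias=False the rescaling above is skipped, as in original BERT.
    p.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-step_size)
    return p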
@@ -191,40 +213,62 @@ def build_base_model(model_opt, fields, gpu, checkpoint=None, gpu_id=None):
        pad_idx = tgt_base_field.vocab.stoi[tgt_base_field.pad_token]
        generator = CopyGenerator(model_opt.dec_rnn_size, vocab_size, pad_idx)

    if model_opt.is_bert:
        model = encoder
Why does the encoder become the whole model? If there is no decoder, then how can I use the model for machine translation?
BERT stands for Bidirectional Encoder Representations from Transformers. It is not a machine translation model.
https://arxiv.org/abs/1810.04805
I know this :) But this is a PR to a machine translation system. If machine translation cannot benefit from this PR, then why bother? If you want to use BERT for basic downstream tasks (e.g. classification), you can use Hugging Face's implementation or the reference implementation.
OpenNMT-py is not only machine translation; there are other tasks such as Speech2Text, Image2Text, Vid2Text, etc.
This PR is meant to add BERT task(s) and components to the framework so that users can experiment. Nothing stops you from trying to build a custom model using BERT pre-trained encoder layers and standard Transformer decoder layers, for instance, but I think it has been tried without great benefits.
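As an illustration of that idea (and not the API added in this PR), here is a rough sketch of pairing a pre-trained BERT encoder with a standard Transformer decoder. It assumes the Hugging Face transformers package; the class name, model name, dimensions, and vocabulary size are placeholders, and target-side positional encodings are omitted for brevity.

import torch
import torch.nn as nn
from transformers import BertModel  # assumption: Hugging Face transformers is installed

class BertEncoderSeq2Seq(nn.Module):
    """Pre-trained BERT encoder feeding a vanilla Transformer decoder (sketch)."""
    def __init__(self, tgt_vocab_size, d_model=768, nhead=8, num_layers=6):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)
        self.generator = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encode the source with BERT and use its hidden states as decoder memory.
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        tgt_len = tgt_ids.size(1)
        # Causal mask so each target position only attends to earlier positions.
        causal = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
        out = self.decoder(self.tgt_emb(tgt_ids), memory, tgt_mask=causal,
                           memory_key_padding_mask=(src_mask == 0))
        return self.generator(out)  # logits over the target vocabulary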
Hmm, but AFAIK, all the tasks you mentioned (i.e. Speech2Text, Image2Text, Video2Text) are formalised as transduction problems and, therefore, require some sort of a decoder.
Sure. All those things are not mutually exclusive.
Feel free to share any research on the forum, and eventually a PR if you have anything interesting regarding using BERT in a seq2seq task.
I would like to use BERT in a seq2seq task, but I'm confused about how to preprocess the data. Could you give me some advice?
Add BERT into OpenNMT-py.