add conv sequence to sequence model #686
Conversation
fluid/conv_seq_to_seq/utils.py
Outdated
self.atom = Atom(pd.fc, "fc", size=size, dropout=dropout)

def __call__(self, x):
    # pd.fc takes dims[1:] as the projection's input, need reshape to avoid that
The num_flatten_dims attribute of fc can be used to handle the input shape without an explicit reshape, but it doesn't matter much, just a reminder.
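Roughly like this, just a sketch with made-up sizes:

import paddle.fluid as fluid

# x: [batch, seq_len, channels]; the sizes here are only illustrative
x = fluid.layers.data(name='x', shape=[10, 32], dtype='float32')

# num_flatten_dims=2 flattens the leading two dims into the projection's rows,
# so fc is applied per time step and no explicit reshape is needed
y = fluid.layers.fc(input=x, size=64, num_flatten_dims=2)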
fluid/conv_seq_to_seq/model.py
Outdated
self.fc2 = Linear(embed_dim)

def forward(self, src_tokens, src_positions):
Maybe we can also add __call__ in ConvEncoder to unify the interface with other modules and layers. It is also the way modules are called in PyTorch and fairseq-py. Just a personal preference, feel free to ignore it.
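Something like this, just a sketch (the forward body is a placeholder):

class ConvEncoder(object):
    def forward(self, src_tokens, src_positions):
        # the existing encoder computation stays here
        return src_tokens, src_positions

    def __call__(self, *args, **kwargs):
        # let the module be invoked like a layer: encoder(tokens, positions)
        return self.forward(*args, **kwargs)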
fluid/conv_seq_to_seq/model.py
Outdated
self.fc2 = Linear(out_embed_dim)
self.fc3 = Linear(dict_size + 1, dropout=dropout)

def forward(self, prev_output_tokens, prev_positions, encoder_out):
Maybe we can also add __call__ in ConvDecoder to unify the interface with other modules and layers, the same as the ConvEncoder suggestion above. Just a personal preference, feel free to ignore it.
fluid/conv_seq_to_seq/model.py
Outdated
pos_pad_id,
dropout=0.1):
    self.dropout = dropout
    self.embed_tokens = Embedding(dict_size + 1, embed_dim, pad_id)
I guess dict_size + 1 is used to include the padding index. I would personally suggest handling this special detail outside the module, so that the module stays generic.
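For example, something like this where the model is assembled; the numbers and the build_embedding helper are only for illustration, not the actual code in this PR:

def build_embedding(num_embeddings, embed_dim, pad_id):
    # stand-in for the Embedding wrapper used in this PR
    return dict(num=num_embeddings, dim=embed_dim, pad=pad_id)

dict_size = 30000            # raw vocabulary size (illustrative)
pad_id = dict_size           # reserve one extra index for padding
embed_tokens = build_embedding(dict_size + 1, 256, pad_id)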
fluid/conv_seq_to_seq/model.py
Outdated
in_channels,
out_channels * 2,
kernel_size,
# padding=pad,
In the decoder, to guarantee no future information is available, left_pad, right_pad = [kernel_size - 1, 0] might be proper according to the PyTorch code and the paper. But currently layers.conv can only pad the same size on the left and the right, which PyTorch also suffers from (it seems that layers.sequence_conv supports this padding scheme, but the Python wrapper needs to be enhanced). To achieve left_pad, right_pad = [kernel_size - 1, 0], we may need to use pad = kernel_size - 1 for conv followed by a slice/split to remove the future time steps created by the padding, which is also what fairseq-py does. The following is the corresponding code, though I can't find it in the latest fairseq-py repo.
for i, (out_channels, kernel_size) in enumerate(convolutions):
    pad = kernel_size - 1
    self.projections.append(Linear(in_channels, out_channels)
                            if in_channels != out_channels else None)
    self.convolutions.append(
        LinearizedConv1d(in_channels, out_channels * 2, kernel_size,
                         padding=pad, dropout=dropout))
    self.attention.append(AttentionLayer(out_channels, embed_dim)
                          if attention[i] else None)
    in_channels = out_channels

for proj, conv, attention in zip(self.projections, self.convolutions, self.attention):
    residual = x if proj is None else proj(x)
    x = F.dropout(x, p=self.dropout, training=self.training)
    x = conv(x)
    x = conv.remove_future_timesteps(x)
    x = F.glu(x)

def remove_future_timesteps(self, x):
    """Remove future time steps created by padding."""
    if not self._is_incremental_eval and self.kernel_size[0] > 1 and self.padding[0] > 0:
        x = x[:-self.padding[0], :, :]
    return x
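As a standalone illustration of the trick, here is my own sketch with a plain nn.Conv1d on B x C x T input (not the code from this PR):

import torch
import torch.nn as nn

B, C_in, C_out, T, k = 2, 8, 16, 5, 3
x = torch.randn(B, C_in, T)

# pad both sides by kernel_size - 1, then drop the trailing steps so that
# position t only sees inputs <= t (left_pad = k - 1, right_pad = 0)
conv = nn.Conv1d(C_in, C_out, k, padding=k - 1)
y = conv(x)           # shape: B x C_out x (T + k - 1)
y = y[:, :, :T]       # remove the future time steps created by padding
print(y.shape)        # torch.Size([2, 16, 5])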
fluid/conv_seq_to_seq/utils.py
Outdated
x = self.atom(x)

x = Op.reshape(x, (B, T, -1))
Will it encounter the same problem as Transformer, since both B and T may change? Maybe we can use (B, -1, C) to relax the sequence length, but I am not sure whether it will affect the infershape of conv. Also, the reshape op has been enhanced: PaddlePaddle/Paddle#9008
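A sketch of the relaxed reshape with the enhanced op; B, T and C here are only illustrative, and the data layer just fakes the fc output to keep the snippet self-contained:

import paddle.fluid as fluid

B, C = 32, 64    # batch size and channels from the config
T = 20           # only used to fake the input; the reshape below does not fix it

# stand-in for the fc output of shape [B * T, C]
x = fluid.layers.data(name='x', shape=[B * T, C], dtype='float32',
                      append_batch_size=False)

# keep B and C fixed and let the time dimension be inferred (-1),
# relying on the enhanced reshape op (PaddlePaddle/Paddle#9008)
y = fluid.layers.reshape(x, shape=[B, -1, C])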
fluid/conv_seq_to_seq/model.py
Outdated
self.attention):
    residual = x if proj is None else proj(x)
    x = Op.dropout(x, self.dropout)
    x = conv(x)
It seems that the wrapped Conv1D requests input with shape B x T x C, but the x = self._transpose_if_training(x) before it seems to change the shape to T x B x C. I am a little confused about this.

I can see that Conv1D can work correctly in the encoder, since the input shape there is B x T x C, while the shape of the conv input here is T x B x C, which may not be suitable for the wrapped Conv1D. Since the wrapped Conv1D contains a keep-length padding scheme, which makes the input and output have the same shape, it may run, but I feel the computation might be incorrect.

For reference, in fairseq-py both ConvTBC (the convolution for the encoder) and LinearizedConv1d (the convolution for the decoder) request input with shape T x B x C, which differs from the Conv1D wrapped here.
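If the wrapped Conv1D really expects B x T x C, one option is to transpose around the convolution and keep the rest of the T x B x C pipeline unchanged; a sketch with fluid's transpose (not code from this PR, shapes are illustrative):

import paddle.fluid as fluid

T, B, C = 20, 32, 64
x = fluid.layers.data(name='x', shape=[T, B, C], dtype='float32',
                      append_batch_size=False)

# T x B x C -> B x T x C before the batch-first Conv1D wrapper ...
x_btc = fluid.layers.transpose(x, perm=[1, 0, 2])
# ... and back to T x B x C afterwards
x_tbc = fluid.layers.transpose(x_btc, perm=[1, 0, 2])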
fluid/conv_seq_to_seq/model.py
Outdated
x = fluid.nets.glu(x, dim=2)

if attention is not None:
    x = self._transpose_if_training(x)
I think this is also used when the shape of the input x is T x B x C, to convert it to B x T x C for the attention calculation.

Maybe we can remove _transpose_if_training and the transpose included in _split_encoder_out, and unify the shapes to B x T x C temporarily, if we can confirm that the Conv1D wrapped here is not suitable for T x B x C.

I am not sure why fairseq-py uses the T x B x C shape for convolution; maybe it is faster, according to the annotations.
fluid/conv_seq_to_seq/utils.py
Outdated
if sent[-1] != end_id:
    sent.append(end_id)
if len(sent) < max_len:
    sent += [pad_id for i in xrange(max_len - len(sent))]
Maybe we can pad the batch data according to the max length in the current batch and allow the max sequence length to vary across batches, but this may also affect the model configuration, such as the x = Op.reshape(x, (B, T, -1)) mentioned above.
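A small sketch of per-batch padding in plain Python (the function name is just for illustration):

def pad_batch(batch, pad_id, end_id):
    # append end_id if missing, then pad every sentence only to the
    # longest length in this batch
    batch = [list(sent) + [end_id] if sent[-1] != end_id else list(sent)
             for sent in batch]
    max_len = max(len(sent) for sent in batch)
    return [sent + [pad_id] * (max_len - len(sent)) for sent in batch]

# the second sentence is padded to this batch's max length 4, not a global max_len
print(pad_batch([[3, 4, 5, 2], [6, 7]], pad_id=0, end_id=2))
# [[3, 4, 5, 2], [6, 7, 2, 0]]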
fluid/conv_seq_to_seq/model.py
Outdated
encoder_a, encoder_b = encoder_out
# here just a trick
encoder_a = Op.transpose(encoder_a, [1, 2])
encoder_b = Op.transpose(encoder_b, [1, 2])
Is this because encoder_a, encoder_b = self._split_encoder_out(encoder_out) converts the shape from B x T x C to B x C x T?

Maybe we can remove these two transposes and use x = pd.matmul(x, encoder_a, transpose_y=False) for the dot-product calculation.
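A sketch of the dot-product scores without the extra transposes, assuming x is B x T_dec x C and encoder_a is already B x C x T_enc (shapes are illustrative; fluid's matmul does batched matrix multiplication on 3-D inputs):

import paddle.fluid as fluid

B, T_dec, T_enc, C = 32, 10, 20, 64
x = fluid.layers.data(name='x', shape=[B, T_dec, C], dtype='float32',
                      append_batch_size=False)
encoder_a = fluid.layers.data(name='encoder_a', shape=[B, C, T_enc],
                              dtype='float32', append_batch_size=False)

# B x T_dec x C times B x C x T_enc -> B x T_dec x T_enc attention scores
scores = fluid.layers.matmul(x, encoder_a, transpose_y=False)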
Because the organization of the directory has been changed, can you move the code from fluid into fluid/neural_machine_translation so that we can merge this PR as soon as possible (once it passes the code review)?
fixes: #687