design/sequence decoder #4905
Conversation
doc/design/ops/sequence_decoder.md
Outdated
## Beam Search
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set.

It is the core component of Sequence Decoder.
Sequence Decoder
=> *sequence decoder*
doc/design/ops/sequence_decoder.md
Outdated
It is the core component of Sequence Decoder.

In the original implementation of `RecurrentGradientMachine`, the beam search is a method in RNN,
Are there multiple implementations of `RecurrentGradientMachine`? What are the other implementations besides the "original" one?
doc/design/ops/sequence_decoder.md
Outdated
@@ -0,0 +1,436 @@
# A LoD-based Sequence Decoder
In tasks such as machine translation and image to text,
a **sequence decoder** is necessary to generate sequences.
a [sequence decoder](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) is necessary
doc/design/ops/sequence_decoder.md
Outdated
@@ -0,0 +1,436 @@
# A LoD-based Sequence Decoder
Design: Sequence Decoder Generating LoDTensors
For example, the RNN sates, candidates IDs and probabilities of beam search can be represented as `LoDTensors`;
the selected candidate's IDs in each time step can be stored in a `TensorArray`, and `Packed` to the sentences translated.

## Changing LoD's absolute offset to relative offsets
This is not part of the design. This is an issue. I copy-n-pasted the content into issue #4945
Some simple comments first.
the first level represents that there are two sequences,
their offsets in the second-level LoD is `[0, 3)` and `[3, 5)`.
Is it `[3, 5)` or `[3, 6)` here? I think it is the latter.
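For readers following this offset discussion, here is a minimal plain-Python sketch of how a two-level LoD stored as absolute offsets can be read as half-open ranges, and one possible "relative" re-expression as per-sequence lengths. The LoD values and the length-based representation are illustrative assumptions, not the design's or the document's definition of relative offsets.

```python
# Plain-Python sketch (not PaddlePaddle code); the LoD below is made up for
# illustration and is not the example being discussed in the document.
absolute_lod = [
    [0, 2, 5],           # level 0: two sequences -> sub-sequence ranges [0, 2) and [2, 5)
    [0, 3, 4, 6, 8, 9],  # level 1: five sub-sequences -> row ranges in the underlying tensor
]

def ranges(level):
    """Read absolute offsets as half-open (begin, end) pairs."""
    return [(level[i], level[i + 1]) for i in range(len(level) - 1)]

def to_lengths(level):
    """One possible 'relative' form: per-sequence lengths instead of absolute offsets."""
    return [level[i + 1] - level[i] for i in range(len(level) - 1)]

print(ranges(absolute_lod[0]))      # [(0, 2), (2, 5)]
print(to_lengths(absolute_lod[0]))  # [2, 3]
print(to_lengths(absolute_lod[1]))  # [3, 1, 2, 2, 1]
```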
the first level represents that there are two sequences,
their offsets in the second-level LoD is `[0, 3)` and `[3, 5)`.

The second level is the same with the relative offset example because the lower level is a tensor.
Should this be "the same as" or "the same to"?
# for example
# decoder_mem.lod is
# [[0 1 3],
#  [0 1 3 6]]
Why is `decoder_mem` a sequence here? Usually, the memory of an RNN is its hidden state from the last time step. Is the memory here a `TensorArray`? Have we decided to memorize the hidden states of all the previous time steps? If so, is this design compatible with the memory in dynamic RNN? Or maybe this memory is something different?
# its tensor content is [a1 a2 a3 a4 a5]
# which means there are 2 sentences to translate
# - the first sentence has 1 translation prefixes, the offsets are [0, 1)
# - the second sentence has 2 translation prefixes, the offsets are [1, 3) and [3, 6)
Does this mean the memory needs to memorize the entire set of unfinished prefixes? (This is required for beam search.)
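To make the two-level LoD in the snippet above concrete, the following plain-Python sketch (not PaddlePaddle code) walks `decoder_mem.lod` exactly as quoted and prints which tensor-row ranges belong to each source sentence's translation prefixes.

```python
# Plain-Python illustration of reading the two-level LoD quoted in the snippet above.
lod = [
    [0, 1, 3],    # level 0: 2 source sentences -> prefix groups [0, 1) and [1, 3)
    [0, 1, 3, 6], # level 1: 3 prefixes -> row ranges [0, 1), [1, 3), [3, 6) in the tensor
]

for sent_idx in range(len(lod[0]) - 1):
    begin, end = lod[0][sent_idx], lod[0][sent_idx + 1]
    print(f"sentence {sent_idx}: {end - begin} prefix(es)")
    for prefix_idx in range(begin, end):
        lo, hi = lod[1][prefix_idx], lod[1][prefix_idx + 1]
        print(f"  prefix {prefix_idx}: tensor rows [{lo}, {hi})")

# Expected output:
# sentence 0: 1 prefix(es)
#   prefix 0: tensor rows [0, 1)
# sentence 1: 2 prefix(es)
#   prefix 1: tensor rows [1, 3)
#   prefix 2: tensor rows [3, 6)
```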
# the following has 2, 3, 2, 3 candidates
# the encoder_ctx_expanded's content will be
# [a1 a1 a2 a2 a3 a3 a3 a4 a4 a5 a5 a5]
encoder_ctx_expanded = pd.lod_expand(encoder_ctx, target_word)
The name `target_word` is confusing. During generation, we do not have target words.
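As a reading aid, here is a NumPy sketch of what I understand `lod_expand` to do in the snippet above: repeat each input row so it lines up with the number of candidates per row. The function name, the length-based interface, and the data are assumptions for illustration, not the actual operator.

```python
import numpy as np

def lod_expand(x, lengths):
    """Repeat row i of `x` `lengths[i]` times (illustrative stand-in for pd.lod_expand)."""
    assert x.shape[0] == len(lengths)
    return np.repeat(x, lengths, axis=0)

# Illustrative data: 3 encoder-context rows, expanded to 2, 3 and 1 candidates.
encoder_ctx = np.array([[1.0], [2.0], [3.0]])  # rows standing in for a1, a2, a3
candidate_lengths = [2, 3, 1]
print(lod_expand(encoder_ctx, candidate_lengths).ravel())
# [1. 1. 2. 2. 2. 3.]  -> a1 a1 a2 a2 a2 a3
```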
# which means there are 2 sentences to translate
# - the first sentence has 1 translation prefixes, the offsets are [0, 1)
# - the second sentence has 2 translation prefixes, the offsets are [1, 3) and [3, 6)
# the target_word.lod is
The name `target_word` is really confusing for generation. During generation, we only have source words to be translated into target words.
doc/design/ops/sequence_decoder.md
Outdated
bias=None,
act=pd.activation.Softmax())
# topk_scores, a tensor, [None, k]
topk_scores, topk_ids = pd.top_k(scores)
This `top_k` is special. It is not just a simple "select the top k items"; it also "selects the top k from a distribution to form a new batch". How do we handle this?
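To illustrate the distinction this comment draws, here is a NumPy sketch (my own illustration, not the proposed operator) of the second behavior: picking the top k entries per row of a batch of distributions and flattening the results so the batch grows by a factor of k.

```python
import numpy as np

def top_k_to_new_batch(scores, k):
    """For each row of `scores`, pick the k highest entries, then flatten the
    (batch, k) results into a new batch of size batch * k."""
    topk_ids = np.argsort(-scores, axis=1)[:, :k]              # (batch, k), descending
    topk_scores = np.take_along_axis(scores, topk_ids, axis=1) # gather matching scores
    # flatten: each original row now contributes k rows to the new batch
    return topk_scores.reshape(-1), topk_ids.reshape(-1)

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
flat_scores, flat_ids = top_k_to_new_batch(scores, k=2)
print(flat_ids)     # [1 2 0 1]  -> 2 candidates per source row; batch grows from 2 to 4
print(flat_scores)  # [0.7 0.2 0.5 0.3]
```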
doc/design/ops/sequence_decoder.md
Outdated
# selected_ids is the selected candidates that will be append to the translation
# selected_scores is the scores of the selected candidates
# generated_scores is the score of the translations(with candidates appended)
selected_ids, selected_scores, generated_scores = decoder.beam_search(
First, I want to leave some of my questions here for our discussion.
The difficulties of implementing dynamic beam search with some preliminary operators lie in (but are maybe not limited to):
- How to loop.
  - Maybe this is done by dynamic RNN currently, but the loop is a generally used operation; should it be independent of RNN?
- How to stop the loop (the condition operation?).
  - Samples in a batch may hit different conditions; as a result, the batch size changes dynamically.
  - Maybe all branches of a condition will be executed? I am not sure what the design of the current condition operator is, or whether we have decided to use it in beam search.
- How to construct the beam?
  - Dynamic expansion forms a larger batch, but we have to track the entire set of unfinished prefixes.
  - How do we track the unfinished prefixes, and what tracks them? It seems that in the current design the decoder memory tracks this?
- How to shrink the beam?
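As a reference point for these questions, below is a generic plain-Python beam search loop. It is a sketch of the algorithm in general, not the proposed operators and not `RecurrentGradientMachine`; the `step` callback standing in for one decoder step is an assumption. It marks where the loop, the stop condition, beam expansion, prefix tracking, and beam shrinking appear.

```python
import math

def beam_search(step, start_id, end_id, beam_size, max_len):
    """Generic beam search sketch. `step(prefix)` must return a list of
    (token_id, log_prob) candidates; it stands in for one decoder step."""
    beam = [(0.0, [start_id])]   # each hypothesis tracks its full prefix
    finished = []
    for _ in range(max_len):                      # the loop
        candidates = []
        for score, prefix in beam:                # expand: every live prefix proposes successors
            for token, logp in step(prefix):
                candidates.append((score + logp, prefix + [token]))
        # shrink: keep only the best `beam_size` expanded hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, prefix in candidates[:beam_size]:
            if prefix[-1] == end_id:
                finished.append((score, prefix))  # this sample "hits the condition" and leaves the batch
            else:
                beam.append((score, prefix))
        if not beam:                              # stop condition: nothing left to extend
            break
    return sorted(finished + beam, key=lambda x: x[0], reverse=True)

# Toy decoder step: always proposes token 1 or the end token 2.
toy_step = lambda prefix: [(1, math.log(0.6)), (2, math.log(0.4))]
print(beam_search(toy_step, start_id=0, end_id=2, beam_size=2, max_len=3)[0])
```

Note that in this sketch the beam itself carries the unfinished prefixes, which is the role the discussion above assigns to the decoder memory.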
LGTM!
An LoD-based Sequence Decoder (Beam Search)
refactor the `beamSearch` in `RecurrentGradientMachine`