design/sequence decoder #4905

Merged
Superjomn merged 14 commits into PaddlePaddle:develop from design/sequence_decoder on Nov 9, 2017

Conversation

Superjomn (Contributor) commented:

An LoD-based Sequence Decoder (Beam Search)

Refactor the `beamSearch` method in `RecurrentGradientMachine`.

## Beam Search
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set.

It is the core component of Sequence Decoder.

Collaborator:
Sequence Decoder => *sequence decoder*
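
For readers new to the algorithm described above, the following is a minimal, framework-free Python sketch of beam search. The `step` callback, the token ids, and the parameter names are illustrative assumptions, not part of this PR.

```python
def beam_search(step, start_state, start_token, eos_token, beam_size, max_len):
    """Minimal beam search: keep only the `beam_size` most promising prefixes.

    `step(state, token)` is assumed to return (new_state, log_probs), where
    `log_probs` is a list of log-probabilities over the vocabulary.
    """
    beam = [(0.0, [start_token], start_state)]   # (score, tokens, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beam:
            new_state, log_probs = step(state, tokens[-1])
            for token, log_prob in enumerate(log_probs):
                candidates.append((score + log_prob, tokens + [token], new_state))
        # Prune: keep only the `beam_size` best expansions.
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            if cand[1][-1] == eos_token:
                finished.append(cand)   # finished hypotheses leave the beam
            else:
                beam.append(cand)
        if not beam:
            break
    return sorted(finished + beam, key=lambda cand: cand[0], reverse=True)
```

The design in this PR batches this procedure over many source sentences at once, which is where the LoD levels discussed below come in.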


It is the core component of Sequence Decoder.

In the original implementation of `RecurrentGradientMachine`, the beam search is a method in RNN,

Collaborator:
Is there more than one implementation of `RecurrentGradientMachine`? What are the other implementations besides the "original" one?

@@ -0,0 +1,436 @@
# A LoD-based Sequence Decoder
In tasks such as machine translation and image to text,
a **sequence decoder** is necessary to generate sequences.

Collaborator:
a [sequence decoder](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) is necessary

@@ -0,0 +1,436 @@
# A LoD-based Sequence Decoder

Collaborator:
Design: Sequence Decoder Generating LoDTensors

For example, the RNN states, candidate IDs and probabilities of beam search can be represented as `LoDTensors`;
the selected candidates' IDs in each time step can be stored in a `TensorArray`, and `Packed` into the translated sentences.

## Changing LoD's absolute offset to relative offsets

Collaborator:
This is not part of the design. This is an issue. I copy-n-pasted the content into issue #4945
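
For context on the heading quoted above (now tracked in issue #4945), here is a small illustrative Python sketch of the difference between absolute and relative LoD offsets. The example LoD values and the helper function are made up for illustration and are not part of Paddle.

```python
# A hypothetical two-level LoD in absolute offsets: the top level says the
# batch holds two sequences covering second-level sub-sequences [0, 3) and
# [3, 5); the lower level indexes rows of the underlying tensor directly.
absolute_lod = [[0, 3, 5],
                [0, 2, 5, 7, 8, 9]]

def to_relative(top_level, lower_level):
    """Re-express the lower LoD level relative to each parent sequence,
    so that every parent's offsets start counting from zero again."""
    relative = []
    for begin, end in zip(top_level[:-1], top_level[1:]):
        base = lower_level[begin]
        relative.append([off - base for off in lower_level[begin:end + 1]])
    return relative

print(to_relative(absolute_lod[0], absolute_lod[1]))
# [[0, 2, 5, 7], [0, 1, 2]]
```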

lcy-seso (Contributor) left a comment:
some simple comments first.


the first level represents that there are two sequences,
their offsets in the second-level LoD are `[0, 3)` and `[3, 5)`.

Contributor:
Should this be [3, 5) or [3, 6)? I think it is the latter.

the first level represents that there are two sequences,
their offsets in the second-level LoD are `[0, 3)` and `[3, 5)`.

The second level is the same with the relative offset example because the lower level is a tensor.

lcy-seso (Contributor), Oct 19, 2017:
"the same as" or "the same to"?

# for example
# decoder_mem.lod is
# [[0 1 3],
# [0 1 3 6]]

lcy-seso (Contributor), Oct 19, 2017:
Why is decoder_mem here a sequence? Usually, the memory of an RNN is its hidden state from the last time step. Is the memory here a TensorArray, and do we decide to memorize the hidden states of all the previous time steps? If so, is this design compatible with the memory in dynamic RNN?

But I guess maybe this memory is something different?

# its tensor content is [a1 a2 a3 a4 a5]
# which means there are 2 sentences to translate
# - the first sentence has 1 translation prefix, the offsets are [0, 1)
# - the second sentence has 2 translation prefixes, the offsets are [1, 3) and [3, 6)

Contributor:
Does this mean the memory needs to store all the unfinished prefixes? (This is required for beam search.)
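
To make the two-level LoDs quoted in this thread easier to read, here is a tiny plain-Python helper (illustrative only, not part of the design) that prints the "source sentence -> translation prefixes" breakdown:

```python
def describe_prefix_lod(lod):
    """Decode a two-level LoD of translation prefixes: lod[0] groups the
    prefixes by source sentence, lod[1] gives each prefix's element range."""
    top, low = lod
    for sent, (begin, end) in enumerate(zip(top[:-1], top[1:])):
        ranges = [(low[i], low[i + 1]) for i in range(begin, end)]
        print("sentence %d has %d prefix(es): %s" % (sent, end - begin, ranges))

# The decoder_mem LoD quoted above:
describe_prefix_lod([[0, 1, 3], [0, 1, 3, 6]])
# sentence 0 has 1 prefix(es): [(0, 1)]
# sentence 1 has 2 prefix(es): [(1, 3), (3, 6)]
```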

# the following has 2, 3, 2, 3 candidates
# the encoder_ctx_expanded's content will be
# [a1 a1 a2 a2 a3 a3 a3 a4 a4 a5 a5 a5]
encoder_ctx_expanded = pd.lod_expand(encoder_ctx, target_word)

Contributor:
The name `target_word` is confusing. In generation, we do not have target words.
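
The expansion that `pd.lod_expand` performs in the snippet above can be mimicked in plain Python. The sketch below only illustrates the semantics described in the comments; the per-element repeat counts are read off the expanded content shown in the quote, and this helper is not the operator's real implementation.

```python
def lod_expand(content, repeats):
    """Repeat each element of `content` the given number of times, mimicking
    the expansion of encoder_ctx to match the enlarged candidate batch."""
    expanded = []
    for item, count in zip(content, repeats):
        expanded.extend([item] * count)
    return expanded

# Expanding [a1 a2 a3 a4 a5] by the candidate counts implied by the quote
# reproduces [a1 a1 a2 a2 a3 a3 a3 a4 a4 a5 a5 a5].
print(lod_expand(["a1", "a2", "a3", "a4", "a5"], [2, 2, 3, 2, 3]))
```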

# which means there are 2 sentences to translate
# - the first sentence has 1 translation prefix, the offsets are [0, 1)
# - the second sentence has 2 translation prefixes, the offsets are [1, 3) and [3, 6)
# the target_word.lod is

Contributor:
The name `target_word` is really confusing for generation. In generation, we only have source words, which are to be translated into target words.

bias=None,
act=pd.activation.Softmax())
# topk_scores, a tensor, [None, k]
topk_scores, topk_ids = pd.top_k(scores)

Contributor:
This top_k is special. It is not just a simple "select the top k items" operation, but a "select the top k from each distribution to form a new batch" operation. How do we handle this?
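
As a reference point for this question, here is a plain-Python sketch of the per-prefix top-k selection the comment describes: each prefix's distribution contributes its k best next-word candidates, and the concatenated results form the enlarged batch for the next step. It only illustrates the intended semantics, not `pd.top_k` itself.

```python
import heapq

def top_k_per_row(scores, k):
    """For each row of `scores` (one softmax distribution per prefix),
    return the k best scores and the corresponding word ids."""
    topk_scores, topk_ids = [], []
    for row in scores:
        best = heapq.nlargest(k, enumerate(row), key=lambda pair: pair[1])
        topk_ids.append([word_id for word_id, _ in best])
        topk_scores.append([score for _, score in best])
    return topk_scores, topk_ids

scores = [[0.1, 0.6, 0.3],   # distribution of prefix 0
          [0.5, 0.2, 0.3]]   # distribution of prefix 1
print(top_k_per_row(scores, k=2))
# ([[0.6, 0.3], [0.5, 0.3]], [[1, 2], [0, 2]])
```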

# selected_ids is the selected candidates that will be appended to the translations
# selected_scores is the scores of the selected candidates
# generated_scores is the scores of the translations (with the candidates appended)
selected_ids, selected_scores, generated_scores = decoder.beam_search(

lcy-seso (Contributor), Oct 19, 2017:
First, I want to leave some of my questions here for our discussion.

The difficulties of implementing dynamic beam search with a set of primitive operators lie in (but may not be limited to):

  1. How to loop.
    • Maybe this is currently done by the dynamic RNN, but looping is a generally useful operation; should it be independent of the RNN?
  2. How to stop the loop (the condition operator?).
    • Samples in a batch may hit the stopping condition at different steps; as a result, the batch size changes dynamically.
    • Maybe all branches of a condition will be executed? I am not sure what the design of the current condition operator is, and have we decided to use it in beam search?
  3. How to construct the beam?
    • Dynamic expansion forms a larger batch, but we have to track all the unfinished prefixes.
    • How do we track the unfinished prefixes, and who tracks them? It seems that in the current design the decoder memory tracks this?
  4. How to shrink the beam? (A small sketch illustrating points 3 and 4 follows this list.)
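
As a concrete reference for questions 3 and 4, here is a minimal plain-Python sketch of one way to grow and then shrink the beam. The `<eos>` id, the score handling, and the data layout are assumptions for illustration, not the design in this PR.

```python
EOS = 1  # assumed end-of-sentence id, for illustration only

def grow_and_shrink(beam, topk_ids, topk_scores, beam_size, finished):
    """Expand every live prefix with its top-k candidates, then prune back
    to `beam_size` prefixes; hypotheses that just produced <eos> are moved
    to `finished`, so the effective batch shrinks as sentences complete."""
    candidates = []
    for (prefix, score), ids, scs in zip(beam, topk_ids, topk_scores):
        for word_id, word_score in zip(ids, scs):
            # Scores are assumed to be log-probabilities, so they add.
            candidates.append((prefix + [word_id], score + word_score))
    candidates.sort(key=lambda cand: cand[1], reverse=True)

    new_beam = []
    for prefix, score in candidates:
        if prefix[-1] == EOS:
            finished.append((prefix, score))   # leaves the beam
        elif len(new_beam) < beam_size:
            new_beam.append((prefix, score))   # stays live for the next step
    return new_beam
```

Keeping `finished` outside the beam is one possible answer to "who tracks the unfinished prefixes": in this sketch the caller owns both lists, whereas the design under review appears to keep that bookkeeping in the decoder memory's LoD, as question 3 notes.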

jacquesqiao (Member) left a comment:
LGTM!

Superjomn merged commit 53cb4df into PaddlePaddle:develop on Nov 9, 2017.
Superjomn deleted the design/sequence_decoder branch on November 9, 2017 at 05:32.