MILABOT : A Deep Reinforcement Learning Chatbot

paper : https://arxiv.org/abs/1709.02349

Summary

2017 Amazon Alexa Prize competition에 참가한 팀. 22개의 response model로 응답을 만든 뒤, response selection policy로 가장 적절한 응답을 선택했다. 응답 선택 모델은 AMT에서 수집한 데이터로 학습한 MLP, REINFORCE을 변형한 Off-policy REINFORCE, Abstract Discourse MDP를 이용한 Q-learning 등을 실험했다.

https://www.youtube.com/watch?v=aUwJKCMdqmo

Amazon Alexa Prize competition

Process
- Sponsored & unsponsored teams
- First customer feedback period
- Semifinals
- Final customer feedback period
2017 result : https://developer.amazon.com/alexaprize/2017-alexa-prize
2018 competition began : https://developer.amazon.com/alexaprize

Response Models

Template

Alicebot : www.alicebot.org API
Elizabot : Eliza system, designed to mimic a Rogerian psychotherapist.
Initiatorbot : 40 question phrases. "What did you do today?", "Do you have pets?", "Did you know that ?"
Storybot : if a user requests a story
- "Alright, let me tell you the story <story_title> <story_body> by <story_author>".
- request word (e.g. say, tell.) and story-type word in the utterance (e.g. story, tale)
- www.english-for-students.com

Knowledge Base-based Question Answering

Evibot : Amazon’s question-answering web-service Evi: www.evi.com. handle factual questions.
BoWMovies : template-based response model, which handles questions in the movie domain.
- entity names and tags (e.g. movie plot and release year)
- uses word embeddings to match tags

Retrieval-based Neural Networks

VHRED models : A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
- 6 variants
SkipThought Vector Models
- semantic relatedness score
- compares the user’s last utterance to a set of trigger phrases
- 315 trigger phrases
- 35 response sets.
- If the model did not find a match in the first step, the model selects its response from among all Reddit dataset responses
Dual Encoder Models
- two sequence encoders ENC_Q and ENC_R
- encode the dialogue history and a candidate response
- 2 variants
Bag-of-words Retrieval Models
- bag-of-words retrieval models based on TF-IDF Glove word embeddings and Word2Vec embeddings
- retrieve the response with the highest cosine similarity.
- BoWWashingtonPost, BoWTrump, BoWFactGenerator, BoWGameofThrones

Retrieval-based Logistic Regression

35 topic-independent, generic pre-defined responses
logistic regression classifier to select its response based on a set of higher-level features.
"Could you repeat that again", "I don’t know" and "Was that a question?"

Search Engine-based Neural Networks

LSTMClassifierMSMarco
- searches the web
- retrieves the first 10 search snippets
- bidirectional LSTM to embed the last dialogue utterance and the snippet
- concatenated and passed through an MLP
- predict a scalar-value between 0 − 1 indicating how appropriate the snippet is as a response
- classifier trained on Microsoft Marco dataset

Generation-based Neural Networks

GRUQuestion- Generator
- two GRU layers
- generate follow-up questions

reponses

statistics

Model Selection Policy

Action-value Parametrization

action-value

Stochastic Policy Parametrization

policy

both scoring model Q, f have the same parametrization

Input Features

1458 features based on the given dialogue history

combination of word embeddings : Average of the word embeddings
dialogue acts
part-of-speech tags
unigram word overlap : 1.0 when one or more non-stop-words overlap between candidate response and last user utterance
bigram word overlap
model-specific features
No RNN

Scoring model’s architecture

scoring model

Supervised AMT: Learning with Crowdsourced Labels

Amazon Mechanical Turk (AMT)
collected 199, 678 labels.

Supervised Learned Reward: Learning with a Learned Reward Function

not clear whether AMt score is correlated with scores given by real-world Alexa users
learning to predict the Alexa user scores based on previously recorded dialogues.
linear regression
23 features
- AMT label class
- Generic response
- Response length
- Dialogue act (request, a question, a statement or contains profanity)
- Sentiment class
- Generic user utterance
- User utterance length
- Confusion indicator
- Dialogue length
As training data is scarce, we use only higher-level features
4340 dialogues

first initialize the model with the parameters of the Supervised AMT scoring model, and then fine-tune it with the reward model outputs

Off-policy REINFORCE

off-policy

off

Abstract Discourse Markov Decision Process

mdp

Z = ZDialogue act × ZUser sentiment × ZGeneric user utterance
Dialogue act : {Accept, Reject, Request, Politics, Generic Question, Personal Question, Statement, Greeting, Goodbye,Other}.
The transition distribution is parametrized by three independent two-layer MLP models, which take as input the same features as the scoring function
has the potential to learn to take into account future states of the dialogue when selecting its action. This is in contrast to policies learned using supervised learning, which do not consider future dialogue states.

Preliminary Evaluation

AMT Evaluation

amt