%%%% 17.1: Sequential Decision Problems (6 exercises, 4 labelled)
%%%% ============================================================
\begin{uexercise}[mdp-model-exercise]%
For the \(4\stimes 3\) world shown in
\figref{sequential-decision-world-figure}, calculate which squares
can be reached from (1,1) by the action sequence \([\J{Up},\J{Up},\J{Right},\J{Right},\J{Right}]\) and with what
probabilities. Explain how this computation is related to the prediction task (see \secref{general-filtering-section}) for a hidden Markov model.
\end{uexercise}
% id=17.0 section=17.1
\begin{iexercise}[mdp-model-exercise]%
For the \(4\stimes 3\) world shown in
\figref{sequential-decision-world-figure}, calculate which squares
can be reached from (1,1) by the action sequence \([\J{Right},\J{Right},\J{Right},\J{Up},\J{Up}]\) and with what
probabilities. Explain how this computation is related to the prediction task (see \secref{general-filtering-section}) for a hidden Markov model.
\end{iexercise}
% id=17.0 section=17.1
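A minimal Python sketch of this prediction computation, assuming the standard transition model of the \(4\stimes 3\) world (80\% intended direction, 10\% for each perpendicular direction, blocked moves leave the agent in place) and, as a simplifying assumption, treating the terminal states as absorbing; the action sequence of the first version is used and can be replaced by the other.
\begin{verbatim}
# Sketch: forward prediction of the state distribution in the 4x3 world.
# Coordinates are (column, row); (2,2) is the obstacle; (4,2) and (4,3)
# are treated as absorbing terminal states (an assumption of this sketch).
OBSTACLE = (2, 2)
TERMINALS = {(4, 2), (4, 3)}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4)
          if (x, y) != OBSTACLE]

MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERP = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
        'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def move(state, action):
    """Deterministic move, staying put if the target is off-grid or blocked."""
    dx, dy = MOVES[action]
    target = (state[0] + dx, state[1] + dy)
    return target if target in STATES else state

def predict(belief, action):
    """One prediction step: P(X_{t+1}) = sum_x P(X_{t+1} | x, a) P(x)."""
    new_belief = {s: 0.0 for s in STATES}
    for s, p in belief.items():
        if p == 0.0:
            continue
        if s in TERMINALS:                 # absorbing terminals (assumption)
            new_belief[s] += p
            continue
        outcomes = [(move(s, action), 0.8)] + \
                   [(move(s, a), 0.1) for a in PERP[action]]
        for s2, q in outcomes:
            new_belief[s2] += p * q
    return new_belief

belief = {s: (1.0 if s == (1, 1) else 0.0) for s in STATES}
for a in ['Up', 'Up', 'Right', 'Right', 'Right']:
    belief = predict(belief, a)
for s, p in sorted(belief.items()):
    if p > 0:
        print(s, round(p, 4))
\end{verbatim}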
\begin{uexercise}
Select a specific member of the set of policies that are optimal for
\(R(s)>0\) as shown in \figref{sequential-decision-policies-figure}(b),
and calculate the fraction of time the agent spends in each state, in the limit,
if the policy is executed forever. ({\em Hint}: Construct the state-to-state
transition probability matrix corresponding to the policy and see
\exref{markov-convergence-exercise}.)
\end{uexercise}
% id=17.1 section=17.1
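Once the state-to-state transition matrix for the chosen policy has been constructed, the limiting occupancy fractions can be approximated by power iteration. The following is a minimal sketch in which the \(2\stimes 2\) matrix is a hypothetical placeholder to be replaced by the matrix induced by the selected policy.
\begin{verbatim}
# Sketch: limiting state-occupancy fractions for a fixed policy, given its
# row-stochastic state-to-state transition matrix P.  The matrix below is
# a placeholder; substitute the matrix induced by the chosen policy.
import numpy as np

P = np.array([[0.9, 0.1],        # placeholder transition matrix
              [0.2, 0.8]])

dist = np.full(P.shape[0], 1.0 / P.shape[0])   # start from uniform occupancy
for _ in range(10000):                         # power iteration: dist <- dist P
    dist = dist @ P
print(dist)                                    # approximate limiting fractions
\end{verbatim}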
\begin{exercise}[nonseparable-exercise]%
Suppose that we define the utility of a state sequence to be the {\em
maximum} reward obtained in any state in the sequence. Show that this
utility function does not result in stationary preferences between
state sequences. Is it still possible to define a utility function on
states such that MEU decision making gives optimal behavior?
\end{exercise}
% id=17.2 section=17.1
\begin{iexercise}
Can any finite search problem be translated exactly into a Markov
decision problem such that an optimal solution of the latter is also
an optimal solution of the former? If so, explain {\em precisely} how
to translate the problem and how to translate the solution back; if
not, explain {\em precisely} why not (i.e., give a counterexample).
\end{iexercise}
% id=17.3 section=17.1
\begin{exercise}[reward-equivalence-exercise]
Sometimes MDPs are formulated with a reward function \(R(s,a)\) that
depends on the action taken or with a reward function
\(R(s,a,s')\) that also depends on the outcome state.
\begin{enumerate}
\item Write the Bellman equations for these formulations.
\item Show how an MDP with reward function \(R(s,a,s')\)
can be transformed into a different MDP with reward function \(R(s,a)\),
such that optimal policies in the new MDP correspond exactly to
optimal policies in the original MDP.
\item Now do the same to convert MDPs with \(R(s,a)\)
into MDPs with \(R(s)\).
\end{enumerate}
\end{exercise}
% id=17.5 section=17.1
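For reference, the Bellman equation used in the chapter for the state-based reward \(R(s)\) can be written, using transition probabilities \(P(s'\mid s,a)\), as
\[ U(s) \;=\; R(s) + \gamma \max_{a}\sum_{s'} P(s'\mid s,a)\,U(s')\ . \]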
\begin{exercise}[threshold-cost-exercise]%
\prgex
For the environment shown in
\figref{sequential-decision-world-figure}, find all the threshold
values for \(R(s)\) such that the optimal policy changes
when the threshold is crossed. You will need a way to calculate the
optimal policy and its value for fixed \(R(s)\). ({\em Hint}: Prove that
the value of any fixed policy varies linearly with \(R(s)\).)
\end{exercise}
% id=17.8 section=17.1
%%%% 17.2: Value Iteration (4 exercises, 3 labelled)
%%%% ===============================================
\begin{exercise}[vi-contraction-exercise]
\eqref{vi-contraction-equation} on \pgref{vi-contraction-equation} states
that the Bellman operator is a contraction.
\begin{enumerate}
\item Show that, for any functions \(f\) and \(g\),
\[ |\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|\ .\]
\item Write out an expression for \(|(B\,U_i - B\,U'_i)(s)|\)
and then apply the result from (a) to complete the proof
that the Bellman operator is a contraction.
\end{enumerate}
\end{exercise}
% id=17.7 section=17.2
\begin{exercise}
This exercise considers two-player MDPs that correspond to
zero-sum, turn-taking games like those in \chapref{game-playing-chapter}.
Let the players be \(A\) and \(B\), and let \(R(s)\) be the reward
for player \(A\) in state \(s\). (The reward for \(B\) is always equal and
opposite.)
\begin{enumerate}
\item Let \(U_A(s)\) be the utility of
state \(s\) when it is \(A\)'s turn to move in \(s\), and let \(U_B(s)\)
be the utility of
state \(s\) when it is \(B\)'s turn to move in \(s\).
All rewards and utilities
are calculated from \(A\)'s point of view (just as in a minimax game tree).
Write down Bellman equations defining \(U_A(s)\) and \(U_B(s)\).
\item Explain how to do two-player value iteration with these
equations, and define a suitable termination criterion.
\item Consider the game described in \figref{line-game4-figure} on
\pgref{line-game4-figure}. Draw the state space (rather than the game
tree), showing the moves
by \(A\) as solid lines and moves by \(B\) as dashed lines.
Mark each state with \(R(s)\). You will find it helpful to arrange the
states \((s_A,s_B)\) on a two-dimensional grid, using \(s_A\) and \(s_B\) as ``coordinates.''
\item Now apply two-player value iteration to solve this game, and
derive the optimal policy.
\end{enumerate}
\end{exercise}
% id=17.12 section=17.2
\begin{figure}[tbp]
\figboxnew{figures/grid-mdp-figure.eps}
\caption{(a) \(3\stimes 3\) world for \protect\exref{3x3-mdp-exercise}. The reward for each state is indicated. The upper right square is a terminal state. (b) \(101\stimes 3\) world for \protect\exref{101x3-mdp-exercise} (omitting 93 identical columns in the middle).
The start state has reward 0.}
\label{grid-mdp-figure}
\end{figure}
% id=17.16 section=17.2
\begin{exercise}[3x3-mdp-exercise]
Consider the \(3 \times 3\) world shown in \figref{grid-mdp-figure}(a).
The transition model is the same as in the \(4\stimes 3\) world of
\figref{sequential-decision-world-figure}: 80\% of the time the agent
goes in the direction it selects; the rest of the time it moves at
right angles to the intended direction.
Implement value iteration for this world for each value of
\(r\) below. Use discounted rewards with a discount factor of 0.99. Show
the policy obtained in each case. Explain intuitively why each value
of \(r\) leads to the corresponding policy.
\begin{enumerate}
\item \(r = -100\)
\item \(r = -3\)
\item \(r = 0\)
\item \(r = +3\)
\end{enumerate}
\end{exercise}
% id=17.17 section=17.2
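A compact value iteration sketch for grid worlds of this kind follows. The terminal reward and the value of \(r\) below are placeholders to be filled in from \figref{grid-mdp-figure}(a), and the motion model is assumed to be the same 80/10/10 model as above.
\begin{verbatim}
# Sketch: value iteration for a small grid MDP.  The nonterminal reward r
# and the terminal layout/reward are placeholders to be taken from the
# figure; the motion model is 80% intended direction, 10% for each
# perpendicular direction, with blocked moves staying put.
GAMMA = 0.99
R_NONTERMINAL = -3                     # placeholder value of r
TERMINALS = {(3, 3): +10}              # placeholder terminal square and reward
STATES = [(x, y) for x in range(1, 4) for y in range(1, 4)]

MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERP = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
        'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def step(s, a):
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    return [(step(s, a), 0.8)] + [(step(s, b), 0.1) for b in PERP[a]]

def reward(s):
    return TERMINALS.get(s, R_NONTERMINAL)

U = {s: 0.0 for s in STATES}
for _ in range(1000):                  # fixed number of sweeps for simplicity
    U = {s: reward(s) if s in TERMINALS else
            reward(s) + GAMMA * max(sum(p * U[s2]
                                        for s2, p in transitions(s, a))
                                    for a in MOVES)
         for s in STATES}

policy = {s: max(MOVES, key=lambda a: sum(p * U[s2]
                                          for s2, p in transitions(s, a)))
          for s in STATES if s not in TERMINALS}
print(U)
print(policy)
\end{verbatim}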
\begin{exercise}[101x3-mdp-exercise]
Consider the \(101 \times 3\) world shown in \figref{grid-mdp-figure}(b).
In the start state the agent has a choice of
two deterministic actions, {\em Up} or {\em Down}, but in the other
states the agent has one deterministic action, {\em Right}. Assuming a discounted
reward function, for what values of the discount \(\gamma\) should the
agent choose {\em Up} and for which {\em Down}? Compute the utility of
each action as a function of \(\gamma\). (Note that this simple
example actually reflects many real-world situations in which one must
weigh the value of an immediate action versus the potential continual
long-term consequences, such as choosing to dump pollutants into a
lake.)
\end{exercise}
% id=17.18 section=17.2
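In deriving the utilities as functions of \(\gamma\), the discounted value of a run of \(n\) consecutive steps with constant reward \(r\), beginning \(k\) steps in the future, is given by the geometric sum
\[ \sum_{t=k}^{k+n-1} \gamma^{t}\, r \;=\; r\,\gamma^{k}\,\frac{1-\gamma^{n}}{1-\gamma}\ . \]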
%%%% 17.3: Policy Iteration (3 exercises, 1 labelled)
%%%% ================================================
\begin{exercise}
Consider an undiscounted MDP having three states, (1, 2, 3), with
rewards \(-1\), \(-2\), \(0\), respectively. State 3 is a terminal state. In
states 1 and 2 there are two possible actions: \(a\) and~\(b\). The
transition model is as follows:
\begin{itemize}
\item In state 1, action \(a\) moves the agent to state 2 with probability 0.8
and makes the agent stay put with probability 0.2.
\item In state 2, action \(a\) moves the agent to state 1 with probability 0.8
and makes the agent stay put with probability 0.2.
\item In either state 1 or state 2, action \(b\) moves the agent to state 3 with
probability 0.1 and makes the agent stay put with probability 0.9.
\end{itemize}
Answer the following questions:
\begin{enumerate}
\item What can be determined {\em qualitatively} about the optimal
policy in states 1 and 2?
\item Apply policy iteration, showing each step in full,
to determine the optimal policy and the values of states 1 and 2.
Assume that the initial policy has action \(b\) in both states.
\item What happens to policy iteration if the initial policy
has action \(a\) in both states? Does discounting help?
Does the optimal policy depend on the discount factor?
\end{enumerate}
\end{exercise}
% id=17.4 section=17.3
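The policy-evaluation step can be carried out by solving a small linear system over the nonterminal states. The following is a minimal sketch for this MDP, shown evaluating the initial policy that chooses \(b\) in both states.
\begin{verbatim}
# Sketch: exact policy evaluation for the three-state MDP above.
# States 1 and 2 are nonterminal; state 3 is terminal with utility 0,
# so the (undiscounted) system is solved over states 1 and 2 only.
import numpy as np

R = {1: -1.0, 2: -2.0}
# P[(s, action)] = {s': prob} over nonterminal successors; the mass going
# to the terminal state 3 is omitted since it contributes 0 utility.
P = {(1, 'a'): {1: 0.2, 2: 0.8}, (2, 'a'): {1: 0.8, 2: 0.2},
     (1, 'b'): {1: 0.9},         (2, 'b'): {2: 0.9}}

def evaluate(policy):
    """Solve (I - P_pi) U = R for the nonterminal states 1 and 2."""
    states = [1, 2]
    A = np.eye(2)
    b = np.array([R[s] for s in states])
    for i, s in enumerate(states):
        for j, s2 in enumerate(states):
            A[i, j] -= P[(s, policy[s])].get(s2, 0.0)
    return dict(zip(states, np.linalg.solve(A, b)))

print(evaluate({1: 'b', 2: 'b'}))   # utilities under the initial policy
\end{verbatim}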
\begin{exercise}
\prgex
Consider the \(4\stimes 3\) world shown in \figref{sequential-decision-world-figure}.
\begin{enumerate}
\item Implement an environment simulator for this environment, such
that the specific geography of the environment is easily altered.
Some code for doing this is already in the online code repository.
\item Create an agent that uses policy iteration, and measure its
performance in the environment simulator from various starting
states. Perform several experiments from each starting state, and
compare the average total reward received per run with the utility of
the state, as determined by your algorithm.
\item Experiment with increasing the size of the environment. How does
the run time for policy iteration vary with the size of the environment?
%\item Analyze the policy iteration algorithm to find its worst-case
%complexity, assuming that value
%determination is done with a standard equation-solving method.
\end{enumerate}
\end{exercise}
% id=17.6 section=17.3
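A minimal sketch of such a simulator, assuming the standard geography of the \(4\stimes 3\) world (obstacle at (2,2), terminal states at (4,3) and (4,2) with rewards \(+1\) and \(-1\), step reward \(-0.04\)) and the usual stochastic motion model. It credits the reward of the state entered, a simple convention chosen for this sketch, and keeps the geography in plain data so it is easy to alter.
\begin{verbatim}
# Sketch of an environment simulator for the 4x3 world.
import random

class GridMDPSimulator:
    def __init__(self, width=4, height=3, obstacles={(2, 2)},
                 terminals={(4, 3): +1.0, (4, 2): -1.0}, step_reward=-0.04):
        self.states = {(x, y) for x in range(1, width + 1)
                       for y in range(1, height + 1)} - set(obstacles)
        self.terminals, self.step_reward = terminals, step_reward
        self.moves = {'Up': (0, 1), 'Down': (0, -1),
                      'Left': (-1, 0), 'Right': (1, 0)}
        self.perp = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                     'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

    def _move(self, s, a):
        nxt = (s[0] + self.moves[a][0], s[1] + self.moves[a][1])
        return nxt if nxt in self.states else s

    def step(self, s, a):
        """Return (next_state, reward, done) for executing action a in state s."""
        if s in self.terminals:
            return s, 0.0, True
        roll = random.random()          # 0.8 intended, 0.1 each perpendicular
        actual = a if roll < 0.8 else \
            self.perp[a][0] if roll < 0.9 else self.perp[a][1]
        s2 = self._move(s, actual)
        reward = self.terminals.get(s2, self.step_reward)
        return s2, reward, s2 in self.terminals

sim = GridMDPSimulator()
print(sim.step((1, 1), 'Up'))
\end{verbatim}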
\begin{exercise}[policy-loss-exercise]%
How can the value\index{value determination} determination algorithm
be used to calculate the expected loss experienced by an agent using a
given set of utility estimates \var{U} and an estimated model \var{P},
compared with an agent using correct values?
\end{exercise}
% id=21.4 section=17.3
%%%% 17.4: Partially observable MDPs (3 exercises, 2 labelled)
%%%% =========================================================
\begin{exercise}[4x3-pomdp-exercise]
Let the initial belief state \(b_0\) for the \(4\stimes 3\) POMDP on \pgref{4x3-pomdp-page} be
the uniform distribution over the nonterminal states, i.e.,
\(\<\frac{1}{9},\frac{1}{9},\frac{1}{9},\frac{1}{9},\frac{1}{9},\frac{1}{9},\frac{1}{9},\frac{1}{9},\frac{1}{9},0,0\>\).
Calculate the exact belief state \(b_1\) after the agent moves \act{Left} and its sensor reports 1 adjacent wall.
Also calculate \(b_2\) assuming that the same thing happens again.
\end{exercise}
% id=17.9 section=17.4
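For reference, the belief-update (filtering) step used in this computation has the form
\[ b'(s') \;=\; \alpha\, P(e\mid s') \sum_{s} P(s'\mid s,a)\, b(s)\ , \]
where \(e\) is the percept received and \(\alpha\) normalizes the belief state to sum to 1.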
\begin{exercise}
What is the time complexity of \(d\) steps of POMDP value iteration for a sensorless environment?
\end{exercise}
% id=17.10 section=17.4
\begin{exercise}[2state-pomdp-exercise]
Consider a version of the two-state POMDP on \pgref{2state-pomdp-page} in which
the sensor is 90\% reliable in state 0 but provides no information in state 1 (that is, it reports 0 or 1 with equal probability).
Analyze, either qualitatively or quantitatively, the utility function and the optimal policy for this problem.
\end{exercise}
% id=17.11 section=17.4
%%%% 17.5: Decisions with Multiple Agents: Game Theory (5 exercises, 1 labelled)
%%%% ===========================================================================
\begin{exercise}[dominant-equilibrium-exercise]%
Show that a dominant strategy equilibrium is a Nash equilibrium,
but not vice versa.%
\end{exercise}%
% id=17.13 section=17.5
\begin{uexercise}
In the children's game of rock--paper--scissors, each player
simultaneously reveals a choice of rock, paper, or scissors. Paper
wraps rock, rock blunts scissors, and scissors cut paper. In the
extended version rock--paper--scissors--fire--water,
fire beats rock, paper, and scissors; rock, paper, and scissors
beat water; and water beats fire. Write out the payoff matrix and
find a mixed-strategy solution to this game.
\end{uexercise}
% id=17.14 section=17.5
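One way to obtain a mixed-strategy solution is to solve the row player's maximin linear program. The sketch below is one possible setup using numpy and scipy's linprog, with the payoff matrix (\(+1\) win, \(-1\) loss, 0 draw) filled in from the rules stated in the exercise.
\begin{verbatim}
# Sketch: maximin mixed strategy of a zero-sum game via linear programming.
import numpy as np
from scipy.optimize import linprog

moves = ['rock', 'paper', 'scissors', 'fire', 'water']
beats = {('paper', 'rock'), ('rock', 'scissors'), ('scissors', 'paper'),
         ('fire', 'rock'), ('fire', 'paper'), ('fire', 'scissors'),
         ('rock', 'water'), ('paper', 'water'), ('scissors', 'water'),
         ('water', 'fire')}
A = np.array([[1 if (r, c) in beats else -1 if (c, r) in beats else 0
               for c in moves] for r in moves])

# Variables: mixed strategy x (length 5) and game value v.
# Maximize v subject to (A^T x)_j >= v for every opponent pure strategy j,
# sum(x) = 1, x >= 0.
n = len(moves)
c = np.zeros(n + 1)
c[-1] = -1.0                                   # linprog minimizes, so use -v
A_ub = np.hstack([-A.T, np.ones((n, 1))])      # v - (A^T x)_j <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0, None)] * n + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(dict(zip(moves, np.round(res.x[:n], 3))), 'value:', round(res.x[-1], 3))
\end{verbatim}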
\begin{iexercise}
Solve the game of {\em three}-finger Morra.
\end{iexercise}
% id=17.15 section=17.5
\begin{iexercise}
In the {\em Prisoner's Dilemma}\index{prisoner's dilemma}, consider
the case where after each round, Alice and Bob have probability \(X\)
of meeting again. Suppose both players choose the perpetual punishment
strategy (where each will choose \(\J{refuse}\) unless the other player has
ever played \(\J{testify}\)). Assume neither player has played \(\J{testify}\) thus
far. What is the expected future total payoff for choosing to \(\J{testify}\)
versus \(\J{refuse}\) when \(X = .2\)? How about when \(X = .05\)? For what value of
\(X\) is the expected future total payoff the same whether one chooses to
\(\J{testify}\) or \(\J{refuse}\) in the current round?
\end{iexercise}
% id=17.19 section=17.5
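In setting up these expectations it may help to note that, when play continues for another round with probability \(X\), a payoff of \(r\) per round, received in the current round and in every subsequent round that is reached, has expected total value
\[ \sum_{t=0}^{\infty} X^{t}\, r \;=\; \frac{r}{1-X}\ . \]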
\begin{uexercise}
The following payoff matrix, from \citeA{Blinder:1983} by way of
\citeA{Bernstein:1996}, shows a game between politicians and the
Federal Reserve.
\medskip
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Fed: contract & Fed: do nothing & Fed: expand \\
\hline
Pol: contract & \(F=7, P=1\) & \(F=9,P=4\) & \(F=6,P=6\) \\
\hline
Pol: do nothing & \(F=8, P=2\) & \(F=5,P=5\) & \(F=4,P=9\) \\
\hline
Pol: expand & \(F=3, P=3\) & \(F=2,P=7\) & \(F=1,P=8\) \\
\hline
\end{tabular}
\end{center}
\medskip
Politicians can expand or contract fiscal policy, while the Fed can
expand or contract monetary policy. (And of course either side can
choose to do nothing.) Each side also has preferences for who should do
what---neither side wants to look like the bad guys. The payoffs
shown are simply the rank orderings: 9 for first choice through 1
for last choice. Find the Nash equilibrium of the game in pure
strategies. Is this a Pareto-optimal solution? You might wish to
analyze the policies of recent administrations in this light.
\end{uexercise}
% id=17.23 section=17.5
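Pure-strategy Nash equilibria can be found mechanically by checking every cell of the table for profitable unilateral deviations; the following minimal sketch encodes the payoffs above and performs that check.
\begin{verbatim}
# Sketch: brute-force search for pure-strategy Nash equilibria of the
# Fed/politicians game.  payoffs[(pol, fed)] = (F, P) as in the table.
actions = ['contract', 'do nothing', 'expand']
payoffs = {
    ('contract', 'contract'): (7, 1), ('contract', 'do nothing'): (9, 4),
    ('contract', 'expand'): (6, 6),
    ('do nothing', 'contract'): (8, 2), ('do nothing', 'do nothing'): (5, 5),
    ('do nothing', 'expand'): (4, 9),
    ('expand', 'contract'): (3, 3), ('expand', 'do nothing'): (2, 7),
    ('expand', 'expand'): (1, 8),
}

def is_nash(pol, fed):
    F, P = payoffs[(pol, fed)]
    pol_ok = all(payoffs[(p2, fed)][1] <= P for p2 in actions)  # no better Pol move
    fed_ok = all(payoffs[(pol, f2)][0] <= F for f2 in actions)  # no better Fed move
    return pol_ok and fed_ok

print([(pol, fed) for pol in actions for fed in actions if is_nash(pol, fed)])
\end{verbatim}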
%%%% 17.6: Mechanism Design (3 exercises, 0 labelled)
%%%% ================================================
\begin{exercise}
A Dutch auction\index{auction!Dutch} is similar to an English
auction, but rather than starting the bidding at a low price and
increasing it, in a Dutch auction the seller starts at a high price
and gradually lowers the price until some buyer is willing to accept
that price. (If multiple bidders accept the price, one is
arbitrarily chosen as the winner.) More formally, the seller begins
with a price \(p\) and gradually lowers \(p\) by increments of
\(d\) until at least one buyer accepts the price. Assuming all bidders
act rationally, is it true that for arbitrarily small \(d\), a Dutch
auction will always result in the bidder with the highest value for
the item obtaining the item? If so, show mathematically why. If not,
explain how it may be possible for the bidder with the highest value for
the item not to obtain it.
\end{exercise}
% id=17.20 section=17.6
\begin{exercise}
Imagine an auction mechanism that is just like an ascending-bid
auction, except that at the end, the winning bidder, the one who bid
\(b_{max}\), pays only \(b_{max}/2\) rather than \(b_{max}\). Assuming
all agents are rational, what is the expected revenue to the
auctioneer for this mechanism, compared with a standard ascending-bid auction?
\end{exercise}
% id=17.21 section=17.6
\begin{exercise}
Teams in the National Hockey League historically received 2 points for
winning a game and 0 for losing. If the game is tied, an overtime
period is played; if nobody wins in overtime, the game is a tie and
each team gets 1 point. But league officials felt that teams were
playing too conservatively in overtime (to avoid a loss), and it would
be more exciting if overtime produced a winner. So in 1999 the officials
experimented in mechanism design: the rules
were changed, giving a team that loses in overtime 1 point, not 0. It
is still 2 points for a win and 1 for a tie.
\begin{enumerate}
\item Was hockey a zero-sum game before the rule change? After?
\item Suppose that at a certain time \(t\) in a game, the home team
has probability \(p\) of winning in regulation time, probability
\(0.78-p\) of losing, and probability 0.22 of going into overtime,
where they have probability \(q\) of winning, \(0.9-q\) of losing,
and 0.1 of tying. Give equations for the expected value for the home and visiting
teams.
\item Imagine that it were legal and ethical for the two teams to
enter into a pact where they agree that they will skate to a tie in
regulation time, and then both try in earnest to win in overtime. Under
what conditions, in terms of \(p\) and \(q\), would it be rational for
both teams to agree to this pact?
\item \citeA{Longley+Sankaran:2005} report that since the rule change,
the percentage of games with a winner in overtime went up 18.2\%, as desired, but
the percentage of overtime games also went up 3.6\%. What does that suggest
about possible collusion or conservative play after the rule change?
\end{enumerate}
\end{exercise}
% id=17.22 section=17.6