
Commit

Fixed the typo
ZuzooVn committed Nov 17, 2016
1 parent aece353 commit 2d6963e
Showing 8 changed files with 31 additions and 31 deletions.
4 changes: 2 additions & 2 deletions DP/README.md
@@ -11,11 +11,11 @@
### Summary

- Dynamic Programming (DP) methods assume that we have a perfect model of the environment's Markov Decision Process (MDP). That's usually not the case in practice, but it's important to study DP anyway.
- - Policy Evaluation: Calculates the state-value function `V(s)` for a given policy. In DP this is done using a "full backup". At each state we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment.
+ - Policy Evaluation: Calculates the state-value function `V(s)` for a given policy. In DP this is done using a "full backup". At each state, we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment.
- Full backups are basically the Bellman equations turned into updates.
- Policy Improvement: Given the correct state-value function for a policy we can act greedily with respect to it (i.e. pick the best action at each state). Then we are guaranteed to improve the policy or keep it fixed if it's already optimal.
- Policy Iteration: Iteratively perform Policy Evaluation and Policy Improvement until we reach the optimal policy.
- - Value Iteration: Instead of doing multiple steps of Policy Evaluation to find the "correct" V(s) we only do a single step and improve the policy immediately. In practice this converges faster.
+ - Value Iteration: Instead of doing multiple steps of Policy Evaluation to find the "correct" V(s) we only do a single step and improve the policy immediately. In practice, this converges faster.
- Generalized Policy Iteration: The process of iteratively doing policy evaluation and improvement. We can pick different algorithms for each of these steps but the basic idea stays the same.
- DP methods bootstrap: They update estimates based on other estimates (one step ahead).
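
To make the "full backup" update in the summary above concrete, here is a minimal sketch of iterative policy evaluation on a made-up two-state MDP. The transition table `P`, the uniform random `policy`, and the constants are illustrative assumptions, not the repo's GridWorld code:

```python
# Minimal sketch of iterative policy evaluation ("full backups") on a toy MDP.
# P[s][a] is a list of (probability, next_state, reward, done) transitions.
import numpy as np

P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 2.0, True)]},
}
policy = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi(a|s): uniform random policy
gamma, theta = 0.9, 1e-8

V = np.zeros(len(P))
while True:
    delta = 0.0
    for s in P:
        # Bellman expectation equation turned into an update:
        # expectation over actions, next states and rewards
        v_new = sum(policy[s][a] * sum(prob * (r + gamma * V[s2] * (not done))
                                       for prob, s2, r, done in P[s][a])
                    for a in P[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break
print("V under the random policy:", V)
```

Value Iteration would replace the expectation over actions with a `max` over actions, collapsing evaluation and greedy improvement into a single update per state.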

4 changes: 2 additions & 2 deletions DQN/README.md
@@ -12,9 +12,9 @@

- DQN: Q-Learning but with a Deep Neural Network as a function approximator.
- Using a non-linear Deep Neural Network is powerful, but training is unstable if we apply it naively.
- - Trick 1 - Experience Replay: Store experience `(S, A, R, S_next)` in a replay buffer and sample minibatches from it to train the network. This decorrelates the data and leads to better data efficiency. In the beginning the replay buffer is filled with random experience.
+ - Trick 1 - Experience Replay: Store experience `(S, A, R, S_next)` in a replay buffer and sample minibatches from it to train the network. This decorrelates the data and leads to better data efficiency. In the beginning, the replay buffer is filled with random experience.
- Trick 2 - Target Network: Use a separate network to estimate the TD target. This target network has the same architecture as the function approximator but with frozen parameters. Every T steps (a hyperparameter) the parameters from the Q network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while).
- - By using a Convolutional Neural Network as the function approximator on raw pixels of Atari games where the score is the reward we can learn to play many of those games at human-like performance.
+ - By using a Convolutional Neural Network as the function approximator on raw pixels of Atari games where the score is the reward we can learn to play many of those games at the human-like performance.
- Double DQN: Just like regular Q-Learning, DQN tends to overestimate values due to its max operation applied to both selecting and estimating actions. We get around this by using the Q network for selection and the target network for estimation when making updates.
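
As a schematic illustration of the two tricks in the summary above, the sketch below keeps a replay buffer and a periodically synced target network around a deliberately simple linear Q-function. The repo's actual solutions use a TensorFlow network; all sizes, hyperparameters, and the assumption that states are NumPy feature vectors are made-up placeholders:

```python
# Schematic DQN pieces: experience replay + target network,
# shown with a linear Q-function so the update logic stays visible.
import random
from collections import deque
import numpy as np

n_features, n_actions = 4, 2
gamma, lr, batch_size, target_update_every = 0.99, 0.01, 32, 100

q_weights = np.zeros((n_features, n_actions))   # online Q "network"
target_weights = q_weights.copy()               # Trick 2: frozen copy for TD targets
replay_buffer = deque(maxlen=10_000)            # Trick 1: experience replay

def q_values(weights, state):
    return state @ weights                      # Q(s, .) for all actions; states are NumPy vectors

def train_step(step):
    global target_weights
    if len(replay_buffer) < batch_size:
        return                                  # buffer is pre-filled with random experience
    for s, a, r, s_next, done in random.sample(replay_buffer, batch_size):
        # TD target comes from the *target* network; max over next actions
        td_target = r if done else r + gamma * np.max(q_values(target_weights, s_next))
        td_error = td_target - q_values(q_weights, s)[a]
        q_weights[:, a] += lr * td_error * s    # semi-gradient update of the online weights
    if step % target_update_every == 0:
        target_weights = q_weights.copy()       # copy parameters every T steps
```

During interaction you would append each transition `(S, A, R, S_next, done)` to `replay_buffer` and call `train_step(step)` once per environment step; Double DQN would instead pick the argmax action with `q_weights` and evaluate it with `target_weights`.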


4 changes: 2 additions & 2 deletions Introduction/README.md
@@ -7,10 +7,10 @@

### Summary

- - Reinforcement Learning (RL) is concered with goal-directed learning and decison-making.
+ - Reinforcement Learning (RL) is concerned with goal-directed learning and decision-making.
- In RL an agent learns from experiences it gains by interacting with the environment. In Supervised Learning we cannot affect the environment.
- In RL rewards are often delayed in time and the agent tries to maximize a long-term goal. For example, one may need to make seemingly suboptimal moves to reach a winning position in a game.
- - An agents interacts with the environment via states, actions and rewards.
+ - An agent interacts with the environment via states, actions and rewards.
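
A minimal version of that interaction loop, assuming the classic OpenAI Gym API of the time (`env.reset()` returning a state and `env.step()` returning a 4-tuple) and a random placeholder policy, might look like this sketch:

```python
# Sketch of the agent-environment loop: states, actions and rewards.
import gym

env = gym.make("CartPole-v0")                 # any episodic Gym environment would do
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()        # placeholder: a real agent would consult its policy
    state, reward, done, info = env.step(action)  # environment returns S_{t+1} and R_{t+1}
    total_reward += reward
print("Episode return:", total_reward)
```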


### Lectures & Readings
10 changes: 5 additions & 5 deletions MC/README.md
@@ -13,13 +13,13 @@

### Summary

- - Dynamic Programming approaches assume complete knowledge of the environment (the MDP). In practice we often don't have full knowledge of how the world works.
+ - Dynamic Programming approaches assume complete knowledge of the environment (the MDP). In practice, we often don't have full knowledge of how the world works.
- Monte Carlo (MC) methods can learn directly from experience collected by interacting with the environment. An episode of experience is a series of `(State, Action, Reward, Next State)` tuples.
- MC methods work based on episodes. We sample episodes of experience and make updates to our estimates at the end of each episode. MC methods have high variance (due to lots of random decisions within an episode) but are unbiased.
- - MC Policy Evaluation: Given a policy, we want to estimate the state-value function V(s). Sample episodes of experience and estimate V(s) to be the reward received from that state onwards averaged across all of your experience. The same technique works for the action-value function Q(s, a). Given enough samples this is proven to converge.
+ - MC Policy Evaluation: Given a policy, we want to estimate the state-value function V(s). Sample episodes of experience and estimate V(s) to be the reward received from that state onwards averaged across all of your experience. The same technique works for the action-value function Q(s, a). Given enough samples, this is proven to converge.
- MC Control: Idea is the same as for Dynamic Programming. Use MC Policy Evaluation to evaluate the current policy then improve the policy greedily. The Problem: How do we ensure that we explore all states if we don't know the full environment?
- Solution to exploration problem: Use epsilon-greedy policies instead of full greedy policies. When making a decision act randomly with probability epsilon. This will learn the optimal epsilon-greedy policy.
- - Off-Policy Learning: How can we learn about the actual optimal (greedy) policy while following an exploratory (epsilon greedy) policy? We can use importance sampling, which weighs returns by their probability of occuring under the policy we want to learn about.
+ - Off-Policy Learning: How can we learn about the actual optimal (greedy) policy while following an exploratory (epsilon-greedy) policy? We can use importance sampling, which weighs returns by their probability of occurring under the policy we want to learn about.
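
As a sketch of MC policy evaluation, the function below averages returns after the first visit to each state. It assumes the classic Gym step API, hashable states (such as the tuples used by Blackjack), and a user-supplied `policy(state) -> action`; it is not the repo's notebook code:

```python
# First-visit Monte Carlo prediction: V(s) = average return observed
# after the first visit to s in each sampled episode.
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        # Roll out one episode as a list of (state, reward) pairs
        episode, state, done = [], env.reset(), False
        while not done:
            next_state, reward, done, _ = env.step(policy(state))
            episode.append((state, reward))
            state = next_state
        # Index of the first visit to each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards, accumulating the return G from each time step
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

The same structure works for Q(s, a) by keying the averages on (state, action) pairs, and epsilon-greedy control simply re-derives the policy from Q between episodes.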


### Lectures & Readings
@@ -37,13 +37,13 @@

### Exercises

- - [Get familar with the Blackjack environment (Blackjack-v0)](Blackjack Playground.ipynb)
+ - [Get familiar with the Blackjack environment (Blackjack-v0)](Blackjack Playground.ipynb)
- Implement the Monte Carlo Prediction to estimate state-action values
- [Exercise](MC Prediction.ipynb)
- [Solution](MC Prediction Solution.ipynb)
- Implement the on-policy first-visit Monte Carlo Control algorithm
- [Exercise](MC Control with Epsilon-Greedy Policies.ipynb)
- [Solution](MC Control with Epsilon-Greedy Policies Solution.ipynb)
- - Implement the off-policy every-visit Monte Carlo Control using Weighted Important Sampliing algorithm
+ - Implement the off-policy every-visit Monte Carlo Control using Weighted Important Sampling algorithm
- [Exercise](Off-Policy MC Control with Weighted Importance Sampling.ipynb)
- [Solution](Off-Policy MC Control with Weighted Importance Sampling Solution.ipynb)
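
For the first exercise, a quick way to poke at the Blackjack environment is the snippet below. It assumes Gym's `Blackjack-v0`, where (to the best of my recollection, worth verifying against your installed version) observations are tuples `(player sum, dealer showing card, usable ace)` and actions are `0 = stick`, `1 = hit`:

```python
# Playground sketch: play one hand of Blackjack with a naive threshold policy.
import gym

env = gym.make("Blackjack-v0")
obs, done = env.reset(), False
print("initial observation:", obs)
while not done:
    action = 1 if obs[0] < 20 else 0          # hit until the player sum reaches 20
    obs, reward, done, _ = env.step(action)
print("final observation:", obs, "reward:", reward)
```
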
6 changes: 3 additions & 3 deletions MDP/README.md
@@ -13,11 +13,11 @@
- Agent & Environment Interface: At each step `t` the agent receives a state `S_t`, performs an action `A_t` and receives a reward `R_{t+1}`. The action is chosen according to a policy function `pi`.
- The total return `G_t` is the sum of all rewards starting from time t . Future rewards are discounted at a discount rate `gamma^k`.
- Markov property: The environment's response at time `t+1` depends only on the state and action representations at time `t`. The future is independent of the past given the present. Even if an environment doesn't fully satisfy the Markov property we still treat it as if it is and try to construct the state representation to be approximately Markov.
- - Markov Decision Process (MDP): Defined by a state set S, action set A and one-step dynamics `p(s',r | s,a)`. If we have complete knowledge of the environment we know the transition dynamic. In practice we often don't know the full MDP (but we know that it's some MDP).
+ - Markov Decision Process (MDP): Defined by a state set S, action set A and one-step dynamics `p(s',r | s,a)`. If we have complete knowledge of the environment we know the transition dynamic. In practice, we often don't know the full MDP (but we know that it's some MDP).
- The Value Function `v(s)` estimates how "good" it is for an agent to be in a particular state. More formally, it's the expected return `G_t` given that the agent is in state `s`. `v(s) = Ex[G_t | S_t = s]`. Note that the value function is specific to a given policy `pi`.
- - Action Value function: q(s, a) estimates how "good" it is for an agent to be in state s and take action a. Similar to the value function, but also considers the action.
+ - Action Value function: q(s, a) estimates how "good" it is for an agent to be in states and take action a. Similar to the value function, but also considers the action.
- The Bellman equation expresses the relationship between the value of a state and the values of its successor states. It can be expressed using a "backup" diagram. Bellman equations exist for both the value function and the action value function.
- - Value functions define an ordering over policies. A policy `p1` is better than `p2` if `v_p1(s) >= v_p2(s)` for all states s. For MDPs there exist one or more optimal policies that are better than or equal to all other policies.
+ - Value functions define an ordering over policies. A policy `p1` is better than `p2` if `v_p1(s) >= v_p2(s)` for all states s. For MDPs, there exist one or more optimal policies that are better than or equal to all other policies.
- The optimal state value function `v*(s)` is the value function for the optimal policy. Same for `q*(s, a)`. The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of successor states. It has a "max" instead of an average.
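
Two of the definitions above are easy to sanity-check numerically. The rewards, probabilities, and `gamma` below are made up purely for illustration:

```python
# Discounted return G_t for a short reward sequence R_{t+1}, R_{t+2}, ...
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print("G_t =", G_t)   # 1 + 0.9*0 + 0.81*2 + 0.729*1 = 3.349

# One Bellman expectation backup for v(s): average over the policy's actions
# of (immediate reward + gamma * value of the successor state).
pi     = {"a1": 0.5, "a2": 0.5}   # pi(a|s)
reward = {"a1": 1.0, "a2": 0.0}   # expected immediate reward
v_next = {"a1": 3.0, "a2": 5.0}   # value of the (deterministic) successor state
v_s = sum(pi[a] * (reward[a] + gamma * v_next[a]) for a in pi)
print("v(s) =", v_s)  # 0.5*(1 + 2.7) + 0.5*(0 + 4.5) = 4.1
```

The Bellman optimality backup for `v*(s)` would take the `max` over the two actions instead of the `pi`-weighted average, giving `max(3.7, 4.5) = 4.5` here.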


28 changes: 14 additions & 14 deletions PolicyGradient/README.md
@@ -3,12 +3,12 @@

### Learning Goals

- - Understand the difference between value-based and policy-based Reinforcement LEarning
+ - Understand the difference between value-based and policy-based Reinforcement Learning
- Understand the REINFORCE Algorithm (Monte Carlo Policy Gradient)
- Understand Actor-Critic (AC) algorithms
- Understand Advantage Functions
- Understand Deterministic Policy Gradients (Optional)
- - Understand how to scale up Policy Gradient methods using asynchronous actor critic and Neural Networks (Optional)
+ - Understand how to scale up Policy Gradient methods using asynchronous actor-critic and Neural Networks (Optional)


### Summary
@@ -17,15 +17,15 @@
- Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
- Policy Score Function `J(theta)`: Intuitively, it measures how good our policy is. For example, we can use the average value or average reward under a policy as our objective.
- Common choices for the policy function: Softmax for discrete actions, Gaussian parameters for continuous actions.
- - Policy Gradient Theorem: `grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy into a direction of more reward.
- - REINFORCE (Monte Carlo Policy Gradient): We substitute a samples return `g_t` form an episode for Q(s, a) to make an update. Unbiased but high variance.
- - Baseline: Instead of measuring the absolute goodness of an action we want to know how much better than "average" it is to take an action given a state. E.g. some states are naturally bad and always give negative reward. This is called the advantage and is defined as `Q(s, a) - V(s)`. We use that for our policy update, e.g. `g_t - V(s)` for REINFORCE.
- - Actor Critic: Instead of waiting until the end of an episode as in REINFORCE we use bootstrapping and make an update at each step. To do that we also train a Critic Q(theta) that approximates the value function. Now we have two function approximators: One of the policy, one for the critic. This is basically TD, but for Policy Gradients.
- - A good estimate of the advantage function in the Actor Critic algorithm is the td error. Our update then becomes `grad(J(theta)) = Ex[grad(log(pi(s, a))) * td_error]`.
+ - Policy Gradient Theorem: `grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy in a direction of more reward.
+ - REINFORCE (Monte Carlo Policy Gradient): We substitute a samples return `g_t` from an episode for Q(s, a) to make an update. Unbiased but high variance.
+ - Baseline: Instead of measuring the absolute goodness of an action we want to know how much better than "average" it is to take an action given a state. E.g. some states are naturally bad and always give the negative reward. This is called the advantage and is defined as `Q(s, a) - V(s)`. We use that for our policy update, e.g. `g_t - V(s)` for REINFORCE.
+ - Actor-Critic: Instead of waiting until the end of an episode as in REINFORCE we use bootstrapping and make an update at each step. To do that we also train a Critic Q(theta) that approximates the value function. Now we have two function approximators: One of the policy, one for the critic. This is basically TD, but for Policy Gradients.
+ - A good estimate of the advantage function in the Actor-Critic algorithm is the td error. Our update then becomes `grad(J(theta)) = Ex[grad(log(pi(s, a))) * td_error]`.
- Can use policy gradients with td-lambda, eligibility traces, and so on.
- - Deterministic Policy Gradients: Useful for high-dimensional continuous action spaces where stochastic policy gradients are expensive to compute. The idea is to update the policy in the direction of the gradient of the action-value function. To ensure exploration we can use an off-policy actor critic algorithm with added noise in action selection.
+ - Deterministic Policy Gradients: Useful for high-dimensional continuous action spaces where stochastic policy gradients are expensive to compute. The idea is to update the policy in the direction of the gradient of the action-value function. To ensure exploration we can use an off-policy actor-critic algorithm with added noise in action selection.
- Deep Deterministic Policy Gradients: Apply tricks from DQN to Deterministic Policy Gradients ;)
- - Asynchronous Advantage Actor Critic (A3C): Instead of using an experience replay buffer as in DQN use multiple agents on different threads to explore the state spaces and make decorrelated updates to the actor and the critic.
+ - Asynchronous Advantage Actor-Critic (A3C): Instead of using an experience replay buffer as in DQN use multiple agents on different threads to explore the state spaces and make decorrelated updates to the actor and the critic.
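
To show the `grad(log(pi(s, a))) * g_t` update from the summary above in isolation, here is a NumPy sketch of REINFORCE with a linear-softmax policy. The feature size, action count, learning rate, and episode format are illustrative assumptions; the repo's solutions use TensorFlow function approximators instead:

```python
# Schematic REINFORCE (Monte Carlo Policy Gradient) with a linear-softmax policy.
import numpy as np

n_features, n_actions = 4, 2
alpha, gamma = 0.01, 0.99
theta = np.zeros((n_features, n_actions))        # policy parameters

def policy_probs(state):
    logits = state @ theta
    exps = np.exp(logits - logits.max())          # numerically stable softmax over actions
    return exps / exps.sum()

def grad_log_pi(state, action):
    # For a linear-softmax policy: phi(s, a) minus the expected feature under pi
    probs = policy_probs(state)
    grad = -np.outer(state, probs)
    grad[:, action] += state
    return grad

def reinforce_update(episode):
    """episode: list of (state, action, reward) tuples collected with the current policy."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward                    # sampled return g_t: unbiased, high variance
        theta[:] = theta + alpha * G * grad_log_pi(state, action)
```

Adding a baseline means replacing `G` with `G - V(state)` for a separately trained value estimate, and an Actor-Critic variant would replace it with a bootstrapped TD error computed at every step instead of waiting for the episode to end.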


### Lectures & Readings
@@ -51,14 +51,14 @@
- REINFORCE with Baseline
- Exercise
- [Solution](CliffWalk REINFORCE with Baseline Solution.ipynb)
- - Actor Critic with Baseline
+ - Actor-Critic with Baseline
- Exercise
- - [Solution](CliffWalk Actor Critic Solution.ipynb)
- - Actor Critic with Baseline for Continuous Action Spaces
+ - [Solution](CliffWalk Actor-Critic Solution.ipynb)
+ - Actor-Critic with Baseline for Continuous Action Spaces
- Exercise
- - [Solution](Continuous MountainCar Actor Critic Solution.ipynb)
+ - [Solution](Continuous MountainCar Actor-Critic Solution.ipynb)
- Deterministic Policy Gradients for Continuous Action Spaces (WIP)
- Deep Deterministic Policy Gradients (WIP)
- - Asynchronous Advantage Actor Critic (A3C)
+ - Asynchronous Advantage Actor-Critic (A3C)
- Exercise
- [Solution](a3c/)
2 changes: 1 addition & 1 deletion PolicyGradient/a3c/README.md
@@ -1,4 +1,4 @@
- ## Implementation of A3C (Asynchronous Advantage Actor Critic)
+ ## Implementation of A3C (Asynchronous Advantage Actor-Critic)

#### Running

4 changes: 2 additions & 2 deletions README.md
@@ -5,9 +5,9 @@ This repository provides code, exercises and solutions for popular Reinforcement
- [Reinforcement Learning: An Introduction (2nd Edition)](https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf)
- [David Silver's Reinforcement Learning Course](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)

- Each folder in corresponds to one or more chapters of the above textbook and/or course. In addition to exercises and solution each folder also contains a list of learning goals, a brief concept summary, and links to the relevant readings.
+ Each folder in corresponds to one or more chapters of the above textbook and/or course. In addition to exercises and solution, each folder also contains a list of learning goals, a brief concept summary, and links to the relevant readings.

- All code is written in Python 3 and use RL environments from [OpenAI Gym](https://gym.openai.com/). Advanced techniques use [Tensorflow](https://www.tensorflow.org/) for neural network implementations.
+ All code is written in Python 3 and uses RL environments from [OpenAI Gym](https://gym.openai.com/). Advanced techniques use [Tensorflow](https://www.tensorflow.org/) for neural network implementations.


### Table of Contents
