Continuous states and actions in high-dimensional spaces cannot be handled by most off-the-shelf reinforcement learning approaches. Policy gradient methods differ significantly as they do not suffer from these problems in the same way. For example, uncertainty in the state might degrade the performance of the policy (if no additional state estimator is used), but the optimization techniques for the policy do not need to be changed. Continuous states and actions can be dealt with in exactly the same way as discrete ones and, in addition, the learning performance is often increased. Convergence to at least a local optimum is guaranteed, which is critical for robotics.
The advantages of policy gradient methods for real-world applications are numerous. Among the most important ones are that the policy representation can be chosen so that it is meaningful for the task and can incorporate domain knowledge, that often fewer parameters are needed in the learning process than in value-function-based approaches, and that there is a variety of policy gradient estimation algorithms in the literature with a rather strong theoretical underpinning. Additionally, policy gradient methods can be used either model-free or model-based, as they are a generic formulation.
Policy gradient methods, however, also have drawbacks. They are by definition on-policy (note that tricks like importance sampling can slightly alleviate this problem) and need to forget data very fast in order to avoid introducing a bias into the gradient estimator; hence, the use of sampled data is not very efficient. In tabular representations, value-function methods are guaranteed to converge to a global optimum, while policy gradients only converge to a local optimum, and there may be many maxima in discrete problems. Policy gradient methods are often quite demanding to apply, mainly because one needs considerable knowledge about the system one wants to control in order to make reasonable policy definitions. Finally, policy gradient methods always have an open parameter, the learning rate, which may decide over the order of magnitude of the speed of convergence. These shortcomings have led to new approaches inspired by expectation-maximization (see, e.g., Vlassis et al., 2009; Kober & Peters, 2008).
The three main components of a Reinforcement Learning (RL) system for robotics are the state $\mathbf{s}$ (also found in the literature as $\mathbf{x}$), the action $\mathbf{a}$ (also found as $\mathbf{u}$) and the reward, denoted by $r$. We will denote the current time step by $h$. The stochasticity of the environment is represented by using a probability distribution $p\left(\mathbf{s}_{h+1}\vert\mathbf{s}_{h},\mathbf{a}_{h}\right)$ as model, where $\mathbf{a}_{h}$ denotes the current action and $\mathbf{s}_{h}$, $\mathbf{s}_{h+1}$ denote the current and the next state, respectively. Further, we assume that actions are generated by a policy $\pi_{\mathbf{\theta}}\left(\mathbf{a}_{h}\vert\mathbf{s}_{h}\right)$ which is modeled as a probability distribution in order to incorporate exploratory actions. The policy is assumed to be parameterized by policy parameters $\mathbf{\theta}\in\mathbb{R}^{K}$. The sequence of states and actions forms a trajectory denoted by $\mathbf{\tau}=[\mathbf{s}_{0:H},\mathbf{a}_{0:H}]$, where $H$ denotes the horizon, which can be infinite. The terms trajectory, history, trial and roll-out are often used interchangeably. At each instant of time, the learning system receives a reward $r_{h}=r\left(\mathbf{s}_{h},\mathbf{a}_{h}\right)$.
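As a concrete illustration, a roll-out can simply be recorded as arrays of states, actions and rewards. The sketch below is a hypothetical helper that assumes a classic Gym-style `env.reset()`/`env.step()` interface (4-tuple `step` return) and any callable `policy` that maps a state to a sampled action; neither is prescribed by the text above.

```python
import numpy as np

def collect_rollout(env, policy, horizon=100):
    """Record one trajectory tau = [s_0:H, a_0:H] together with the rewards r_0:H.

    Assumes the classic Gym API (env.reset() -> state,
    env.step(a) -> (next_state, reward, done, info)).
    """
    states, actions, rewards = [], [], []
    state = env.reset()
    for h in range(horizon):
        action = policy(state)                     # a_h ~ pi_theta(a|s_h)
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:                                   # the episode may end before the horizon
            break
    return np.array(states), np.array(actions), np.array(rewards)
```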
The general goal of policy optimization in reinforcement learning for robots is to optimize the policy parameters $\mathbf{\theta}$ so that the expected return

$J(\mathbf{\theta})=E\left\{ \sum\nolimits_{h=0}^{H}a_{h}r_{h}\right\}$

is optimized,
where $a_{h}$ denotes time-step-dependent weighting factors, typically set to $a_{h}=\gamma^{h}$ (with discount factor $\gamma\in[0,1)$) for discounted reinforcement learning and to $a_{h}=1/H$ for the average reward case.
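As a small worked example, the discounted return of a recorded reward sequence (weights $a_{h}=\gamma^{h}$) can be computed as follows; this is only an illustrative sketch:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Weighted sum of rewards with weights a_h = gamma^h (discounted setting)."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1 + 0 + 0.81 = 1.81
```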
For real-world applications, we require that any change to the policy parameterization is smooth, as drastic changes can be hazardous for the actor and, moreover, useful initializations of the policy based on domain knowledge would otherwise vanish after a single update step. For these reasons, policy gradient methods, which follow the steepest ascent on the expected return, are the method of choice. These methods update the policy parameterization according to the gradient update rule

$\mathbf{\theta}_{m+1}=\mathbf{\theta}_{m}+\alpha_{m}\left.\mathbf{\nabla}_{\mathbf{\theta}}J\right\vert_{\mathbf{\theta}=\mathbf{\theta}_{m}},$
where $\alpha_{m}$ denotes a learning rate and $m$ the current update number.
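In code, this update rule is a single line of gradient ascent; the sketch below uses a made-up quadratic objective in place of the expected return, just to show the mechanics:

```python
import numpy as np

def gradient_ascent_step(theta, grad, alpha=0.01):
    """theta_{m+1} = theta_m + alpha_m * grad J(theta_m)."""
    return theta + alpha * grad

# Toy objective J(theta) = -||theta - 1||^2 with gradient -2 (theta - 1).
theta = np.zeros(3)
for m in range(500):
    theta = gradient_ascent_step(theta, -2.0 * (theta - 1.0), alpha=0.05)
print(theta)  # converges towards [1, 1, 1]
```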
The main problem in policy gradient methods is to obtain a good estimator of the policy gradient $\mathbf{\nabla}_{\mathbf{\theta}}J$. In robotics, people have traditionally used deterministic model-based methods for obtaining the gradient; however, this approach does not scale well, as one needs to be able to model every detail of the system. Alternatively, we can estimate the policy gradient simply from data generated during the execution of a task, i.e., without the need for a model. The most prominent approaches for policy gradient estimation which have been applied to robotics are finite-difference and likelihood ratio methods (the latter better known as REINFORCE in the RL community). Of these two, the latter has a variety of advantages over finite-difference methods when applied to robotics (see the Scholarpedia article on policy gradient methods linked below).
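For reference, a finite-difference estimator is easy to sketch: perturb the parameters, estimate the return of each perturbed policy from roll-outs, and regress the change in return onto the perturbations. Here `rollout_return` is a placeholder for whatever routine estimates $J(\mathbf{\theta})$ from executed trajectories; this is an illustrative sketch, not a recommended robotics implementation.

```python
import numpy as np

def finite_difference_gradient(rollout_return, theta, num_perturbations=20, eps=0.1):
    """Finite-difference policy gradient estimate.

    Draws small random perturbations d_theta, measures the resulting change in
    estimated return, and solves the least-squares system dJ ~ dTheta @ g for g.
    """
    rng = np.random.default_rng(0)
    d_theta = eps * rng.standard_normal((num_perturbations, theta.size))
    d_j = np.array([rollout_return(theta + d) - rollout_return(theta) for d in d_theta])
    g, *_ = np.linalg.lstsq(d_theta, d_j, rcond=None)
    return g
```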
In likelihood ratio methods we assume that trajectories $\mathbf{\tau}$ are generated from a system by roll-outs, i.e., $\mathbf{\tau}\sim p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)=p\left(\mathbf{\tau}\vert\mathbf{\theta}\right)$, with return $R(\mathbf{\tau})=\sum\nolimits_{h=0}^{H}a_{h}r_{h}$, which leads to $J(\mathbf{\theta})=E\left\{R(\mathbf{\tau})\right\}=\int_{\mathbb{T}}p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)R(\mathbf{\tau})\,d\mathbf{\tau}$. In this case, the policy gradient can be estimated using the likelihood ratio (see, e.g., Glynn, 1987; Aleksandrov, Sysoyev, and Shemeneva, 1968), better known as the REINFORCE (Williams, 1992) trick, i.e., by using

$\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)=p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)$
from standard differential calculus ($\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)=\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)/p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)$), we obtain

$\mathbf{\nabla}_{\mathbf{\theta}}J=\int_{\mathbb{T}}\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)R(\mathbf{\tau})\,d\mathbf{\tau}=\int_{\mathbb{T}}p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)R(\mathbf{\tau})\,d\mathbf{\tau}=E\left\{\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)R(\mathbf{\tau})\right\}.$
As the expectation $E\left\{\cdot\right\}$ can be replaced by sample averages, denoted by $\left\langle\cdot\right\rangle$, only the derivative $\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)$ is needed for the estimator. Importantly, this derivative can be computed without knowledge of the generating distribution $p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)$, as

$p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)=p\left(\mathbf{s}_{0}\right)\prod\nolimits_{h=0}^{H}p\left(\mathbf{s}_{h+1}\vert\mathbf{s}_{h},\mathbf{a}_{h}\right)\pi_{\mathbf{\theta}}\left(\mathbf{a}_{h}\vert\mathbf{s}_{h}\right)$
which implies that

$\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)=\sum\nolimits_{h=0}^{H}\mathbf{\nabla}_{\mathbf{\theta}}\log\pi_{\mathbf{\theta}}\left(\mathbf{a}_{h}\vert\mathbf{s}_{h}\right),$
as only the policy depends on $\mathbf{\theta}$. Thus, the derivatives of the transition model $p\left(\mathbf{s}_{h+1}\vert\mathbf{s}_{h},\mathbf{a}_{h}\right)$ do not have to be computed and no model needs to be maintained. However, if we had a deterministic policy $\mathbf{a}=\pi_{\mathbf{\theta}}\left(\mathbf{s}\right)$ instead of a stochastic policy $\pi_{\mathbf{\theta}}\left(\mathbf{a}\vert\mathbf{s}\right)$, computing such a derivative would require the derivative $\mathbf{\nabla}_{\mathbf{\theta}}\log p\left(\mathbf{s}_{h+1}\vert\mathbf{s}_{h},\mathbf{a}_{h}\right)$ and, hence, it would require a system model.
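To make the model-free property concrete, here is a minimal sketch of $\mathbf{\nabla}_{\mathbf{\theta}}\log\pi_{\mathbf{\theta}}\left(\mathbf{a}\vert\mathbf{s}\right)$ for a hypothetical one-dimensional Gaussian policy with linear mean $\mu=\mathbf{\theta}^{\top}\mathbf{s}$ and fixed standard deviation; only the policy appears in the computation, never the transition model:

```python
import numpy as np

def gaussian_score(theta, state, action, sigma=0.1):
    """grad_theta log pi_theta(a|s) for a 1-D Gaussian policy with mean theta^T s.

    log pi = -(a - theta^T s)^2 / (2 sigma^2) + const, so the gradient is
    (a - theta^T s) / sigma^2 * s.  No system model is required.
    """
    mu = theta @ state
    return (action - mu) / sigma**2 * state

theta = np.array([0.5, -0.2])
print(gaussian_score(theta, state=np.array([1.0, 2.0]), action=0.3))
```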
In order to reduce the variance of the gradient estimator, a constant baseline $b$ can be subtracted from the return, i.e.,

$\mathbf{\nabla}_{\mathbf{\theta}}J=E\left\{\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)\left(R(\mathbf{\tau})-b\right)\right\},$
where the baseline $b\in\mathbb{R}$ can be chosen arbitrarily (Williams, 1992). It is straightforward to show that this baseline does not introduce a bias in the gradient, as differentiating $\int_{\mathbb{T}}p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)d\mathbf{\tau}=1$ implies that

$\int_{\mathbb{T}}\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)d\mathbf{\tau}=E\left\{\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(\mathbf{\tau}\right)\right\}=\mathbf{0}$
and, hence, the constant baseline will vanish for infinite data while reducing the variance of the gradient estimator for finite data. See Peters & Schaal (2008) for an overview of how to choose the baseline optimally. Therefore, the general path likelihood ratio estimator or episodic REINFORCE gradient estimator is given by

$g=\left\langle\left(\sum\nolimits_{h=0}^{H}\mathbf{\nabla}_{\mathbf{\theta}}\log\pi_{\mathbf{\theta}}\left(\mathbf{a}_{h}\vert\mathbf{s}_{h}\right)\right)\left(\sum\nolimits_{l=0}^{H}a_{l}r_{l}-b\right)\right\rangle,$
where $\left\langle\cdot\right\rangle$ denotes the average over trajectories. This type of method is guaranteed to converge to the true gradient at the fastest theoretically possible error decrease of $O\left(I^{-1/2}\right)$, where $I$ denotes the number of roll-outs (Glynn, 1987), even if the data is generated from a highly stochastic system.
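A minimal NumPy sketch of this estimator is given below. It assumes that the summed score vectors $\sum_{h}\mathbf{\nabla}_{\mathbf{\theta}}\log\pi_{\mathbf{\theta}}\left(\mathbf{a}_{h}\vert\mathbf{s}_{h}\right)$ and the (weighted) returns of $I$ roll-outs have already been recorded, and it uses the average return as a simple constant baseline rather than the optimal baseline of Peters & Schaal (2008):

```python
import numpy as np

def episodic_reinforce_gradient(score_sums, returns):
    """g = < (sum_h grad log pi(a_h|s_h)) * (R(tau) - b) >, averaged over roll-outs.

    score_sums: shape (I, K), one summed score vector per roll-out.
    returns:    shape (I,), the return of each roll-out.
    """
    score_sums = np.asarray(score_sums, dtype=float)
    returns = np.asarray(returns, dtype=float)
    b = returns.mean()                                # simple constant baseline
    return (score_sums * (returns - b)[:, None]).mean(axis=0)
```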
Pseudocode:
1. Initialize policy (e.g. NNs) parameter $\theta$ and baseline $b$
2. For iteration=1,2,... do
2.1 Collect a set of trajectories by executing the current policy obtaining $\mathbf{s}_{0:H},\mathbf{a}_{0:H},r_{0:H}$
2.2 At each timestep in each trajectory, compute
2.2.1 the return $R_t = \sum_{t'=t}^{H} \gamma^{t'-t}r_{t'}$ and
2.2.2 the advantage estimate $\hat{A}_t = R_t - b(s_t)$.
2.3 Re-fit the baseline (recomputing the value function) by minimizing
$\| b(s_t) - R_t\|^2$, summed over all trajectories and timesteps.
Alternatively, as a constant baseline (one value $b_k$ per gradient component), the variance-minimizing choice is
$b_{k}=\frac{\left\langle \left( \sum\nolimits_{h=0}^{H} \mathbf{\nabla}_{\theta_{k}}\log\pi_{\mathbf{\theta}}\left( \mathbf{a}_{h}\vert\mathbf{s}_{h}\right) \right)^{2}\sum\nolimits_{l=0}^{H} \gamma^{l} r_{l}\right\rangle }{\left\langle \left( \sum\nolimits_{h=0}^{H}\mathbf{\nabla}_{\theta_{k}}\log\pi_{\mathbf{\theta}}\left( \mathbf{a}_{h}\vert\mathbf{s}_{h}\right) \right)^{2}\right\rangle }$
2.4 Update the policy, using a policy gradient estimate $\hat{g}$,
which is a sum of terms $\nabla_\theta \log\pi(a_t | s_t,\theta)\,\hat{A}_t$.
In other words:
$g_{k}=\left\langle \left( \sum\nolimits_{h=0}^{H}\mathbf{\nabla}_{\theta_{k}}\log\pi_{\mathbf{\theta}}\left( \mathbf{a}_{h}\vert\mathbf{s}_{h}\right) \right) \left( \sum\nolimits_{l=0}^{H}\gamma^{l} r_{l}-b_{k}\right) \right\rangle$
3. **end for**
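To show how the steps above fit together, here is a self-contained toy sketch: a one-dimensional Gaussian policy with linear mean, a trivial "environment" whose reward peaks at a hidden target torque, and a simple constant baseline in place of the re-fitted value function. All names and hyper-parameters are illustrative and unrelated to the reference implementation linked below.

```python
import numpy as np

# Toy REINFORCE loop: the state is fixed at s = 1, the policy mean is mu = theta * s,
# the exploration noise sigma is fixed, and the reward peaks at a hidden target action.
rng = np.random.default_rng(0)
target, sigma, gamma, alpha, H = 2.0, 0.5, 0.99, 0.02, 10

theta = 0.0
for iteration in range(200):
    score_sums, returns = [], []
    for _ in range(20):                                  # 2.1 collect trajectories
        score_sum, ret = 0.0, 0.0
        for h in range(H):
            s = 1.0
            a = theta * s + sigma * rng.standard_normal()   # a_h ~ pi_theta(a|s_h)
            r = -(a - target) ** 2                          # reward r_h
            score_sum += (a - theta * s) / sigma**2 * s     # grad_theta log pi
            ret += gamma**h * r                             # 2.2.1 discounted return
        score_sums.append(score_sum)
        returns.append(ret)
    score_sums, returns = np.array(score_sums), np.array(returns)
    b = returns.mean()                                   # 2.3 constant baseline (simple choice)
    g = np.mean(score_sums * (returns - b))              # 2.4 policy gradient estimate
    theta += alpha * g                                   # gradient ascent update

print(theta)  # approaches the hidden target of 2.0
```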
In stochastic policy gradient, actions are drawn from a distribution parameterized by your policy. For example, your robot’s motor torque might be drawn from a Normal distribution with mean μ and standard deviation σ, where your policy predicts μ and σ. When you draw from this distribution and evaluate your policy, you can move your mean closer to samples that led to higher reward and farther from samples that led to lower reward, and reduce your deviation as you become more confident.
When you reduce the variance to 0, you get a deterministic policy. In deterministic policy gradient, we directly take the gradient of μ (a short sketch follows below).
In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space. As a result, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions.
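A tiny sketch of this idea with made-up numbers: the policy outputs μ and σ, the motor torque is a draw from the corresponding Normal distribution, and shrinking σ toward zero recovers the deterministic case where only μ is left to differentiate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_torque(mu, sigma):
    """Stochastic policy: draw a torque from N(mu, sigma^2); sigma -> 0 collapses onto mu."""
    return rng.normal(mu, sigma)

print([round(sample_torque(1.0, 0.5), 3) for _ in range(3)])    # exploratory samples
print([round(sample_torque(1.0, 1e-9), 3) for _ in range(3)])   # effectively deterministic
```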
A discrete implementation of VPG (vanilla policy gradient) can be found here. The code includes comments with the pseudo-code presented above for readability.
- http://www.scholarpedia.org/article/Policy_gradient_methods#Likelihood_Ratio_Methods_and_REINFORCE
- https://theneuralperspective.com/2016/10/27/gradient-topics/
- https://theneuralperspective.com/2016/11/25/reinforcement-learning-rl-policy-gradients-i/
- https://github.com/dennybritz/reinforcement-learning/tree/master/PolicyGradient