The compute_loss function is wrong for the Simplest Policy Gradient #414
Comments
I notice the same problem.
@alantpetrescu I think you are correct that the equation as written differs from what is implemented in the code, but only by a constant multiple. Since we multiply the gradient estimate by the learning rate when performing gradient ascent, the constant multiple doesn't really matter.
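For reference, here is a sketch of the two estimators being compared (my own paraphrase of the notation in the docs; `D` is the batch of trajectories and `N` the total number of timesteps in the batch):

```latex
% Estimator as written in the docs: average over the trajectories in the batch D
\hat{g}_{\text{docs}} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T}
    \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)

% Estimator as implemented: the .mean() divides by N, the total timesteps in the batch
\hat{g}_{\text{code}} = \frac{1}{N} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T}
    \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)
  = \frac{|\mathcal{D}|}{N}\, \hat{g}_{\text{docs}}
```

So the two differ by the factor `|D| / N`, which is constant only if the total batch length `N` is (roughly) the same every epoch; that is the point raised in the next comment.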
I think the problem with this is that not all trajectories have the same length. Some trajectories might be short (e.g. the game ends early) and some might be really long, so a single batch will most likely contain trajectories of different lengths. Imagine the following: you sample a batch whose trajectories total roughly 1000 timesteps and update according to the gradients, then you sample another batch of roughly the same total length and update again. For both batches we divided the sum by roughly 1000, which could indeed be folded into the learning rate, but the big problem is that the denominator will most likely change from batch to batch! That is equivalent to changing the learning rate for every batch, which will surely cause a problem. So @alantpetrescu, I think this is still a bug. It might still work in practice, because if episode lengths are roughly equal and do not vary too much between batches, the effect will barely be observable and can be thought of as a rescaled learning rate. But I'm not sure we have such even-length episodes, or any guarantee of that whatsoever.
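A small numeric sketch of that point (the batch statistics below are made up purely for illustration):

```python
# Hypothetical batch statistics; in the real script they depend on how long episodes run.
batches = [
    {"num_trajectories": 5, "total_timesteps": 1000},   # first batch
    {"num_trajectories": 5, "total_timesteps": 1430},   # second batch, longer episodes
]

learning_rate = 1e-2
for b in batches:
    # Dividing by total_timesteps instead of num_trajectories rescales the update
    # by num_trajectories / total_timesteps relative to the written estimator:
    effective_lr = learning_rate * b["num_trajectories"] / b["total_timesteps"]
    print(b, "-> effective learning rate:", effective_lr)

# The factor differs between the two batches, so it cannot be absorbed into a single
# constant learning rate unless episode lengths stay roughly constant across batches.
```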
I have been reading the three parts of the "Introduction to RL" section, and I observed in part 3 that the compute_loss function for the Simplest Policy Gradient returns the mean of the product between the log probabilities of the actions taken by the agent and the weights of those actions, in other words, the finite-horizon undiscounted returns of the episodes in which they were taken.
In the basic policy gradient estimator above, the sum of products is divided by the number of trajectories, but in the implementation, returning the mean divides the sum of products by the number of all the actions taken across all the trajectories in one epoch. Maybe I am misunderstanding this, but I wanted to get a clear picture of the implementation.
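For concreteness, this is roughly the comparison in code. The first function paraphrases `compute_loss` from the Spinning Up example script (`1_simple_pg.py`); the stand-in network, dimensions, and the second function are my own additions so the snippet runs on its own and shows the alternative that matches the written estimator.

```python
import torch
import torch.nn as nn
from torch.distributions.categorical import Categorical

# Minimal stand-ins so the snippet is self-contained; in 1_simple_pg.py the policy
# network and get_policy helper come from the surrounding script.
obs_dim, n_acts = 4, 2
logits_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_acts))

def get_policy(obs):
    return Categorical(logits=logits_net(obs))

# As in the Spinning Up example (paraphrased): .mean() divides the sum of
# logp * weight by the total number of timesteps collected in the epoch.
def compute_loss(obs, act, weights):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).mean()

# A literal translation of the written estimator would divide by the number of
# trajectories instead (num_traj would have to be tracked during rollout):
def compute_loss_per_trajectory(obs, act, weights, num_traj):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).sum() / num_traj
```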