Rewards for Progress #29

Closed
bzier opened this issue Nov 16, 2017 · 3 comments

@bzier
Owner

bzier commented Nov 16, 2017

Detection:

Sample the HUD progress line. This could be implemented in a few different ways, depending on factors like feasibility and performance impact (a rough sketch of the third option follows the list):

  • Sample two consecutive points of the HUD progress line, shifting them to the next two pixels as we go
  • Sample at points along the four edges of the character icon to track its progress in each direction
  • Sample all points along the HUD progress line and calculate the progress and/or position based on the colors at all points
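
To make that third option concrete, here is a minimal sketch, assuming we can read the raw RGB frame from the emulator screen. The line coordinates, 'completed' color, and tolerance below are placeholder assumptions, not the actual HUD layout:

```python
# Minimal sketch: sample every pixel along the HUD progress line and estimate
# progress from how many have turned the "completed" color. Coordinates,
# color, and tolerance are illustrative placeholders.
import numpy as np

# Hypothetical (x, y) screen coordinates tracing the HUD progress line,
# ordered from start to finish.
PROGRESS_LINE_POINTS = [(x, 420) for x in range(100, 540)]

# Hypothetical RGB color a segment takes on once it has been passed.
COMPLETED_COLOR = np.array([255, 255, 0])
COLOR_TOLERANCE = 30


def estimate_progress(frame: np.ndarray) -> float:
    """Return the fraction [0, 1] of sampled points that look 'completed'.

    `frame` is an (H, W, 3) RGB array of screen pixels.
    """
    completed = 0
    for x, y in PROGRESS_LINE_POINTS:
        pixel = frame[y, x].astype(int)
        if np.abs(pixel - COMPLETED_COLOR).max() <= COLOR_TOLERANCE:
            completed += 1
    return completed / len(PROGRESS_LINE_POINTS)
```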

Rewards:

  • +1 reward as a point turns the appropriate color (forward progress)
  • -1 reward if no points change (stationary / no progress)
  • -2 reward if an already achieved point subsequently changes away from the appropriate color (driving backwards / reverse progress)

Challenges:

  • Corners of the HUD progress line:
    Unfortunately, as the character icon changes direction, it covers the corners for a longer period of time than it covers the rest of the line. Detecting what is happening there will be difficult; we may need to fall back to a 0 reward and treat it as unknown (see the sketch below).
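
A minimal sketch of how the reward values above, plus the 0 fallback for the ambiguous corner case, might be mapped from the change in estimated progress between frames; the epsilon threshold and the `in_corner` flag are illustrative assumptions:

```python
# Map the per-step change in estimated progress to the reward values above,
# with a 0 fallback for the ambiguous corner case.
PROGRESS_EPSILON = 1e-3  # smallest change treated as real movement (assumed)


def progress_reward(prev_progress: float, curr_progress: float,
                    in_corner: bool = False) -> int:
    """Reward based on the change in HUD progress between two frames."""
    if in_corner:
        return 0   # icon covers the line; treat as unknown
    delta = curr_progress - prev_progress
    if delta > PROGRESS_EPSILON:
        return 1   # forward progress
    if delta < -PROGRESS_EPSILON:
        return -2  # reverse progress
    return -1      # stationary / no progress
```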

Reasoning:

The current checkpoint rewards system seems to cause early convergence to local maxima. I haven't yet proven that this is occurring; the evidence is anecdotal, based strictly on observation. However, I have seen the agent drive in a seemingly intentional fashion, straight to a point in the wall, smash against that spot on the wall for 30 seconds, and then drive 'intentionally' to another somewhat distant point on a different wall and, again, smash against it for a while. It appears to me that these points are where the checkpoint rewards are being granted. Of course, the rewards are for reaching a point of progress around the course, and not actually for hitting a specific point on the wall. However, it seems that the agent associates that particular location with the reward.

Qs (not FAQs because nobody has asked them but me):

How will this new system address that issue?
The current rewards are given only sparsely, and the reward value is significantly larger than what the agent 'normally' sees. I believe this causes the agent to train towards those specific known locations that 'guarantee' a high reward value. By reducing the reward to a small value and providing it consistently as the agent makes progress, the theory is that this should reduce convergence to any particular location. It should provide a reward signal that is more closely associated with the concept of progress (as opposed to checkpoints, which are rewards for being in a specific location).

Why is 'no progress' punished with -1 instead of just 0 reward?
Initially, a 0 reward for 'no progress' would likely work; the agent should still learn that forward is good, standing still is not, and going backwards is bad. However, once an agent is able to successfully complete an entire race, the rewards still need to be meaningful. If one agent drives a bit, sits still a bit, drives a bit, sits still a bit, and so on, it would eventually finish the course with a +n cumulative reward (where n is the number of +1 progress rewards it was given). If a second agent drives the entire course without stopping, it will complete the course much faster than the first agent, but it will also finish with a +n cumulative reward. In this way, the two agents will believe they are equally successful. If, instead, 'no progress' is punished with -1, the end result will correctly indicate that the agent that completed the race faster was more successful.
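
To make the arithmetic concrete with made-up numbers: suppose both agents collect n = 100 progress rewards, and the slower agent additionally spends 50 steps standing still:

```python
# Illustration of the argument above with hypothetical numbers: only the -1
# penalty for idle steps separates the fast agent from the slow one.
n_progress_steps = 100   # +1 each, same for both agents
idle_steps_slow = 50     # extra 'no progress' steps taken by the slow agent

# With a 0 reward for standing still, both totals are identical:
slow_with_zero = n_progress_steps * 1 + idle_steps_slow * 0     # 100
fast_with_zero = n_progress_steps * 1                           # 100

# With a -1 reward for standing still, the faster agent comes out ahead:
slow_with_penalty = n_progress_steps * 1 + idle_steps_slow * -1  # 50
fast_with_penalty = n_progress_steps * 1                         # 100

print(slow_with_zero, fast_with_zero, slow_with_penalty, fast_with_penalty)
```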

Is there any relation to #27 ?
This could potentially be combined with #27, perhaps optionally. As described in that issue, an episode could be terminated after n steps unless a checkpoint is reached, which would extend the episode. Alternatively, an episode could be terminated after an agent's cumulative reward dips below some low threshold, or after n consecutive 'bad' steps (no progress, or backwards progress). There are lots of options that could be explored here.
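
As a rough illustration of the last two options, here is a sketch of a termination check based on consecutive bad steps or a cumulative-reward floor; both thresholds are arbitrary placeholders, not values from the project:

```python
# End the episode after too many consecutive non-positive rewards, or once
# the cumulative reward drops below a floor. Both thresholds are assumed.
MAX_BAD_STEPS = 100
MIN_CUMULATIVE_REWARD = -200


def should_terminate(consecutive_bad_steps: int, cumulative_reward: float) -> bool:
    """Decide whether to cut the episode short based on recent progress."""
    return (consecutive_bad_steps >= MAX_BAD_STEPS
            or cumulative_reward <= MIN_CUMULATIVE_REWARD)


# Example bookkeeping inside a step loop:
# consecutive_bad_steps = consecutive_bad_steps + 1 if reward <= 0 else 0
# cumulative_reward += reward
# done = done or should_terminate(consecutive_bad_steps, cumulative_reward)
```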

@aymen-mouelhi

Other aspects should maybe be considered:

  • lap time
  • position of the kart (in the road or not)

@bzier
Owner Author

bzier commented Mar 21, 2018

@aymen-mouelhi Thanks for the suggestions. The lap time is actually implicitly considered in the rewards per step in the positive direction. The faster the agent completes a lap (or an entire race), the greater the reward.

Position of the kart is a tricky thing. First of all, detecting that in the environment is difficult. Currently I am only utilizing screen pixels (i.e. no access to internal game state information within the emulator itself).

Secondly, I want to be careful about the shaping of the reward function. Getting too domain-specific reduces the ability to generalize. In the case of MarioKart, the domain-specific reward function, as described above, simply rewards based on forwards/backwards movement (i.e. progress). Shaping the reward function to include additional information, like whether or not the kart is on the road, starts to push into the 'how', not just the 'what'. In other words, I'm trying to keep the reward function rewarding based on whether or not the agent is achieving the goal (forward progress, quickly), rather than based on how it is achieving the goal (staying on the road, avoiding walls, etc.). An example where the 'how' may vary is course shortcuts. Often the shortcut is off-road, but results in quicker progress around the course. We don't want to prevent the agent from learning shortcuts on its own by limiting its exploration.

Ultimately, the more domain-specific and fine-tuned the reward function is, the harder it is to code, the less likely it is to generalize, and the more likely it is to include (potentially problematic) bias. We make all sorts of assumptions (e.g. that the road is better), but that may not always be the case, and we want to allow the AI to determine for itself what the best 'how' is without feeding it our biases. Keeping the reward function bound as closely as possible to the goal, while still providing a good enough signal for the agent to learn successfully, is the important piece here.

@bzier
Owner Author

bzier commented Mar 21, 2018

The improvements described in this issue are included in PR #38.

@bzier bzier closed this as completed Mar 21, 2018