
Is it possible to limit the action space differently for each step? [question] #108

Closed
JoelNiklaus opened this issue Dec 2, 2018 · 10 comments
Labels: custom gym env, question

Comments

@JoelNiklaus

Hi,

I am developing an AI for a card game. I came across your repo. You are doing a great job! :)
In my card game I have a discrete action space containing 36 values. In each step, 1 to 9 of these values are valid moves. The other values should be excluded (so their probabilities should be set to zero). I was able to implement this in the prediction phase using the action_probabilities. My question now is: how can I implement this in the learning phase?
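For illustration, a minimal sketch of this kind of prediction-time masking (assuming a stable-baselines model that exposes `action_probability`, and a hypothetical `valid_actions` list of the currently legal moves supplied by the environment):

```python
import numpy as np

def predict_valid_action(model, obs, valid_actions):
    """Mask out invalid moves at prediction time and sample from the rest.

    `valid_actions` is a hypothetical list of the 1-9 currently legal moves."""
    probs = np.ravel(model.action_probability(obs))  # probability for each of the 36 actions
    mask = np.zeros_like(probs)
    mask[valid_actions] = 1.0
    masked = probs * mask
    if masked.sum() == 0.0:        # edge case: the policy puts ~0 mass on every valid move
        masked = mask              # fall back to a uniform choice over the valid moves
    masked /= masked.sum()         # renormalize so it is a proper distribution again
    return np.random.choice(len(probs), p=masked)
```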

Cheers,
Joel

araffin added the question and custom gym env labels on Dec 2, 2018
@araffin (Collaborator) commented Dec 2, 2018

Hello,
The easiest solution I see to prevent the agent from taking invalid actions is the following:

  • Keep the action space constant (because RL algorithms are generally not designed for a changing action space)
  • Give a negative reward for taking an invalid action, but do nothing in that case, i.e. the game state should stay the same
  • Stop the game early if too many invalid actions are taken

The idea is that the agent should learn at least to take valid actions to maximize its cumulative reward.
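A minimal `gym.Env` sketch of that scheme (the game-specific helpers `_valid_actions`, `_play`, and `_get_obs` are hypothetical, and the -1 penalty and invalid-action limit are arbitrary choices):

```python
import numpy as np
import gym
from gym import spaces

class CardGameEnv(gym.Env):
    """Fixed Discrete(36) action space; invalid moves are penalized and
    leave the state unchanged; too many invalid moves end the episode early."""

    MAX_INVALID = 10  # arbitrary early-termination threshold

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(36)
        self.observation_space = spaces.Box(low=0, high=1, shape=(36,), dtype=np.float32)
        self.invalid_count = 0

    def step(self, action):
        if action not in self._valid_actions():            # hypothetical game-rule check
            self.invalid_count += 1
            done = self.invalid_count >= self.MAX_INVALID  # stop the game early
            return self._get_obs(), -1.0, done, {}         # penalty, state unchanged
        reward, done = self._play(action)                  # hypothetical game logic
        return self._get_obs(), reward, done, {}

    def reset(self):
        self.invalid_count = 0
        # ... reset the game-specific state here ...
        return self._get_obs()
```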

@JoelNiklaus (Author)

Thank you very much for the quick response.
Yes, I thought of that, but over 100k episodes there was no improvement. I am going to run the experiment for more episodes.

But I thought that disabling invalid moves already during training might speed up the learning process. Because this could benefit further experimentation, it would be nice if that option were available. Is there a way to do this, or would it be far too time-consuming to implement something like that? If it is feasible, I am happy to provide a PR.

Your third point of stopping the game early is interesting. Why could this help?

@araffin (Collaborator) commented Dec 3, 2018

Your third point of stopping the game early is interesting. Why could this help?

By using early termination, you avoid exploring regions that are not useful. I remember seeing that used in a game with invalid actions (but I can't remember in which context exactly). More recently, that was apparently key in the success of Deep Mimic (see link below).
To quote them:

Early termination is a staple for RL practitioners, and it is often used to improve simulation efficiency. If the character gets stuck in a state from which there is no chance of success, then the episode is terminated early, to avoid simulating the rest. Here we show that early termination can in fact have a significant impact on the results.

Deep Mimic Paper: https://xbpeng.github.io/projects/DeepMimic/index.html

Because this could benefit further experimentation, it would be nice if that option was available.

Unfortunately, I don't see any straightforward solution for that problem... For me, in the RL framework, the action space is fixed, but I may be wrong. I think you should check how DeepMind did it for AlphaGo, but they were using tree search too...

@bertram1isu commented Dec 3, 2018 via email

@JoelNiklaus (Author) commented Dec 5, 2018

By using early termination, you avoid exploring regions that are not useful. I remember seeing that used in a game with invalid actions (but I can't remember in which context exactly). More recently, that was apparently key in the success of Deep Mimic (see link below).

I see. But in my opinion this is not necessary in my game, because the game only runs for 9 timesteps.

Unfortunately, I don't see any straightforward solution for that problem... For me, in the RL framework, the action space is fixed, but I may be wrong. I think you should check how DeepMind did it for AlphaGo, but they were using tree search too...

Yes, I think UCT is the next thing I am going to try.

This would kind of defeat the purpose of using a neural net, but could you define a mask that said whether to use the neural net output at a state or instead use a hard-coded negative value (such as -1e6)... then when you do a prediction with the neural net, you could apply the mask... including when you are extracting your policy.

Interesting. So if I understand this correctly, you either take the output of the neural net or this large negative value as an action. I don't quite see how this would solve the problem, though. I guess internally the NN is outputting probabilities for each action in the action space. The goal is to set the probabilities of the invalid actions to 0.
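For reference, applying the mask to the network's raw outputs (logits) before the softmax, rather than to the chosen action, would push the probabilities of the invalid actions to (numerically) zero. A NumPy-only sketch of that masking step:

```python
import numpy as np

def masked_softmax(logits, valid_actions):
    """Push invalid-action logits to a large negative value so that,
    after the softmax, their probabilities are numerically zero.

    `logits` are the raw network outputs for all 36 actions and
    `valid_actions` the indices of the currently legal moves."""
    masked = np.full_like(logits, -1e6, dtype=np.float64)
    masked[valid_actions] = logits[valid_actions]
    exp = np.exp(masked - masked.max())   # subtract the max for numerical stability
    return exp / exp.sum()
```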

@ardabbour

Hi, I ran into a similar problem when developing an agent to play Backgammon. I found two ways to work around this; neither is ideal, but both are definitely worth a shot:

  1. Make your reward function binary: +1 for valid actions, -1 for invalid actions (and step with a random valid action when an invalid one is chosen), then train for as long as needed to teach the agent 'the rules of the game'. Save that agent, then revert to your regular reward function (with invalid actions still producing large negative rewards and stepping with random valid actions), load the saved agent, and continue training.

  2. Extract the action probabilities (I'm assuming your action space is discrete) and choose the valid action with the highest probability (see the sketch below).
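A sketch of option 2, again assuming a stable-baselines model with a discrete action space and a `valid_actions` list supplied by the environment:

```python
import numpy as np

def best_valid_action(model, obs, valid_actions):
    """Greedily pick the legal move that the policy currently rates highest."""
    probs = np.ravel(model.action_probability(obs))   # probability for every action
    valid_actions = np.asarray(valid_actions)
    return valid_actions[np.argmax(probs[valid_actions])]
```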

@JoelNiklaus (Author)

Hi,
Thank you for the hints. I thought of doing this too.
Did you achieve maximal reward when learning the rules?
When I trained for the rules, I did not achieve maximal reward in every case, even after 5M games played.

@ardabbour

I am not sure what you mean by maximal reward in every case; if you mean the maximum theoretical reward, then I did not attempt to push the agent to that level. You might not see that unless the number of games played goes to infinity.

That will depend on the algorithm you are using; some algorithms will deliberately avoid the best available choice - it has to do with the whole exploration vs. exploitation trade-off. Also, what you end up with (the neural network policy) is a probability distribution, which means there is a randomness factor in choosing which action to take.
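If that sampling randomness at evaluation time is a concern, most stable-baselines algorithms can be asked for the greedy (deterministic) action at prediction time; a quick illustration (assuming a trained `model` and an observation `obs` from the env):

```python
# `model` is a trained stable-baselines agent, `obs` an observation from the env.
action, _states = model.predict(obs)                       # sampled from the policy distribution
action, _states = model.predict(obs, deterministic=True)   # always the highest-probability action
```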

@JoelNiklaus (Author)

Yes, I mean the maximum theoretical reward (in the rules setting only). If we really want to learn the rules, the maximum theoretical reward is the goal, right? Otherwise we still cannot be sure that the agent is not taking invalid actions.

Right. Thank you.

@araffin (Collaborator) commented Jul 2, 2019

I will close this issue in favor of #351
