
Is it possible to limit the action space differently for each step? [question] #108

Closed
JoelNiklaus opened this issue Dec 2, 2018 · 10 comments
Labels: custom gym env, question

Comments

@JoelNiklaus

Hi,

I am developing an AI for a card game. I came across your repo. You are doing a great job! :)
In my card game I have a discrete action space containing 36 values. In each step, 1 to 9 of these values are valid moves. The other values should be excluded (so their probabilities should be set to zero). I was able to implement this in the prediction phase using the action_probabilities. My question now is: how can I implement this in the learning phase?
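For illustration, a minimal sketch of this kind of prediction-time masking (assuming a stable-baselines model that exposes `action_probability`, and a hypothetical `valid_actions` list of the currently legal moves supplied by the environment):

```python
import numpy as np

def predict_valid_action(model, obs, valid_actions):
    """Mask out invalid moves at prediction time and sample from the rest.

    `valid_actions` is a hypothetical list of the 1-9 currently legal moves."""
    probs = np.ravel(model.action_probability(obs))  # probability for each of the 36 actions
    mask = np.zeros_like(probs)
    mask[valid_actions] = 1.0
    masked = probs * mask
    if masked.sum() == 0.0:        # edge case: the policy puts ~0 mass on every valid move
        masked = mask              # fall back to a uniform choice over the valid moves
    masked /= masked.sum()         # renormalize so it is a proper distribution again
    return np.random.choice(len(probs), p=masked)
```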

Cheers,
Joel

araffin added the question and custom gym env labels on Dec 2, 2018
@araffin (Collaborator) commented Dec 2, 2018

Hello,
The easiest solution I see to prevent the agent from taking invalid actions is the following:

  • Keep the action space constant (because RL algorithms are generally not designed for a changing action space)
  • Give a negative reward for taking an invalid action, but do nothing in that case, i.e. the game state should stay the same
  • Stop the game early if too many invalid actions are taken

The idea is that the agent should learn at least to take valid actions to maximize its cumulative reward.
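A minimal `gym.Env` sketch of that scheme (the game-specific helpers `_valid_actions`, `_play`, and `_get_obs` are hypothetical, and the -1 penalty and invalid-action limit are arbitrary choices):

```python
import numpy as np
import gym
from gym import spaces

class CardGameEnv(gym.Env):
    """Fixed Discrete(36) action space; invalid moves are penalized and
    leave the state unchanged; too many invalid moves end the episode early."""

    MAX_INVALID = 10  # arbitrary early-termination threshold

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(36)
        self.observation_space = spaces.Box(low=0, high=1, shape=(36,), dtype=np.float32)
        self.invalid_count = 0

    def step(self, action):
        if action not in self._valid_actions():            # hypothetical game-rule check
            self.invalid_count += 1
            done = self.invalid_count >= self.MAX_INVALID  # stop the game early
            return self._get_obs(), -1.0, done, {}         # penalty, state unchanged
        reward, done = self._play(action)                  # hypothetical game logic
        return self._get_obs(), reward, done, {}

    def reset(self):
        self.invalid_count = 0
        # ... reset the game-specific state here ...
        return self._get_obs()
```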

@JoelNiklaus (Author)

Thank you very much for the quick response.
Yes, I thought of that, but over 100k episodes there was no improvement. I am going to run the experiment for more episodes.

But I thought that disabling invalid moves already during training might speed up the learning process. Because this could benefit further experimentation, it would be nice if that option were available. Is there a way to do this, or would it be far too time-consuming to implement something like that? If it is feasible, I am happy to provide a PR.

Your third point of stopping the game early is interesting. Why could this help?

@araffin (Collaborator) commented Dec 3, 2018

Your third point of stopping the game early is interesting. Why could this help?

By using early termination, you avoid exploring regions that are not useful. I remember seeing that used in a game with invalid actions (but I can't remember in which context exactly). More recently, that was apparently key in the success of Deep Mimic (see link below).
To quote them:

Early termination is a staple for RL practitioners, and it is often used to improve simulation efficiency. If the character gets stuck in a state from which there is no chance of success, then the episode is terminated early, to avoid simulating the rest. Here we show that early termination can in fact have a significant impact on the results.

Deep Mimic Paper: https://xbpeng.github.io/projects/DeepMimic/index.html

Because this could benefit further experimentation, it would be nice if that option was available.

Unfortunately, I don't see any straightforward solution for that problem... For me, in the RL framework, the action space is fixed, but I may be wrong. I think you should check how DeepMind did it for AlphaGo, but they were using tree search too...

@bertram1isu commented Dec 3, 2018 via email

@JoelNiklaus (Author) commented Dec 5, 2018

By using early termination, you avoid exploring regions that are not useful. I remember seeing that used in a game with invalid actions (but I can't remember in which context exactly). More recently, that was apparently key in the success of Deep Mimic (see link below).

I see. But in my opinion this is not necessary in my game, because the game only runs for 9 timesteps.

Unfortunately, I don't see any straightforward solution for that problem... For me, in the RL framework, the action space is fixed, but I may be wrong. I think you should check how DeepMind did it for AlphaGo, but they were using tree search too...

Yes, I think UCT is the next thing I am going to try.

This would kind of defeat the purpose of using a neural net, but could you define a mask that said whether to use the neural net output at a state or instead use a hard-coded negative value (such as -1e6)... then when you do a prediction with the neural net, you could apply the mask... including when you are extracting your policy.

Interesting. So if I understand this correctly, you either take the output of the neural net or this large negative value as an action. I don't quite see how this would solve the problem, though. I guess internally the NN is outputting probabilities for each action in the action space. The goal is to set the probabilities of the invalid actions to 0.
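For reference, applying the mask to the network's raw outputs (logits) before the softmax, rather than to the chosen action, would push the probabilities of the invalid actions to (numerically) zero. A NumPy-only sketch of that masking step:

```python
import numpy as np

def masked_softmax(logits, valid_actions):
    """Push invalid-action logits to a large negative value so that,
    after the softmax, their probabilities are numerically zero.

    `logits` are the raw network outputs for all 36 actions and
    `valid_actions` the indices of the currently legal moves."""
    masked = np.full_like(logits, -1e6, dtype=np.float64)
    masked[valid_actions] = logits[valid_actions]
    exp = np.exp(masked - masked.max())   # subtract the max for numerical stability
    return exp / exp.sum()
```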

@ardabbour

Hi, I ran into a similar problem when developing an agent to play Backgammon. I found two ways to work around this; neither is ideal, but both are definitely worth a shot:

  1. Make your reward function binary: +1 for valid actions, -1 for invalid actions (and step with a random valid action when an invalid one is chosen), then train for as long as needed to teach the agent 'the rules of the game'. Save that agent, then revert to your regular reward function (with invalid actions still producing large negative rewards and stepping with random valid actions), load the saved agent, and continue training.

  2. Extract the action probabilities (I'm assuming your action space is discrete) and choose the valid action with the highest probability (see the sketch below).
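A sketch of option 2, again assuming a stable-baselines model with a discrete action space and a `valid_actions` list supplied by the environment:

```python
import numpy as np

def best_valid_action(model, obs, valid_actions):
    """Greedily pick the legal move that the policy currently rates highest."""
    probs = np.ravel(model.action_probability(obs))   # probability for every action
    valid_actions = np.asarray(valid_actions)
    return valid_actions[np.argmax(probs[valid_actions])]
```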

@JoelNiklaus (Author)

Hi,
Thank you for the hints. I thought of doing this too.
Did you achieve maximal reward when learning the rules?
When I trained for the rules, I did not achieve maximal reward in every case, even after 5M games played.

@ardabbour

I am not sure what you mean by maximal reward in every case; if you mean the maximum theoretical reward, then I did not attempt to push the agent to that level. You might not see that unless the number of games played goes to infinity.

That will depend on the algorithm you are using; some algorithms will deliberately avoid the best available choice - it has to do with the whole exploration vs. exploitation trade-off. Also, what you end up with (the neural network policy) is a probability distribution, which means there is a randomness factor in choosing which action to take.
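If that sampling randomness at evaluation time is a concern, most stable-baselines algorithms can be asked for the greedy (deterministic) action at prediction time; a quick illustration (assuming a trained `model` and an observation `obs` from the env):

```python
# `model` is a trained stable-baselines agent, `obs` an observation from the env.
action, _states = model.predict(obs)                       # sampled from the policy distribution
action, _states = model.predict(obs, deterministic=True)   # always the highest-probability action
```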

@JoelNiklaus (Author)

Yes, I mean the maximum theoretical reward (in the rules setting only). If we really want to learn the rules, the maximum theoretical reward is the goal, right? Otherwise we still cannot be sure that the agent is not taking invalid actions.

Right. Thank you.

@araffin (Collaborator) commented Jul 2, 2019

I will close this issue in favor of #351
