Is it possible to limit the action space differently for each step? [question] #108

Hi,
I am developing an AI for a card game and came across your repo. You are doing a great job! :)
In my card game I have a discrete action space containing 36 values. In each step, only 1 to 9 of these values are valid moves; the other values should be excluded (i.e. their probabilities should be set to zero). I was able to implement this in the prediction phase using the action_probabilities. My question now is: how can I implement this in the learning phase?
Cheers,
Joel

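A rough sketch of what that prediction-phase masking can look like (the exact method for querying action probabilities and the way the valid-action mask is computed are assumptions here, not part of this repo):

```python
import numpy as np


def predict_valid_action(model, obs, valid_mask):
    """Prediction-phase masking: zero out invalid actions and renormalize.

    model:      a trained agent exposing per-action probabilities,
                e.g. via an `action_probability(obs)`-style method (assumed).
    valid_mask: boolean array of length 36, True for the currently valid moves.
    """
    probs = model.action_probability(obs)  # probabilities over all 36 actions
    probs = probs * valid_mask             # invalid actions -> probability 0
    probs = probs / probs.sum()            # renormalize over the valid actions
    return int(np.argmax(probs))           # or sample from `probs` instead
```
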
Hello,
The idea is that the agent should learn at least to take valid actions to maximize its cumulative reward.

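One common way to encode that idea is to penalize invalid moves through the reward. A minimal sketch, assuming a Gym-style environment that exposes a hypothetical `valid_actions()` helper (the wrapper name and penalty value are made up):

```python
import gym


class InvalidActionPenaltyWrapper(gym.Wrapper):
    """Hypothetical wrapper: instead of removing invalid actions from the
    action space, give them a negative reward so the agent learns to avoid them."""

    def __init__(self, env, penalty=-1.0):
        super().__init__(env)
        self.penalty = penalty
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        # `valid_actions()` is assumed to be provided by the wrapped env.
        if action not in self.env.valid_actions():
            # Invalid move: penalize and leave the state unchanged.
            return self._last_obs, self.penalty, False, {"invalid_action": True}
        self._last_obs, reward, done, info = self.env.step(action)
        return self._last_obs, reward, done, info
```
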
Thank you very much for the quick response. But I thought that disabling invalid moves already during training might speed up the learning process. Because this could benefit further experimentation, it would be nice if that option were available. Is there any way to do it, or would it be too time-consuming to implement something like that? If it is feasible, I am happy to provide a PR. Your third point, stopping the game early, is interesting. Why could this help?

By using early termination, you avoid exploring regions that are not useful. I remember seeing that used in a game with invalid actions (but I can't remember in which context exactly). More recently, that was apparently key in the success of DeepMimic (see link below). To quote them: "Early termination is a staple for RL practitioners, and it is often used to improve simulation efficiency. If the character gets stuck in a state from which there is no chance of success, then the episode is terminated early, to avoid simulating the rest. Here we show that early termination can in fact have a significant impact on the results."
DeepMimic paper: https://xbpeng.github.io/projects/DeepMimic/index.html
Unfortunately, I don't see any straightforward solution for that problem... For me, in the RL framework, the action space is fixed, but I may be wrong. I think you should check how DeepMind did it for AlphaGo, but they were using tree search too...

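As a concrete illustration of the early-termination idea in this setting (a minimal sketch, again assuming a hypothetical `valid_actions()` helper on the environment), the episode can simply end as soon as an invalid move is played:

```python
import gym


class EarlyTerminationWrapper(gym.Wrapper):
    """Hypothetical sketch: terminate the episode as soon as an invalid
    action is taken, instead of simulating the rest of the game."""

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        # `valid_actions()` is assumed to exist on the wrapped env.
        if action not in self.env.valid_actions():
            # No chance of success from here: end the episode early.
            return self._last_obs, -1.0, True, {"invalid_action": True}
        self._last_obs, reward, done, info = self.env.step(action)
        return self._last_obs, reward, done, info
```
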
For a standard (non-deep-learning) MDP this would be handled by setting elements of the transition matrix (or function) T(s' | s, a) to zero to indicate that a transition using that action is not possible.
With deep reinforcement learning, where a neural net is used to estimate the Q-value, the net provides estimates for every potential action regardless of whether it's possible.
This would kind of defeat the purpose of using a neural net, but could you define a mask that says whether to use the neural net output at a state or instead use a hard-coded negative value (such as -1e6)? Then, when you do a prediction with the neural net, you could apply the mask... including when you are extracting your policy.
One problem I see is that the mask would consume a lot of memory if it were stored like a table... probably on the order of the size of the matrix T. If it could be expressed as a function, it might be represented more compactly...?
- J.

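A minimal sketch of that masking idea (the `valid_action_mask()` function in the usage note is hypothetical; the point is that the mask can be computed from the game rules instead of being stored as a table of size |S| x |A|):

```python
import numpy as np

NEG_INF = -1e6  # hard-coded value for invalid actions, as suggested above


def masked_greedy_action(q_values, valid_mask):
    """Pick the greedy action among the valid ones.

    q_values:   array of shape (n_actions,) from the Q-network.
    valid_mask: boolean array of shape (n_actions,), True for valid actions.
    """
    masked_q = np.where(valid_mask, q_values, NEG_INF)
    return int(np.argmax(masked_q))


# Usage with a hypothetical mask function instead of a stored table:
# mask = valid_action_mask(state)              # e.g. computed from the game rules
# action = masked_greedy_action(q_net(state), mask)
```
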
I see. But in my opinion this is not necessary in my game, because the game only runs for 9 timesteps.
Yes, I think UCT is the next thing I am going to try.
Interesting. So if I understand this correctly, you either take the output of the neural net or this large negative value as an action. I don't quite see how this would solve the problem, though. I guess internally the NN is outputting probabilities for each action in the action space. The goal is to set the probabilities of the invalid actions to 0.

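For that last point, a minimal library-agnostic sketch of how invalid probabilities can be forced to zero: apply the large negative value to the logits before the softmax, and the remaining probabilities renormalize automatically (the 36-action example below uses made-up numbers):

```python
import numpy as np


def masked_action_probabilities(logits, valid_mask):
    """Turn raw policy logits into probabilities with invalid actions at 0.

    logits:     array of shape (n_actions,) from the policy network.
    valid_mask: boolean array of shape (n_actions,), True for valid actions.
    """
    masked_logits = np.where(valid_mask, logits, -1e6)  # large negative value
    exp = np.exp(masked_logits - masked_logits.max())   # numerically stable softmax
    return exp / exp.sum()


# Example: 36 actions, only actions 0, 5 and 7 are valid in this step.
logits = np.random.randn(36)
mask = np.zeros(36, dtype=bool)
mask[[0, 5, 7]] = True
probs = masked_action_probabilities(logits, mask)
assert probs[~mask].max() < 1e-8  # invalid actions get (numerically) zero probability
```
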
Hi, I ran into a similar problem when developing an agent to play Backgammon. I found two ways to work around this; neither is good, but both are definitely worth a shot:

Hi,

I am not sure what you mean by maximal reward in every case; if you mean the maximum theoretical reward, then I did not attempt to push the agent to that level. You might not see that unless the number of games played goes to infinity. It will also depend on the algorithm you are using: some algorithms will deliberately avoid the best available choice, which has to do with the whole exploration vs. exploitation idea. Also, what you end up with (the neural network) is a probability distribution, which means that there is a randomness factor in choosing which action to take.

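To illustrate that randomness factor with made-up numbers: even when the policy puts most of its probability on one action, sampling from the distribution can still pick a different (valid) action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up action probabilities over 4 valid actions (they sum to 1).
probs = np.array([0.6, 0.25, 0.1, 0.05])

greedy_action = int(np.argmax(probs))                   # deterministic: always action 0
sampled_action = int(rng.choice(len(probs), p=probs))   # stochastic: usually 0, sometimes not

print(greedy_action, sampled_action)
```
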
Yes, I mean the maximum theoretical reward (in the rules setting only). If we really want to learn the rules, the maximum theoretical reward is the goal, right? Otherwise we still cannot be sure that the agent is not taking invalid actions. Right, thank you.

I will close this issue in favor of #351.