API discussion #4
Thanks for your interest in the project! Regarding the Game Interface: I agree with your assessment, modulo a few remarks.
More generally, I completely agree with the goal of moving towards a standard game interface.
MCTS
Thanks for your explanation. Things are clearer to me now.
I see. Sorry, I may have conflated this with the concept of an oracle in active learning.
That's an amazing improvement! Since you mentioned MuZero, do you have any plan to implement it? 😄
You're welcome.
This is a great question. At some point, I think it should be possible to have a framework where both algorithms can just be instantiated by assembling slightly different sets of primitives. This is another thing that should guide our API discussion.
I just attempted to adapt OpenSpiel/RLBase to the existing GameInterface. This is an easier way to verify whether we can safely replace GameInterface with RLBase or not. The biggest problem now is the symmetry assumption; I have a feeling that it won't be that easy to drop it. But we do have to rethink which parts are essential to the games we are about to handle and which are enhancements. The following is some ad-hoc code to implement a new Game with OpenSpiel.jl:

```julia
using OpenSpiel
using ReinforcementLearningBase
using ReinforcementLearningEnvironments
import AlphaZero.GI
tic_tac_toe = OpenSpielEnv("tic_tac_toe")
GI.Board(::Type{<:OpenSpielEnv}) = typeof(tic_tac_toe.state)
GI.Action(::Type{<:OpenSpielEnv}) = Int
# !!!
GI.actions(::Type{<:OpenSpielEnv}) = get_action_space(tic_tac_toe).span
Base.copy(g::OpenSpielEnv{O,D,S,G,R}) where {O,D,S,G,R} = OpenSpielEnv{O,D,S,G,R}(copy(g.state), g.game, g.rng)
GI.actions_mask(g::OpenSpielEnv) = get_legal_actions_mask(observe(g))
GI.board(g::OpenSpielEnv) = g.state
GI.white_playing(g::OpenSpielEnv) = get_current_player(g) == 0
function GI.white_reward(g::OpenSpielEnv)
obs = observe(g, 0)
get_terminal(obs) ? get_reward(obs) : nothing
end
GI.play!(g::OpenSpielEnv, pos) = g(pos)
GI.vectorize_board(::Type{<:OpenSpielEnv}, board) = get_state(board)
GI.available_actions(g::OpenSpielEnv) = get_legal_actions(observe(g))
OpenSpielEnv{O,D,S,G,R}(state) where {O,D,S,G,R} = OpenSpielEnv{O,D,S,G,R}(state, tic_tac_toe.game, tic_tac_toe.rng)
Base.:(==)(a::typeof(tic_tac_toe.state), b::typeof(tic_tac_toe.state)) = observation_tensor(a) == observation_tensor(b)
GI.canonical_board(g::OpenSpielEnv) = g.state
```
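For reference, here is a hypothetical smoke test of the adapter above. It only uses the `GI` methods defined in the snippet (plus `Base.copy`), so treat it as a sketch that assumes those definitions behave as intended, not a verified example:

```julia
# Play one random game through the GameInterface wrapper defined above.
g = copy(tic_tac_toe)
while GI.white_reward(g) === nothing        # `nothing` means the game is not over yet
    a = rand(GI.available_actions(g))       # pick a random legal action
    GI.play!(g, a)
end
println("White reward: ", GI.white_reward(g))
```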
Thanks @findmyway for having a deeper look into integrating AlphaZero.jl with OpenSpiel/RLBase. I agree with you that the biggest obstacle right now is the symmetry assumption, which is exploited in several different places. As I explained in a previous post, I used to believe that it was the right thing to do (at least for the standard board games where AlphaZero was applied first, which happen to be symmetric), but I am now getting convinced that the resulting optimizations are not worth the cost in terms of generality. I agree that dropping it will require a fair amount of refactoring across the codebase, but the codebase is pretty small to start with, so nothing too scary here! I am happy to take a stab at it myself.
That's great. Feel free to ping me if there's any progress. (By the way, could you tag a release before refactoring, so that we can easily compare and discuss the diffs?)
I will tag a release. I just wanted to wait a little bit in case someone uncovers a problem right after the public announcement.
I personally find the MCTS implementation here a little hard to follow. I think the most readable I've seen is https://github.com/dkappe/a0lite. The master branch has a very simple version, but the fancy branch also has batching and pruning.
@oscardssmith Thanks for the feedback. Could you be more specific and tell me what made it hard to follow for you? One thing that probably makes it confusing is that it contains two implementations in one: there is both a synchronous and an asynchronous version of MCTS, which share most of their code. I should also probably add comments to explain the overall structure. :-)
A lot of it is naming and ordering, I think. I'm pretty sure
@oscardssmith You are raising a very interesting point, which prompted me to do some research. From what I've seen, this is actually a bit of a controversial issue. Merging nodes that correspond to identical states has actually been presented as an MCTS optimization: see for example section 5.2.4 of the 2012 MCTS survey by Browne et al. Intuitively at least, this makes sense: you do not want to waste computing time estimating the value of the same state several times (at least when the system is Markovian). Some people seem to disagree on whether or not this may bias exploration/exploitation; I personally found the argument for merging more convincing. Elsewhere, people warn that transposition tables should be used with care when imperfect information is involved, which is unsurprising. AlphaZero.jl is targeting games of perfect information though. I would be interested in your opinion on this debate.
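To make the "merging" idea concrete, here is a minimal, hypothetical sketch (not AlphaZero.jl code): nodes are looked up in a transposition table keyed by the state itself, so the paths A -> B -> D and A -> C -> D share a single statistics record for D.

```julia
# Node merging via a transposition table (illustrative sketch only).
mutable struct SharedStats
    W::Float64   # cumulative value backed up through this state
    N::Int       # number of visits of this state
end

const tree = Dict{Any,SharedStats}()   # keyed by the (hashable) game state

# Return the shared record for `state`, creating it on first visit.
get_node!(state) = get!(() -> SharedStats(0.0, 0), tree, state)
```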
Suppose you have the following part of a move graph: A->B->D, A->C->D, C->E, where A is your turn. Let D be good for you, and E be bad for you. If you merge the two D nodes, then C will always get more visits than B since the only difference is it has one extra child. Thus your MCTS converges to pick an arbitrarily bad move.
I disagree that C will get more visits than B. With the current implementation, I would expect MCTS to ultimately only visit B. I guess we could test this claim easily, but I think our misunderstanding might be resolved by noting that in the current implementation, there are separate counters for "number of visits of D coming from B" and "number of visits of D coming from C". These numbers, which we can write N_B(D) and N_C(D), are stored in nodes B and C respectively. Therefore, observing a success after traversing the path A -> B -> D does not increase N_A(C) and therefore does not encourage further exploration of C.
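For reference, here is the standard PUCT selection rule from the AlphaZero paper, written as a small Julia sketch over per-edge statistics (this is an illustration, not the actual AlphaZero.jl code). The point is that the score of an edge such as A -> C only reads Q_A(C), N_A(C) and the prior, all of which are stored at A, so backing up a result along A -> B -> D never touches them:

```julia
# PUCT selection over the edges of a single node.
# Q, N and P hold the per-edge statistics stored at the parent node.
function select_edge(Q::Vector{Float64}, N::Vector{Int}, P::Vector{Float64}; cpuct=1.0)
    Ntot = sum(N)
    scores = [Q[a] + cpuct * P[a] * sqrt(Ntot) / (1 + N[a]) for a in eachindex(Q)]
    return argmax(scores)   # index of the edge to traverse next
end
```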
You're right that I hadn't noticed that. That, however, raises a couple of questions: If the value of D drops, does C ever become aware of that drop? If you only propagate your N's back along the one path you visited, do you also only propagate the win statistics? Also, does this mean that D has a different N when visited via B and via C?
You are raising an excellent point. The thing is: there is no such thing as an estimate of the value of D. There are only estimates of Q_B(D) and Q_C(D). (I am abusing notation slightly here, I hope you don't mind.) Observing a success after traversing A -> B -> D would update Q_B(D) but not Q_C(D).
Yes, but this is not a problem as you're only updating a Q function estimate.
Yes. N_B(D) and N_C(D) are completely distinct values.
Overall, I think that what we are coming at is this: there are two fundamentally different ways to implement MCTS. One in which you try to approximate the value function, and one in which you try to approximate the Q function. This raises two questions:
I believed that Deepmind's AlphaGo Zero was using an implementation similar to mine. Indeed, reading their paper, I thought it was pretty clear they are storing Q-values and not values. But you're making me doubt this now.
I just re-read the relevant parts of the 3 A0 papers (AlphaGo, AlphaGo Zero, AlphaZero), and I think I can better explain how Lc0 has done it. Basically, if you note that Q(s,a) = V(s') and N(s,a) = N(s'), you now have a way to store your Q and N in the nodes rather than the edges. The edges then store moves and policies. The nice things about this design are that you don't need to store board states anywhere (since you can just make and unmake moves), and it ends up being memory efficient since you never initialize things you don't need.
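To make the contrast concrete, here is a rough sketch (hypothetical, matching neither codebase exactly) of the two layouts being discussed: statistics stored on edges versus statistics stored on nodes, Lc0-style.

```julia
# Edge statistics: Q and N live on the parent's edges, so N_B(D) and N_C(D)
# are distinct even if B and C lead to the same state D.
struct EdgeStatsNode
    W::Vector{Float64}     # cumulative reward per action (edge)
    N::Vector{Int}         # visit count per action (edge)
    P::Vector{Float64}     # prior probability per action (edge)
end

# Node statistics (Lc0-style): exploiting Q(s,a) = V(s') and N(s,a) = N(s'),
# Q and N live on the child node itself; edges only carry a move and a prior.
mutable struct NodeStatsNode
    W::Float64
    N::Int
    edges::Vector{Tuple{Int,Float64,NodeStatsNode}}   # (move, prior, child)
end
```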
I think you are right. The implementation you're referring to is more memory efficient. On the other hand, my implementation makes it easier to merge nodes. The Lc0 approach is probably a better default though, and I may implement it at some point. In any case, I should add comments in the code for clarity. Thanks for this very interesting conversation. :-)
@findmyway Can you explain what you mean by Worker? |
This refers to an old MCTS implementation that has since been replaced. Regarding multi-player games, there are two possible situations:
Also, I implemented full support for MDPs and compatibility with CommonRLInterface.jl, which can be found on the common-rl-intf branch. I am going to merge with master and release v0.4 as soon as Julia 1.5.3 gets released with this fix. |
@jonathan-laurent And there is another error: `GKS: Open failed in routine OPEN_WS`
I think the segfault on 1.5.2 was unrelated to GPU handling and it was fixed by this PR: JuliaLang/julia#37594. Regarding your other error (`GKS: can't connect to GKS socket application`), it is a known problem with Plots.jl (see JuliaPlots/Plots.jl#1649), but it does not crash the program and does not seem to keep the plots from being generated either.
I wonder what you changed in the common-rl-intf branch such that this GKS error appears. True, it does not stop the program and the plots are generated correctly. However, was your intention to display the plots during training? The segfault appears to happen inconsistently (sometimes after the 1st iteration, sometimes after the 2nd) on CUDA 11.0. But on CUDA 11.1, even after 20 iterations there was no segfault.
`Starting self-play`
Actually, I've had this error since the first release on my machine so I am not sure it has to do with
Interesting. I also had some issues with CUDA 11.0. I am wondering what changed with 11.1.
In the common-rl-intf branch you removed:
from scripts/alphazero.jl. I believe that's why the GKS warning was being printed. Adding these two lines back allowed me to train the connect-four example as well.
@deveshjawla What example are you talking about? |
Hi, it's the trading game. The grid-world example trains fine, with the average reward varying across iterations.
Ok, the average reward in the AlphaZero benchmark is now increasing with each iteration. Solved.
Hi @jonathan-laurent,
This project is really awesome!
Since you mentioned it in the doc Develop-support-for-a-more-general-game-interface, I'd like to write down some thoughts and discuss them with you.
Here I'll mainly focus on the Game Interface and MCTS parts. Along the way, the design differences between AlphaZero.jl, ReinforcementLearningBase.jl, and OpenSpiel are also listed.
Game Interface
To implement a new game, we make some assumptions according to the Game Interface. If I understand it correctly, the two main concepts are Game and Board.
In OpenSpiel, those two concepts are almost the same (the Board is named state in OpenSpiel), except that the state is not contained in the Game, which means the Game is just a static description (the history is contained in the state, not in the game).
In RLBase, the Game is treated as an AbstractEnvironment and the Board is just the observation of the env from the perspective of a player.
In this view, most of the interfaces in this package are aligned with those in RLBase. The detailed mapping is as follows:
- `AbstractGame` -> `AbstractEnv`
- `board` -> `observe`
- `Action` -> `get_action_space`
- `white_playing` -> `get_current_player`
- `white_reward` -> `get_reward`
- `board_symmetric` -> missing in RLBase; we would need to define a new trait to specify whether the state of a game is symmetric or not
- `available_actions` -> `get_legal_actions`
- `actions_mask` -> `get_legal_actions_mask`
- `play!` -> `(env::AbstractEnv)(action)`
- `heuristic_value` -> missing in RLBase
- `vectorize_board` -> `get_state`
- `symmetries` -> missing in RLBase
- `game_terminated` -> `get_terminal`
- `num_actions` -> `length(action_space)`
- `board_dim` -> `size(rand(observation_space))`
- `random_symmetric_state` -> missing in RLBase

I think it won't be very difficult to adapt to use OpenSpiel.jl, or even to use the interfaces in RLBase.
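As a quick illustration of the mapping above, a thin adapter might look like the following. This is only a hypothetical sketch (untested, and assuming the RLBase names listed above); it is not a complete or working implementation:

```julia
# Hypothetical adapter from RLBase environments to AlphaZero.GI (sketch only).
using ReinforcementLearningBase
import AlphaZero.GI

GI.actions_mask(env::AbstractEnv)      = get_legal_actions_mask(observe(env))
GI.available_actions(env::AbstractEnv) = get_legal_actions(observe(env))
GI.white_playing(env::AbstractEnv)     = get_current_player(env) == 1
GI.game_terminated(env::AbstractEnv)   = get_terminal(observe(env))
GI.play!(env::AbstractEnv, action)     = env(action)
GI.vectorize_board(::Type{<:AbstractEnv}, board) = get_state(board)
```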
MCTS
I really like the implementation of async MCTS in this package. I would like to see it separated out as a standalone package.
- The naming of some types is slightly strange to me. For example, there's an `Oracle{Game}` abstract type. If I understand it correctly, it is used in the rollout step to select an action. The first time I saw the name Oracle, I supposed its subtypes must implement some smart algorithms 😆. But in MCTS it is usually a light-weight method, am I right?
- The implementation of `Worker` assumes that there are only two players in the game. Do you have any idea how to expand it to apply to multi-player games?
- At first glance, I thought the async MCTS used some kind of root-level or tree-level parallelization. But I can't find multi-threading used anywhere in the code. It seems that the async part mainly collects a batch of states and gets the evaluation results all at once (see the sketch below). Am I right here? It would be better if you could share some implementation considerations here 😄
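Regarding the last point, here is a minimal sketch of how I picture the "collect a batch of states and evaluate them once for all" idea with Julia tasks and channels. It is purely illustrative: `evaluate_batch` is a hypothetical stand-in for a vectorized network call, and this is not the actual AlphaZero.jl implementation.

```julia
# Illustrative sketch of batched asynchronous leaf evaluation.
struct EvalRequest
    state::Any
    response::Channel{Any}   # the requesting simulation task blocks on this
end

# Single evaluator task: drain pending requests into a batch and answer them all
# with one call to the (hypothetical) vectorized oracle `evaluate_batch`.
function run_evaluator(requests::Channel{EvalRequest}, batch_size::Int)
    while true
        batch = EvalRequest[take!(requests)]                  # block until at least one request
        while length(batch) < batch_size && isready(requests)
            push!(batch, take!(requests))                     # greedily fill the batch
        end
        results = evaluate_batch([r.state for r in batch])    # one network call for the whole batch
        foreach((r, y) -> put!(r.response, y), batch, results)
    end
end

# Called from within a simulation task: yields to other tasks while waiting.
function evaluate_async(requests::Channel{EvalRequest}, state)
    resp = Channel{Any}(1)
    put!(requests, EvalRequest(state, resp))
    return take!(resp)
end
```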
Also cc @jbrea 😄