Is it better to use mixed value approximation? #69

Closed · CGLemon opened this issue Jun 17, 2023 · 5 comments · Fixed by #78
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

CGLemon (Contributor) commented Jun 17, 2023

In the paper (Appendix D), DeepMind uses the mixed value approximation instead of the simple one. It seems that your implementation uses the simple one. In my experience, the simple one works on 9x9, but it fails on 19x19. So maybe the mixed value approximation would be the better choice?

    def calculate_completed_q_value(self) -> np.ndarray:

        # ... (omitted) ...

        sum_prob = np.sum(policy)
        v_pi = np.sum(policy * q_value)

        # Visited children keep their search Q; unvisited children fall back
        # to the policy-weighted average of the visited Q-values.
        return np.where(self.children_visits[:self.num_children] > 0, q_value, v_pi / sum_prob)

kobanium added the enhancement and question labels on Jun 20, 2023

kobanium (Owner) commented Jun 20, 2023

This is because I didn't understand how to calculate the v_mix value, so yes, TamaGo should use the mixed value approximation. Although I want to switch from the simple value to the mixed value approximation, I'm too busy to change it right now. I'll change it when I have enough time.

By the way, in my experiment with Ray, reinforcement learning using the simple value worked well on 19x19 (16 visits/move), so I'm curious why your 19x19 experiment failed.

CGLemon (Contributor, Author) commented Jun 21, 2023

Oh... maybe your implementation is different from mine. Does Ray rescale the Q-value in the Gumbel process?

CGLemon (Contributor, Author) commented Jun 21, 2023

I forgot to explain the v_mix value. The formula is very simple:

    # Policy-weighted average of the children's Q-values.
    sum_prob = np.sum(policy)
    v_pi = np.sum(policy * q_value)
    rhs = v_pi / sum_prob

    # Raw value network output of the parent node.
    lhs = parent_nn_value
    factor = np.sum(self.children_visits)

    # Interpolate: the parent value has weight 1, the children's weighted
    # average has weight equal to the total number of child visits.
    v_mix = (1 * lhs + factor * rhs) / (1 + factor)
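
For completeness, here is a self-contained sketch of how v_mix could plug into the completed-Q calculation. The function and argument names are illustrative rather than TamaGo's actual API, and the weighted average is restricted to visited children as described in the paper:

    import numpy as np

    def completed_q_values(policy: np.ndarray,
                           q_value: np.ndarray,
                           children_visits: np.ndarray,
                           parent_raw_value: float) -> np.ndarray:
        """Completed Q-values using the mixed value approximation.

        policy           : prior probability of each child
        q_value          : search Q-value of each child (valid where visits > 0)
        children_visits  : visit count of each child
        parent_raw_value : raw value network output of the parent node
        """
        visited = children_visits > 0
        total_visits = np.sum(children_visits)

        # With no visited children, fall back to the parent's raw value.
        if total_visits == 0:
            return np.full_like(q_value, parent_raw_value, dtype=float)

        # Policy-weighted average of Q over visited children only.
        sum_prob = np.sum(policy[visited])
        v_pi = np.sum(policy[visited] * q_value[visited])

        # v_mix: parent raw value (weight 1) mixed with the children's
        # weighted average (weight = total child visits).
        v_mix = (parent_raw_value + total_visits * (v_pi / sum_prob)) / (1.0 + total_visits)

        # Visited children keep their search Q; unvisited ones get v_mix.
        return np.where(visited, q_value, v_mix)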

kobanium (Owner) commented

Thanks for the snippet! Certainly, it is easy to implement.

I don't rescale the Q-value because the value network's output range is from 0.0 to 1.0. I think I shouldn't rescale the Q-value; the targets of the reinforcement learning process are very sensitive to it.
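
For reference, the "rescaling" under discussion appears to be the MuZero-style min-max normalization of Q-values over the visited children; a minimal sketch under that assumption (function name and edge-case handling are illustrative):

    import numpy as np

    def rescale_q(q_value: np.ndarray, children_visits: np.ndarray) -> np.ndarray:
        """Min-max rescale Q-values to [0, 1] over the visited children, as
        MuZero-style implementations do when the value head is unbounded.
        With a value head that already outputs [0, 1], this step is optional."""
        visited_q = q_value[children_visits > 0]
        if visited_q.size == 0:
            return q_value
        q_min, q_max = visited_q.min(), visited_q.max()
        if q_max - q_min < 1e-8:
            return q_value - q_min
        return (q_value - q_min) / (q_max - q_min)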

CGLemon (Contributor, Author) commented Jun 27, 2023

It seems that the rescaling is not necessary for AlphaZero. What's worse, it may make the policy too sharp. I fixed this issue in my main run, and the result shows the new weights are better than before.
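
A toy illustration of the "too sharp" effect (not TamaGo or Ray code; the sigma transform follows the Gumbel paper, while c_visit and c_scale here are assumed typical defaults): the improved policy is softmax(logits + sigma(completed_q)), and min-max rescaling stretches small Q differences across the full [0, 1] range, which inflates the logit gaps.

    import numpy as np

    def improved_policy(logits, completed_q, max_visit_count,
                        c_visit=50.0, c_scale=0.1):
        # sigma(q) = (c_visit + max_b N(b)) * c_scale * q, then softmax over
        # logits + sigma(q) gives the improved policy.
        sigma = (c_visit + max_visit_count) * c_scale * completed_q
        z = logits + sigma
        z = z - np.max(z)
        return np.exp(z) / np.sum(np.exp(z))

    logits = np.zeros(3)
    q = np.array([0.52, 0.50, 0.48])                  # value head already in [0, 1]
    q_rescaled = (q - q.min()) / (q.max() - q.min())  # stretched to span [0, 1]

    print(improved_policy(logits, q, 16))           # close to uniform
    print(improved_policy(logits, q_rescaled, 16))  # much sharper

With the raw Q-values the resulting policy stays close to uniform, while the rescaled ones push most of the probability mass onto a single move.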
