[Computer-go] action-value Q for unexpanded nodes
gcp at sjeng.org
Wed Dec 6 01:23:09 PST 2017
On 03-12-17 17:57, Rémi Coulom wrote:
> They have a Q(s,a) term in their node-selection formula, but they
> don't tell what value they give to an action that has not yet been
> visited. Maybe Aja can tell us.
FWIW I already asked Aja this exact question a bit after the paper came
out and he told me he cannot answer questions about unpublished details.
This does not bode well for reproducibility, considering that the
AlphaZero (AZ) paper is even lighter on such details.
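
To make the open question concrete, here is a minimal sketch of a
PUCT-style selection step, Q(s,a) + U(s,a) with
U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)), in
Python. The value given to an unvisited action is exposed as a
parameter, q_unvisited, precisely because it is the unpublished detail.
The names Edge, select_action and q_unvisited, and the defaults
c_puct=1.0 and q_unvisited=0.0, are my own assumptions for illustration
and not anything DeepMind has confirmed.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float            # P(s, a), the policy network's prior
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # W(s, a), sum of backed-up values


def q_value(edge, q_unvisited):
    # Q(s, a) is W / N once visited. For N == 0 it is undefined, so we
    # fall back to q_unvisited (an assumption, not a published value).
    if edge.visits == 0:
        return q_unvisited
    return edge.value_sum / edge.visits


def select_action(edges, c_puct=1.0, q_unvisited=0.0):
    # edges maps an action to its Edge; returns the argmax of Q + U.
    sqrt_total = math.sqrt(sum(e.visits for e in edges.values()))

    def score(item):
        _, e = item
        u = c_puct * e.prior * sqrt_total / (1 + e.visits)
        return q_value(e, q_unvisited) + u

    return max(edges.items(), key=score)[0]


# One visited move with Q = 0.6 and one unvisited move, equal priors.
edges = {
    "a": Edge(prior=0.5, visits=10, value_sum=6.0),
    "b": Edge(prior=0.5),
}
draw_init = select_action(edges, q_unvisited=0.0)   # unvisited = draw
loss_init = select_action(edges, q_unvisited=-1.0)  # unvisited = loss
```

With values in [-1, 1], treating an unvisited move as a draw makes the
search try "b" next, while treating it as a loss keeps it on "a". So
the choice changes how widely the tree is explored, which is why it
matters for anyone trying to reproduce the results.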
Another issue which is up in the air is whether the choice of the number
of playouts for the MCTS part represents an implicit balancing between
self-play and training speed. This is particularly relevant if the
evaluation step is removed. But it's possible even DeepMind doesn't know
the answer for sure. They had a setup, and they optimized it. It's not
clear which parts generalize.
(Usually one wonders about such things in terms of algorithms, but here
one wonders about it in terms of hardware!)