[Computer-go] action-value Q for unexpanded nodes
Álvaro Begué
alvaro.begue at gmail.com
Wed Dec 6 07:06:08 PST 2017
My hand-wavy argument succumbs to experimental data. And to a better
argument. :)
I stand corrected.
Thanks,
Álvaro.
On Wed, Dec 6, 2017 at 8:52 AM, Gian-Carlo Pascutto <gcp at sjeng.org> wrote:
> On 06-12-17 11:47, Aja Huang wrote:
> > All I can say is that first-play-urgency is not a significant
> > technical detail, and what's why we didn't specify it in the paper.
>
> I will have to disagree here. Of course, it's always possible I'm
> misunderstanding something, or I have a program bug that I'm mixing up
> with this.
>
> Or maybe you mean that you expect the program to improve regardless of
> this setting. In any case, I've now seen people state here twice that
> this is detail that doesn't matter. But practical results suggest
> otherwise.
>
> For a strong supervised network, FPU=0 (i.e. not exploring all successor
> nodes for a longer time, relying strongly on policy priors) is much
> stronger. I've seen this in Leela Zero after we tested it, and I've
> known it to be true from regular Leela for a long time. IIRC, the strong
> open source Go bots also use some form of progressive widening, which
> produces the same effect.
>
> For a weak RL network without much useful policy priors, FPU>1 is much
> stronger than FPU=0.
>
> Now these are relative scores of course, so one could argue they don't
> affect the learning process. But they actually do that as well!
>
> The new AZ paper uses MCTS playouts = 800, and plays proportionally
> according to MCTS output. (Previous AGZ had playouts = 1600,
> proportional for first 30 moves).
>
> Consider what this means for the search probability outputs, exactly the
> thing the policy network has to learn. With FPU=1, the move
> probabilities are much more uniform, and the moves played are
> consequentially much more likely to be bad or even blunders, because
> there are less playouts that can be spent on the best move, even if it
> was found.
>
> > The initial value of Q is not very important because Q+U is
> > dominated by the U piece when the number of visits is small.
>
> a = Q(s, a) + coeff * P(s,a) * (sqrt(parent->visits) / 1.0f +
> child->visits());
>
> Assume parent->visits = 100, sqrt = 10
> Assume child->visits = 0
> Assume P(s, a) = 0.0027 (near uniform prior for "weak" network)
>
> The right most side of this (U term) is ~1. This clearly does not
> dominate the Q term. If Q > 1 (classic FPU) then every child node will
> get expanded. If Q = 0 (Q(s, a) = 0) then the first picked child
> (largest policy prior) will get something like 10 expansions before
> another child gets picked. That's a massive difference in search tree
> shape, *especially* with only 800 total playouts.
>
> --
> GCP
> _______________________________________________
> Computer-go mailing list
> Computer-go at computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://computer-go.org/pipermail/computer-go/attachments/20171206/b3a90842/attachment-0001.html>
More information about the Computer-go
mailing list