[Computer-go] action-value Q for unexpanded nodes
ajahuang at gmail.com
Wed Dec 6 07:11:07 PST 2017
2017-12-06 13:52 GMT+00:00 Gian-Carlo Pascutto <gcp at sjeng.org>:
> On 06-12-17 11:47, Aja Huang wrote:
> > All I can say is that first-play-urgency is not a significant
> > technical detail, and what's why we didn't specify it in the paper.
> I will have to disagree here. Of course, it's always possible I'm
> misunderstanding something, or I have a program bug that I'm mixing up
> with this.
No matter I agree with you or not, unfortunately it's not up to me to
decide whether I can answer the question, even if I am personally happy to
(in fact, this post might be already exceeding my barrier a bit). I hope
you understand, and good luck with making it works.
I'm very happy the two Go papers we published have helped the Go community.
My dream was fulfilled and I've switched to pursue other challenges. :)
> Or maybe you mean that you expect the program to improve regardless of
> this setting. In any case, I've now seen people state here twice that
> this is detail that doesn't matter. But practical results suggest
> For a strong supervised network, FPU=0 (i.e. not exploring all successor
> nodes for a longer time, relying strongly on policy priors) is much
> stronger. I've seen this in Leela Zero after we tested it, and I've
> known it to be true from regular Leela for a long time. IIRC, the strong
> open source Go bots also use some form of progressive widening, which
> produces the same effect.
> For a weak RL network without much useful policy priors, FPU>1 is much
> stronger than FPU=0.
> Now these are relative scores of course, so one could argue they don't
> affect the learning process. But they actually do that as well!
> The new AZ paper uses MCTS playouts = 800, and plays proportionally
> according to MCTS output. (Previous AGZ had playouts = 1600,
> proportional for first 30 moves).
> Consider what this means for the search probability outputs, exactly the
> thing the policy network has to learn. With FPU=1, the move
> probabilities are much more uniform, and the moves played are
> consequentially much more likely to be bad or even blunders, because
> there are less playouts that can be spent on the best move, even if it
> was found.
> > The initial value of Q is not very important because Q+U is
> > dominated by the U piece when the number of visits is small.
> a = Q(s, a) + coeff * P(s,a) * (sqrt(parent->visits) / 1.0f +
> Assume parent->visits = 100, sqrt = 10
> Assume child->visits = 0
> Assume P(s, a) = 0.0027 (near uniform prior for "weak" network)
> The right most side of this (U term) is ~1. This clearly does not
> dominate the Q term. If Q > 1 (classic FPU) then every child node will
> get expanded. If Q = 0 (Q(s, a) = 0) then the first picked child
> (largest policy prior) will get something like 10 expansions before
> another child gets picked. That's a massive difference in search tree
> shape, *especially* with only 800 total playouts.
> Computer-go mailing list
> Computer-go at computer-go.org
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Computer-go