[Computer-go] mini-max with Policy and Value network

Gian-Carlo Pascutto gcp at sjeng.org
Tue May 23 03:46:54 PDT 2017

On 22-05-17 21:01, Marc Landgraf wrote:
> But what you should really look at here is Leela's evaluation of the game.

Note that this is completely irrelevant for the discussion about
tactical holes and the position I posted. You could literally plug any
evaluation into it (save for a static oracle, in which case why search
at all...) and it would still have the tactical blindness being discussed.

It's an issue of limitations of the policy network, combined with the
way one uses the UCT formula. I'll use the one from the original AlphaGo
paper here, because it's public and should behave even worse:

u(s, a) = c_puct * P(s, a) * sqrt(total_visits) / (1 + child_visits)

Note that P(s, a) is a direct multiplicative factor here, which means that
for a move ignored by the policy network, the exploration term all but
vanishes. In other words, unless the win is immediately visible (and for
tactics it won't be), you're not going to find it. Note also that this
deviates from regular UCT or PUCT, which have no such direct prior factor:
there the prior's influence fades with visits, so the search eventually
becomes more exploratory.
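To make the effect concrete, here is a minimal sketch of the two
exploration terms (the numbers, c_puct value, and function names are
hypothetical, purely for illustration):

```python
import math

def puct(c_puct, prior, total_visits, child_visits):
    # AlphaGo-style exploration term: the prior P(s, a) is a direct
    # multiplicative factor, so a near-zero prior suppresses the move
    # no matter how many simulations are run.
    return c_puct * prior * math.sqrt(total_visits) / (1 + child_visits)

def uct(c, total_visits, child_visits):
    # Plain UCT exploration term: no prior at all, so every unvisited
    # move keeps a non-trivial urgency and is eventually explored.
    return c * math.sqrt(math.log(total_visits) / (1 + child_visits))

total = 10000  # hypothetical parent visit count
# A move the policy network likes vs. one it effectively ignores:
print(puct(5.0, 0.40, total, 0))   # large urgency
print(puct(5.0, 0.001, total, 0))  # tiny urgency, move stays buried
print(uct(1.0, total, 0))          # prior-free urgency for comparison
```

With these numbers the ignored move's PUCT urgency is orders of
magnitude below the favored move's, and also well below the prior-free
UCT term, which is the tactical blindness described above.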

Now, even the original AlphaGo played moves that surprised human pros
and were contrary to established sequences. So where did those come
from? Enough computation power to overcome the low probability?
Synthesized by inference from the (much larger than mine) policy network?
