[Computer-go] Playout policy optimization

Gian-Carlo Pascutto gcp at sjeng.org
Mon Feb 13 00:19:09 PST 2017


On 12/02/2017 5:44, Álvaro Begué wrote:

> I thought about this for about an hour this morning, and this is what I
> came up with. You could make a database of positions with a label
> indicating the result (perhaps from real games, perhaps similarly to how
> AlphaGo trained their value network). Loop over the positions, run a few
> playouts and tweak the move probabilities by some sort of reinforcement
> learning, where you promote the move choices from playouts whose outcome
> matches the label, and you discourage the move choices from playouts
> whose outcome does not match the label.
> 
> The point is that we would be pushing our playout policy to produce good
> estimates of the result of the game, which in the end is what playout
> policies are for.
> 
> Any thoughts? Did anyone actually try something like this?

This is how Facebook trained the playout policy of Darkforest. I
couldn't tell from the paper, but inspecting the code shows exactly this
algorithm at work.
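
For readers who want something concrete, here is a minimal sketch of the
scheme described in the quoted message. It is not taken from the
Darkforest code. Everything in it is an assumption made for illustration:
the toy game (a pile of stones, take 1 to 3, whoever takes the last stone
wins), the softmax policy with one weight per (stones mod 4, move) pair,
the learning rate, and all function names. The update is a plain
REINFORCE step with reward +1 when the playout result matches the label
and -1 when it does not.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def legal_moves(stones):
    """Toy game (assumption): take 1, 2 or 3 stones; last stone wins."""
    return [m for m in (1, 2, 3) if m <= stones]


def new_weights():
    """One weight per (stones mod 4, move) feature, all starting at zero."""
    return {(r, m): 0.0 for r in range(4) for m in (1, 2, 3)}


def move_probs(weights, stones):
    """Return the legal moves and the policy's probability for each."""
    moves = legal_moves(stones)
    probs = softmax([weights[(stones % 4, m)] for m in moves])
    return moves, probs


def playout(weights, stones, rng):
    """Play to the end with the policy (requires stones >= 1).

    Returns (winner, trajectory); winner is 0 if the side to move in the
    starting position wins, 1 otherwise.
    """
    player, trajectory = 0, []
    while stones > 0:
        moves, probs = move_probs(weights, stones)
        move = rng.choices(moves, weights=probs)[0]
        trajectory.append((stones, move))
        stones -= move
        if stones == 0:
            return player, trajectory
        player = 1 - player


def reinforce(weights, trajectory, matched, lr=0.1):
    """Promote the playout's moves if its result matched the label,
    discourage them otherwise (gradient of log-probability, signed)."""
    sign = 1.0 if matched else -1.0
    for stones, move in trajectory:
        moves, probs = move_probs(weights, stones)
        for m, p in zip(moves, probs):
            grad = (1.0 if m == move else 0.0) - p
            weights[(stones % 4, m)] += lr * sign * grad


def train(positions, weights, playouts_per_position=4, epochs=50, seed=0):
    """Loop over labelled positions, run a few playouts, tweak the policy."""
    rng = random.Random(seed)
    for _ in range(epochs):
        for stones, label in positions:
            for _ in range(playouts_per_position):
                winner, trajectory = playout(weights, stones, rng)
                reinforce(weights, trajectory, winner == label)
    return weights
```

To try it, label each position with the perfect-play winner (in this toy
game the side to move wins unless the pile is a multiple of 4), for
example positions = [(n, 0 if n % 4 else 1) for n in range(1, 21)], and
call train(positions, new_weights()). Note that the update treats the
moves of both sides alike, as in the description above; whether that is
the right choice for a real playout policy is an open question here.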

-- 
GCP

