[Computer-go] Playout policy optimization
alvaro.begue at gmail.com
Sat Feb 11 20:44:18 PST 2017
I remember an old paper by Rémi Coulom ("Computing Elo Ratings of Move
Patterns in the Game of Go") where he computed "gammas" (exponentials of
scores that you could feed to a softmax) for different move features, which
he fit to best explain the move probabilities from real games.
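
For concreteness, here is a minimal sketch of how such gammas turn into move probabilities: a move's strength is the product of the gammas of the features it has, and its probability is that strength divided by the sum over all candidate moves. The feature names and gamma values below are made up for illustration and are not taken from the paper.

```python
import math

def move_probabilities(candidates, gamma):
    """Return one probability per candidate move.

    candidates: list of feature-name lists, one list per legal move.
    gamma: dict mapping feature name -> positive strength ("gamma").
    """
    # Strength of a move = product of the gammas of its features
    # (a move with no features gets strength 1).
    strengths = [math.prod(gamma[f] for f in feats) for feats in candidates]
    total = sum(strengths)
    return [s / total for s in strengths]

# Hypothetical features and values, for illustration only.
gamma = {"atari": 4.0, "capture": 2.0, "edge": 0.5}
candidates = [["atari", "capture"], ["edge"], []]
probs = move_probabilities(candidates, gamma)
```

Writing gamma as exp(score) makes this the same thing as a softmax over the summed scores of each move's features.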
Similarly, AlphaGo's paper describes how their rollout policy's weights are
trained to maximize log likelihood of moves from a database.
However, there is no a priori reason why imitating the move probabilities
observed in reference games should be optimal for use in playouts.
I thought about this for about an hour this morning, and this is what I
came up with. You could build a database of positions, each labeled with
the game result (perhaps from real games, perhaps generated similarly to
how AlphaGo trained their value network). Loop over the positions, run a
few playouts from each, and tweak the move probabilities by some sort of
reinforcement learning: promote the move choices made in playouts whose
outcome matches the label, and discourage the move choices made in
playouts whose outcome does not match it.
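
One way this could look, as a rough sketch only: a REINFORCE-style update on a softmax policy with one weight per move feature. Everything here is an assumption on my part (the feature names, the +1/-1 reward, the learning rate, the way a playout is recorded as a list of (candidate moves, chosen index) pairs); I have not tried it on real playouts.

```python
import math
from collections import defaultdict

def softmax_probs(theta, candidates):
    """Probability of each candidate move; score = sum of its feature weights."""
    scores = [sum(theta[f] for f in feats) for feats in candidates]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, trajectory, outcome, label, lr=0.1):
    """Nudge theta after one playout.

    trajectory: list of (candidates, chosen_index) pairs recorded in the playout.
    Reward is +1 if the playout outcome matches the label, -1 otherwise
    (an assumed choice, not the only possible one).
    """
    reward = 1.0 if outcome == label else -1.0
    grad = defaultdict(float)
    for candidates, chosen in trajectory:
        probs = softmax_probs(theta, candidates)
        # d/dtheta log pi(chosen) = features of chosen move - expected features
        for f in candidates[chosen]:
            grad[f] += 1.0
        for p, feats in zip(probs, candidates):
            for f in feats:
                grad[f] -= p
    # Apply the accumulated gradient once, so every step used the same theta.
    for f, g in grad.items():
        theta[f] += lr * reward * g

# Tiny hand-made example: one decision between two moves.
theta = defaultdict(float)
step = ([["atari"], ["edge"]], 0)  # the playout chose move 0
before = softmax_probs(theta, step[0])[0]
reinforce_update(theta, [step], outcome=1, label=1)  # outcome matched the label
after = softmax_probs(theta, step[0])[0]
```

In the example the chosen move becomes more likely after a playout that agreed with the label; a playout that disagreed would push it the other way.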
The point is that we would be pushing our playout policy to produce good
estimates of the result of the game, which in the end is what playout
policies are for.
Any thoughts? Did anyone actually try something like this?