[Computer-go] Playout policy optimization
sheppardco at aol.com
Sun Feb 12 10:14:39 PST 2017
If your database is composed of self-play games, then the likelihood maximization policy should gain strength rapidly, and there should be a way to have asymptotic optimality. (That is, the patterns alone will play a perfect game in the limit.)
Specifically: play self-play games using an asymptotically strong search policy, such as UCT, and then add all of the moves played by winning players to the training set. (Never train to the moves played by losers if your goal is asymptotic optimality.)
The intuition about why this results in strong play: at any moment in self-play training, the search engine is always working on situations that it believes are most equal for both sides. But then it turns out that one side is better and wins, so that data is added to the training set. Reinforcement learning then biases the pattern base to make the winning pathway more likely in future searches. The winning alternatives will be easier to see in future self-play games. Inevitably the losing player will prefer a different path, and the learning process continues.
Regarding asymptotic optimality: obviously you need a pattern set that could potentially be asymptotically optimal. For example, a general decision tree, rather than any fixed field of view. If you have such a thing, then the proof for asymptotic optimality goes along these lines: first show that the tree search and pattern base combination will eventually memorize the winning moves at terminal nodes, and then apply the same reasoning recursively.
Regarding trying this: I was just starting to code this when AlphaGo debuted, and then I stopped working on Go. So: no, I have not tried this.
From: Computer-go [mailto:computer-go-bounces at computer-go.org] On Behalf Of Álvaro Begué
Sent: Saturday, February 11, 2017 11:44 PM
To: computer-go <computer-go at computer-go.org>
Subject: [Computer-go] Playout policy optimization
I remember an old paper by Rémi Coulom ("Computing Elo Ratings of Move Patterns in the Game of Go") where he computed "gammas" (exponentials of scores that you could feed to a softmax) for different move features, which he fit to best explain the move probabilities from real games.
Similarly, AlphaGo's paper describes how their rollout policy's weights are trained to maximize log likelihood of moves from a database.
However, there is no a priori reason why imitating the probabilities observed in reference games should be optimal for this particular purpose.
I thought about this for about an hour this morning, and this is what I came up with. You could make a database of positions with a label indicating the result (perhaps from real games, perhaps similarly to how AlphaGo trained their value network). Loop over the positions, run a few playouts and tweak the move probabilities by some sort of reinforcement learning, where you promote the move choices from playouts whose outcome matches the label, and you discourage the move choices from playouts whose outcome does not match the label.
The point is that we would be pushing our playout policy to produce good estimates of the result of the game, which in the end is what playout policies are for.
Any thoughts? Did anyone actually try something like this?
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Computer-go