[Computer-go] Reinforcement learning of move predictor in MCTS

patrick.bardou at laposte.net
Fri Feb 10 00:16:18 PST 2017

A question / thought on the move predictor used to bias search in MCTS:

The policy network used as the move recommendation function in MCTS, following the AlphaGo Nature paper, is optimized by SL to predict expert moves. This policy can then be optimized by RL to win games (in greedy play mode). However, an MCTS agent using this RL policy for move recommendation performs worse than an MCTS agent using the SL policy. This raises the question of how to go beyond move predictors learnt from human experts.


Is there any reinforcement learning method to directly optimize a move recommendation function (i.e. towards the goal of making the corresponding MCTS agent stronger)?
From an RL theoretical point of view, is it possible to define a target for the move recommendation function in an MCTS agent? This may also depend on how the prior probabilities are used to bias the search in MCTS.
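To make concrete how priors can bias the search, here is a minimal sketch of a PUCT-style selection step of the kind described in the AlphaGo paper: pick the action maximizing Q(s,a) + u(s,a), with u(s,a) proportional to P(s,a) / (1 + N(s,a)). The function name, the value of c_puct and the numbers in the example are assumptions chosen for illustration only.

```python
import math

def puct_select(priors, visit_counts, q_values, c_puct=5.0):
    """Return the index of the action maximizing Q(s,a) + u(s,a).

    u(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))
    c_puct is an exploration constant; 5.0 here is only an assumed value.
    """
    total_visits = sum(visit_counts)
    best_action, best_score = None, -math.inf
    for action, (p, n, q) in enumerate(zip(priors, visit_counts, q_values)):
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        score = q + u
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Hypothetical two-move position: move 0 has the higher prior,
# move 1 ends up with the higher action-value.
priors = [0.7, 0.3]

# Early in the search the prior term dominates.
print(puct_select(priors, visit_counts=[1, 1], q_values=[0.0, 0.1]))

# After many visits the bonus u shrinks and the action-value dominates.
print(puct_select(priors, visit_counts=[5000, 5000], q_values=[0.4, 0.6]))
```

With these made-up numbers the first call picks the high-prior move and the second picks the high-value move, which is the behaviour the prior is supposed to produce under this kind of rule.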


Quoting Silver et al. about their variant of PUCT used in the selection phase of AlphaGo: "this search control strategy initially prefers actions with high prior probability and low visit count, but asymptotically prefers actions with high action-value". Thus, would it be possible to use action-values estimated after some MCTS search budget (let's say 10000 nodes to illustrate) as targets for optimizing the move recommendation function (e.g. through conversion by a softmax)?
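A rough sketch of what I have in mind, assuming the root action-values after the search are available as a plain list: convert them to a target distribution with a softmax, then measure how far the current prior is from that target with a cross-entropy loss. The temperature, the helper names and all the numbers below are hypothetical; a real setup would minimize this loss over the policy network's parameters.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of action-values."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Hypothetical root statistics after a fixed search budget.
q_after_search = [0.52, 0.48, 0.30]   # action-values Q(s,a) from MCTS
prior = [0.2, 0.5, 0.3]               # current move predictor output P(s,a)

# The temperature is an assumed free parameter: lower values
# concentrate the target on the best-valued move.
target = softmax(q_after_search, temperature=0.05)
loss = cross_entropy(target, prior)

print(target)
print(loss)
```

How sharp the target should be (the temperature) is itself an open choice here, since action-values of good moves are often close to each other.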


I am aware this might be a naive approach with many pitfalls:
- it remains to be proven that such a policy would make the whole MCTS perform better; moving away from human-learnt predictions might just weaken the MCTS agent;
- reinforcement of the move predictor may not be a key issue for the strength of today's MCTS programs;
- "asymptotically prefers actions with high action-value" might be just a misleading perspective, because of the word "asymptotic" ;-)
- generating pairs of prior / post-search probabilities for SL would be very expensive in practice, making this totally intractable.
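To give an idea of the cost in the last point, a back-of-the-envelope sketch: one search of the given budget per training position. The number of training positions below is a purely hypothetical figure; only the 10000-node budget comes from the example above.

```python
def total_search_nodes(num_positions, nodes_per_search=10_000):
    """Total MCTS nodes needed to label every position with one search."""
    return num_positions * nodes_per_search

# Hypothetical training set size, for illustration only.
num_positions = 1_000_000
print(total_search_nodes(num_positions))
```

Each node typically implies at least one network evaluation or rollout, so the total scales linearly with both the dataset size and the search budget.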



