[Computer-go] Alphago Zero - Pulling on your shoelaces with a 1600N force

patrick.bardou at laposte.net patrick.bardou at laposte.net
Mon Oct 23 09:27:41 PDT 2017



DeepMind's paper on AlphaGo Zero (AGZ) gave an unexpected and beautiful answer to the question I posted, somewhat confusedly, here a while ago: 


http://computer-go.org/pipermail/computer-go/2017-January/009828.html 
http://computer-go.org/pipermail/computer-go/2017-February/009894.html 

  

==> MCTS is indeed a powerful policy improvement operator (for go at least). It sounds like you can lift yourself up by pulling on your own shoelaces, as we say in France. 


What strikes me, however, is one aspect of the "from scratch" nature of this MCTS policy improvement by self-play, for the following reasons: 

  

If I understood it correctly, the workhorse of this process is: 

  

    Pull (p,v) towards (P,V) = MCTS(p,v,N), 

  

where (P,V) are the refined policy and value estimates resulting from an MCTS search done with some exploration budget N (e.g. 1600 nodes, as in DeepMind's paper). If p and v already contain some go knowledge, it is conceivable that MCTS(p,v,N) will provide a better target for improving (p,v). We know that MCTS, even with purely random rollouts, improves over random play. 
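
Concretely, the "pull" in the paper is a combined loss l = (z - v)^2 - pi^T log p (plus L2 regularization), where pi is the MCTS visit distribution and z the game outcome. A minimal NumPy sketch of that loss (my own toy formulation, not DeepMind's code; L2 term omitted):

```python
import numpy as np

def agz_loss(p_logits, v_pred, pi_mcts, z):
    """AGZ-style training target: pull the network policy p towards the
    MCTS visit distribution pi (cross-entropy), and the value v towards
    the game outcome z (squared error)."""
    log_p = p_logits - np.log(np.sum(np.exp(p_logits)))  # log-softmax
    policy_loss = -np.dot(pi_mcts, log_p)                # -pi^T log p
    value_loss = (z - v_pred) ** 2                       # (z - v)^2
    return policy_loss + value_loss
```

Note that z only enters through the value term: if the self-play games never reach a meaningful outcome, that term carries no information.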

  

But with a randomly initialized (p,v), in the initial phase of learning, (P,V) = MCTS(p,v,N) remains completely agnostic of the goal of the game until at least one leaf node reaches a terminal position. In this starting phase, the learning 'signal' can only come from terminal leaf positions, for which the evaluation by the value network is replaced by ground truth, i.e. the game outcome. 

  

Without terminal positions among the leaf nodes of the tree, the MCTS operator can definitely improve nothing from scratch. 
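
The mechanism at stake can be sketched as follows (the `Position` interface is hypothetical, just for illustration): the value network is consulted at every search leaf EXCEPT terminal ones, where the true outcome replaces it. Early on, the network output is pure noise, so terminal leaves are the only source of real signal.

```python
from dataclasses import dataclass

@dataclass
class Position:
    """Minimal stand-in for a game state (hypothetical interface)."""
    terminal: bool = False
    outcome: float = 0.0  # +1 / -1 Tromp-Taylor result, if terminal

def leaf_value(pos, value_net):
    """Evaluate a search leaf: ground truth at terminal positions,
    learned (initially random) estimate everywhere else."""
    if pos.terminal:
        return pos.outcome      # ground truth: the game outcome
    return value_net(pos)       # noise at the start of training
```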

  

1) Getting some signal for the value network: 

  

Now, consider the 1600-node exploration budget reported by DeepMind. An SL-trained policy network will distribute probability mass sparsely over a restricted number of moves, and a 1600-node MCTS tree might be quite deep. Conversely, a randomly initialized policy will (a priori, in my intuition) distribute probability over a much larger number of moves at each level of the tree. Consequently, a 1600-node tree will probably be rather wide and not so deep. Let's assume it is on average no deeper than 40 (maybe I'm wrong here; but you could replace that with 80 and it wouldn't change my point). 

  

Since "AlphaGo Zero uses Tromp–Taylor scoring during MCTS simulations and self-play training", games might be quite long, much longer than the standard 200-300 moves of human games. The paper indeed mentions: "Games terminate when both players pass or after 19 × 19 × 2 = 722 moves" (i.e. allowing many captures and recaptures). Let's assume a conservative average game length of 400 moves. 

  

Assuming a uniform sampling of positions (as in the paper), only positions within ~40 moves of the end of the game can have terminal positions in their tree, i.e. 40/400 = 10% of them. This means that the trees developed from 90% of the sampled positions will contain no terminal positions and will thus be completely useless for bootstrapping (just pseudo-random noise with respect to the goal of the game). And for (P,V) = MCTS(p,v,N) to be an enhanced target over (p,v), the search tree should preferably contain not just one but a substantial number of terminal positions, which may restrict the proportion of positions with a significant 'signal' on which to bootstrap (p,v) to a few percent. 
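
The arithmetic behind that 90% figure, under my assumptions above (400-move games, trees no deeper than 40):

```python
avg_game_length = 400  # assumed average self-play game length (moves)
max_tree_depth = 40    # assumed average depth of a 1600-node MCTS tree

# Only positions within max_tree_depth moves of the end can have
# terminal leaves in their search tree; all others get zero signal.
frac_with_signal = max_tree_depth / avg_game_length
frac_without = 1 - frac_with_signal
print(f"{frac_with_signal:.0%} of sampled positions can see terminal leaves")
print(f"{frac_without:.0%} of sampled positions see no terminal position")
```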

  

2) Getting some contrast to learn a policy 

  

Bootstrapping a policy network seems challenging, even with an embryo of a value network. Between strong players, if no major mistake happens, the game can stay balanced all the way through and the score might be close to zero (e.g. -3.5 pts or +5 pts), so that moves matter until the end and may drastically influence the outcome (+1/-1). 

Conversely, in AGZ's early stage, since the games are self-played by a randomly initialized policy, they amount more or less to a random walk and should result in a high variance of the scores. Most of the positions close to the endgame (the ones with "high signal content") will be one-sided in terms of outcome, i.e. from such a position, most terminal positions reached will have the same value, either +1 or -1. From the root position, most moves will appear to have the same action value. There will be little "contrast" between possible moves until the value network improves substantially. 
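
To illustrate the lack of contrast, here is a toy model (entirely my own, with made-up numbers): when a near-endgame position carries a score imbalance that dwarfs the perturbation the remaining random moves can cause, rollouts from every candidate move share one outcome sign, and all action values saturate to the same +1.

```python
import random

def toy_action_values(n_moves=5, n_rollouts=500,
                      score_bias=30.0, move_noise=5.0, seed=0):
    """Toy model of a one-sided near-endgame position: the current
    score imbalance (score_bias) is much larger than the score
    perturbation of the remaining random moves (move_noise), so the
    average outcome is the same for every candidate move."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_moves):
        outcomes = [1 if score_bias + rng.gauss(0, move_noise) > 0 else -1
                    for _ in range(n_rollouts)]
        values.append(sum(outcomes) / n_rollouts)
    return values  # near-identical values -> no contrast between moves
```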

  

In brief, only a small fraction of the sampled positions will find some endgame signal through an MCTS search with a 1600N (node) budget, and even that fraction will be quite noisy.  E pur si muove! 

  

I had thought before about the problem of learning go from scratch, and for bootstrapping policy and value networks (p,v) from scratch I had imagined something like: 

1) Generate a batch of self-play games until the end (with a random player; good enough for a start) and score them with Tromp-Taylor rules; determine the outcome z = +1/-1; 
2) Train the value network v on these terminal positions s to predict the outcome of the game, up to some accuracy. This would yield a value network specialized in final endgame positions, but would distill some first go knowledge into that network; 
3) Generate new games by repeating 1); 
4) Train (p,v) using the AGZ-like MCTS learning process, but restricting sampling to positions near the endgame (say, fewer than 20 moves before the end of the game), so that a significant proportion of leaf nodes are terminal positions and thus receive ground-truth evaluation. This would teach the value network to generalize a bit, to almost-finished game positions, and would allow deriving an embryo of a policy network, also specialized in the endgame. 

Then you can consider starting to pull on your shoelaces: 
5) Repeat 3) & 4) while gradually widening the beam of endgame positions toward earlier positions in the game; 
6) Once some criterion is met (e.g. prediction accuracy), replace random play in 3) with the learned policy p, and repeat 3) & 4) until you plateau. 
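
Steps 1)-6) above could be sketched as the following pseudocode (all helper names are hypothetical, and the widening schedule is arbitrary):

```
window <- 20                        # moves before the end of the game
games  <- random_self_play()                          # steps 1) & 3)
repeat:
    positions <- sample(games, last `window` moves)   # step 4)
    train (p,v) with AGZ-like MCTS targets on positions
    if value-accuracy criterion is met:               # step 6)
        window <- window + delta                      # step 5): widen beam
        games  <- self_play(p)       # replace random play by learned p
until playing strength plateaus
```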

  

Of course, the foregoing is just wrong, or at least not mandatory, as demonstrated by DeepMind's results. An initial focus on endgame positions can simply be skipped. At least, DeepMind's paper does not mention applying any sampling bias towards endgame positions in the early stage of the learning process. 

  

And their graph of Elo vs. training time (Fig. 3) shows absolutely no ignition delay. Playing strength increases from the very first hours. 

  

Still mysterious to me. But what a wonderful result! 

  

Regards, 
Patrick 

