[Computer-go] Thought on LeelaZero training

patrick.bardou at laposte.net patrick.bardou at laposte.net
Fri Mar 9 06:49:22 PST 2018


Hi, 

Generating self-play games represents most of the computational burden in the LZ project. With the current setting, games are generated with a budget of 1000 nodes/move. As a rough guide, assuming a 250-move game and ignoring resignation and possible tree reuse, generating a game, and thus a training sample, costs ~250,000 nodes of computation. 
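A back-of-the-envelope check of that figure, in Python (the numbers are the assumptions above, not measurements from LZ):

# Rough cost of one self-play game, i.e. one training sample
# (assumes no resignation and no tree reuse, as stated above)
moves_per_game = 250     # assumed average game length
Np = 1000                # current LZ node budget per move
nodes_per_game = moves_per_game * Np
print(nodes_per_game)    # 250000 nodes per game/sample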

I am wondering if one can get more juice from this computation budget. 

* The Apprentice (the policy head of the network) has some current playing strength (e.g. if used as a greedy player). 
* MCTS guided by the Apprentice (policy and value heads) makes a stronger Player, which is used to generate the games from which training positions will be sampled. In LZ, games are played by a Player using a budget of Np=1000 nodes (Nd) for each move. 
* For each training sample, the target for imitation learning of the Apprentice (policy head) is estimated by an Expert, stronger than the Apprentice. The Expert likewise evaluates a position with a guided MCTS search, using a node budget Ne. In the current AGZ/AZ/LZ setting, this Expert is just the Player itself, i.e. training targets derive from game-playing data, namely the visit counts at the root for the sampled position (a small sketch follows this list). Ne=Np is indeed a free lunch. 
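To make the Expert's target concrete, here is a minimal Python sketch of turning root visit counts into an imitation target, AGZ-style (the function name and the temperature handling are my own illustration, not LZ code):

import numpy as np

def visit_count_target(root_visits, temperature=1.0):
    # Policy-head imitation target: root visit counts raised to 1/T,
    # then normalized to a probability distribution.
    v = np.asarray(root_visits, dtype=np.float64) ** (1.0 / temperature)
    return v / v.sum()

# e.g. a root whose 4 children received 600/250/100/50 visits:
print(visit_count_target([600, 250, 100, 50]))  # [0.6 0.25 0.1 0.05]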

At some point in the training, when the strength graph is plateauing, leaving aside increasing the network size once more, one might consider: 
(1) sampling training examples from games self-played by a stronger Player, by increasing the node budget per move Np (e.g. 1000 Nd --> 2000 Nd); this would slow down sample generation proportionally. 
(2) keeping Np unchanged and only increasing the strength of the Expert. This could be done by pushing the MCTS search of each sampled position to a higher node budget Ne > Np: for example, still self-playing games at 1000 Nd/move but evaluating sampled positions with a 25,000 Nd budget. That is an extra cost of +10% on overall computation time for (presumably) better targets, resulting in (possibly, yet to be proved) higher Elo gain per sample; going from 1000 Nd to 5000 Nd targets would cost 'only' +2% overall (the arithmetic is spelled out just below). 
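The overhead arithmetic for option (2), with the same assumed figures as above:

# Extra cost of re-searching one sampled position per game with a
# deeper Expert budget Ne, relative to the ~250,000-node game cost
moves_per_game, Np = 250, 1000
game_cost = moves_per_game * Np
for Ne in (5000, 25000):
    print(f"Ne={Ne}: +{Ne / game_cost:.0%} overall")
# Ne=5000: +2% overall ; Ne=25000: +10% overall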

The bottom line is of course the return on investment: would the Elo gain speedup (per sample), if any, overcome the 10% slowdown in the game generation rate? What matters more: generating games at a higher rate, increasing Player strength, or increasing Expert strength? Is there a sweet spot? This may vary a lot according to where we stand on the learning curve. 




I am fully aware that the AGZ/AZ papers definitively prove that super-human level can be reached with a (quite) low, constant node/move budget and an Expert = Player setting, as now used by LZ. But that was with tremendous computational power on DeepMind's side in the case of AZ. To reach ~5200 Elo, 75% of AZ training (30 of 40 days) occurred in an asymptotic region above 4500 Elo: 3 days to reach AlphaGo Lee level from scratch, but 27 extra days to reach AlphaGo Master level. If the goal of the project is to replicate AGZ/AZ up to Master level, plateauing will be daily bread for a very long time. Hence my thought-question. 

I didn't find any indication or attempt in the planning-based RL literature dealing with a split between Apprentice / Player (or Actor) / Expert (or Critic), only extreme cases like: 

- In "Thinking Fast and Slow with Deep Learning and Tree Search" ( https://arxiv.org/pdf/1705.08439.pdf ), Np=1 (greedy policy or stochastic policy), Player = Apprentice 
- In AlphaGo Zero or Alpha Zero papers, Ne=Np, tree search data from played games are re-used as Expert's evaluation (i.e. Expert = Player). 

A major drawback of this split, given the current LZ distributed scheme (if I got it right), is that the sampling of positions would have to be decided at the worker level, once the game ended, and an extra MCTS search with the increased Ne node budget would have to be run locally at the worker before pushing the data to the training server. Alternatively, improved targets would have to be generated a posteriori in a centralized manner on the training server, which might represent an unacceptable burden. A hedged sketch of the worker-side variant follows. 
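For concreteness, here is what the worker-side loop might look like in Python (every identifier below is hypothetical; self_play_game, mcts_search and upload_to_training_server stand in for actual LZ engine and client code, so this is a sketch, not a runnable client):

import random

def worker_iteration(net, Np=1000, Ne=25000, samples_per_game=1):
    # Hypothetical worker loop implementing the Player/Expert split.
    game = self_play_game(net, nodes_per_move=Np)      # placeholder: LZ self-play
    records = []
    for pos in random.sample(game.positions, samples_per_game):
        # Expert pass: re-search the sampled position with the larger budget
        root = mcts_search(net, pos, nodes=Ne)         # placeholder search
        records.append((pos, visit_count_target(root.child_visits), game.outcome))
    upload_to_training_server(records)                 # placeholder upload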

Regards, 
Patrick 

