[Computer-go] AlphaGo Zero
gcp at sjeng.org
Wed Oct 18 13:33:00 PDT 2017
On 18/10/2017 19:50, cazenave at ai.univ-paris8.fr wrote:
Select quotes that I find interesting from a brief skim:
1) Using a residual network was more accurate, achieved lower error, and
improved performance in AlphaGo by over 600 Elo.
2) Combining policy and value together into a single network slightly
reduced the move prediction accuracy, but reduced the value error and
boosted playing performance in AlphaGo by around another 600 Elo.
These gains sound very high (much higher than previous experiments with
them reported here), but are likely due to the joint training.
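For reference, a minimal sketch of what "joint training" means here, assuming the combined loss as I read it in the paper: squared value error, plus cross-entropy between the search probabilities and the policy output, plus an L2 penalty. The function name, argument names and default c are my own picks for illustration, not anything from their code.

```python
import math

def joint_loss(policy_probs, search_probs, value, outcome, weights=(), c=1e-4):
    """Combined loss for one network with a policy head and a value head.

    policy_probs: move probabilities p output by the network (sum to 1)
    search_probs: target move probabilities pi from the search (sum to 1)
    value:        network value prediction v in [-1, 1]
    outcome:      game result z (+1 or -1) from the current player's view
    weights:      flat iterable of network parameters, for the L2 term
    c:            L2 regularisation strength (assumed default)
    """
    # Value head: squared error against the game outcome.
    value_error = (outcome - value) ** 2
    # Policy head: cross-entropy against the search probabilities.
    eps = 1e-12  # guards against log(0)
    cross_entropy = -sum(t * math.log(p + eps)
                         for t, p in zip(search_probs, policy_probs))
    # Shared L2 penalty over all parameters.
    l2 = c * sum(w * w for w in weights)
    return value_error + cross_entropy + l2
```

Because both terms are minimised through the same shared layers, the value target also acts as a regulariser on the policy features and vice versa, which may be part of why the combined network does better.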
3) The raw neural network, without using any lookahead, achieved an Elo
rating of 3,055. ... AlphaGo Zero achieved a rating of 5,185.
The increase of 2000 Elo from tree search sounds very high, but this may
just mean the value network is simply very good - and perhaps relatively
better than the policy one. (They previously had problems there, in that
SL > RL for the policy network guiding the tree search - but I'm not sure
there's any relation.)
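To put the numbers in perspective, a quick sketch, assuming the ratings follow the standard logistic Elo model on a 400-point scale (an assumption on my part; the function name is made up):

```python
def expected_score(elo_diff):
    """Expected score of the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

raw_network = 3055   # quoted rating of the raw network, no lookahead
with_search = 5185   # quoted rating of AlphaGo Zero with search
gap = with_search - raw_network

print(gap)                   # 2130
print(expected_score(600))   # roughly 0.97
print(expected_score(gap))   # very close to 1
```

So each of the 600 Elo steps in 1) and 2) would already mean winning about 97% of games against the previous version, and a gap over 2000 means the raw network essentially never wins against the searching one.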
4) History features X_t, Y_t are necessary because Go is not fully
observable solely from the current stones, as repetitions are forbidden.
This is a weird statement. Did they need 17 planes just to check for ko?
It seems more likely that history features are very helpful for the
internal understanding of the network as an optimization. That sucks
though - it's annoying for analysis and position setup.
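A rough sketch of how I understand the 17 planes to be laid out: 8 past positions, each split into own stones (X) and opponent stones (Y), plus one colour-to-play plane. The board encoding, the function names and the zero-filling of missing history are assumptions for illustration.

```python
SIZE = 19      # board size
HISTORY = 8    # number of positions fed to the network, as I read the paper

def stone_plane(board, colour):
    """Binary plane: 1 where `board` has a stone of `colour`, else 0."""
    return [[1 if board[y][x] == colour else 0 for x in range(SIZE)]
            for y in range(SIZE)]

def input_planes(history, to_play):
    """Build the 17-plane input from a list of boards, newest first.

    history: list of boards, each SIZE x SIZE of 'B', 'W' or '.'
    to_play: 'B' or 'W'
    Positions that are not available are filled with empty boards.
    """
    opponent = 'W' if to_play == 'B' else 'B'
    empty = [['.'] * SIZE for _ in range(SIZE)]
    planes = []
    for t in range(HISTORY):
        board = history[t] if t < len(history) else empty
        planes.append(stone_plane(board, to_play))    # X plane, t moves ago
        planes.append(stone_plane(board, opponent))   # Y plane, t moves ago
    colour = 1 if to_play == 'B' else 0
    planes.append([[colour] * SIZE for _ in range(SIZE)])  # colour plane
    return planes
```

For a set-up position you only have history[0], so 14 of the 16 stone planes come out empty, which is an input the network rarely saw during self-play. That is the analysis annoyance.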
Lastly, the entire training procedure is actually not very complicated
at all, and it's encouraging that the training is "faster" than previous
approaches - but many things look fast if you can throw 64 GPU workers
at a problem.
In this context, the graphs of the differing network architectures
causing huge strength discrepancies are both good and bad. Make a
better pick and you can get massively better results; make a bad
pick and you won't come close.