[Computer-go] AlphaGo Zero
sheppardco at aol.com
Wed Oct 18 15:52:02 PDT 2017
I also expected bootstrapping by self-play. (I also wrote a post to that effect. But of course, DeepMind actually DID IT.)
But I didn't envision any of the other stuff. This is why I love their papers. Papers from most sources are predictable, skimpy, and sketchy, but theirs contain all sorts of deep insights that I never saw coming. And the theory, architecture, implementation, and explanation are all first-rate. It's like the Poker papers from U Alberta, or the source code for Stockfish. Lessons on every line.
Regarding Elo deltas: the length of Go games obscures what might be very small differences. E.g., if one player's moves are 3% more likely to be game-losing errors, then won't that player lose nearly every game? That works out to about 3 more "blunders" per game, if each player makes roughly 100 moves.
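A rough sketch of that arithmetic, in case it helps. The 100 moves per player and the independence of errors from move to move are my assumptions for illustration, not measured figures; only the 3% comes from the paragraph above.

```python
# Back-of-the-envelope check of the "3% more errors per move" argument.
# ASSUMPTIONS (illustrative only): about 100 moves per player per game,
# and each move is an independent chance to make a game-losing error.
MOVES_PER_PLAYER = 100
EXTRA_ERROR_RATE = 0.03  # the 3% figure from the post


def expected_extra_blunders(rate: float, moves: int) -> float:
    """Expected number of additional game-losing errors in one game."""
    return rate * moves


def prob_at_least_one_extra_blunder(rate: float, moves: int) -> float:
    """Chance that the weaker player makes at least one additional error."""
    return 1.0 - (1.0 - rate) ** moves


if __name__ == "__main__":
    print(expected_extra_blunders(EXTRA_ERROR_RATE, MOVES_PER_PLAYER))
    print(prob_at_least_one_extra_blunder(EXTRA_ERROR_RATE, MOVES_PER_PLAYER))
```

Under those assumptions the weaker player makes at least one extra game-losing error in the large majority of games, which is the sense in which a small per-move gap could turn into a large Elo gap.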
Regarding these details: at some level, all of these *must* be artifacts of training. That is, the NN architectures that did "badly" are still asymptotically optimal, so they should also eventually play equally well, provided that training continues indefinitely, and the networks are large enough, and parameters do not freeze prematurely, and training eventually uses only self-play data, etc. I believe that is mathematically accurate, so I would ask a different question: why do those choices make better use of resources in the short run?
I have no idea; I'm just asking.
From: Computer-go [mailto:computer-go-bounces at computer-go.org] On Behalf Of Gian-Carlo Pascutto
Sent: Wednesday, October 18, 2017 5:40 PM
To: computer-go at computer-go.org
Subject: Re: [Computer-go] AlphaGo Zero
On 18/10/2017 22:00, Brian Sheppard via Computer-go wrote:
> This paper is required reading. When I read this team’s papers, I
> think to myself “Wow, this is brilliant! And I think I see the next step.”
> When I read their next paper, they show me the next *three* steps.
Hmm, interesting way of seeing it. Once they had Lee Sedol AlphaGo, it was somewhat obvious that just self-playing that should lead to an improved policy and value net.
And before someone accuses me of Captain Hindsighting here, this was pointed out on this list:
It looks to me like the real devil is in the details. Don't use a residual stack? -600 Elo. Don't combine the networks? -600 Elo. Bootstrap the learning? -300 Elo.
We made 3 perfectly reasonable choices and somehow lost 1500 Elo along the way. I can't get over that number, actually.
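To put that number in perspective, here is a small sketch that converts Elo gaps into expected scores. It assumes the standard logistic Elo formula with a 400-point scale, and it simply sums the three deltas the way the paragraph above does; whether the individual effects really add up like that is not something this code establishes.

```python
# Convert Elo differences into expected scores.
# ASSUMPTION: the standard logistic Elo model, E = 1 / (1 + 10^(-d/400)).
# The three deltas are the figures quoted in this post; the labels are
# shorthand for the choices listed above.
ABLATION_DELTAS = {
    "no residual stack": -600,
    "separate policy and value networks": -600,
    "bootstrap the learning": -300,
}


def expected_score(elo_delta: float) -> float:
    """Expected score for a player rated elo_delta points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))


TOTAL_DELTA = sum(ABLATION_DELTAS.values())

if __name__ == "__main__":
    for name, delta in ABLATION_DELTAS.items():
        print(f"{name}: {delta} Elo, expected score {expected_score(delta):.4f}")
    print(f"combined: {TOTAL_DELTA} Elo, expected score {expected_score(TOTAL_DELTA):.6f}")
```

Under that model a 1500-point deficit corresponds to an expected score well below one win in a thousand games.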
Getting the details right makes a difference. And they're getting them right, either because they're smart, because of experience from other domains, or because they're trying a ton of them. I'm betting on all 3.