[Computer-go] AGZ bootstrapping - Benefit of multi-labelled value net

patrick.bardou patrick.bardou at laposte.net
Thu Oct 26 13:51:45 PDT 2017


@Gian-Carlo,
Indeed, according to that paper, a multi-labelled value net / head sounds like a good way to inject more signal into the network, and thus more reinforcement learning signal when learning from scratch.
I was wondering if it could also be beneficial for bootstrapping the policy net / head. Since the score of randomly played games is likely to have high variance, I suppose that in most positions close to a final position the moves will have very similar action values, as estimated by the MCTS search, and hence a weak reinforcement signal (all or most of the moves leading to the same outcome, either a win or a loss). A multi-labelled value head could produce multi-labelled action values, which might have more spread than with a fixed 6.5 komi and would allow a better ranking of moves (by playing on the komi / averaging over the various possible komi?).
In other words, learning to increase the final score might be a good starting point for the policy net / head, at least in the bootstrapping phase. A rough sketch of the komi-averaging idea follows.
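To make this concrete, a minimal Python sketch (the move names, komi range and rank_moves helper are made up for illustration, not taken from the paper):

import numpy as np

# Hypothetical sketch: rank candidate moves by their MCTS action value
# averaged over a range of komi values instead of a single fixed 6.5 komi.

KOMI_VALUES = np.arange(-12.5, 13.5, 1.0)   # 26 komi settings, -12.5 .. 12.5

def rank_moves(q_per_komi):
    """q_per_komi maps a move to an array of action-value estimates,
    one entry per komi in KOMI_VALUES. With a single fixed komi most
    late-game moves share the same win/loss value; averaging over komi
    spreads the values and roughly tracks the final score margin."""
    avg_q = {move: float(np.mean(q)) for move, q in q_per_komi.items()}
    return sorted(avg_q, key=avg_q.get, reverse=True)

# Toy example: both moves win at the fixed komi of 6.5 and are therefore
# indistinguishable there, but Q16 wins by a larger margin and so also
# wins at higher komi values.
q_per_komi = {
    "D4":  np.where(KOMI_VALUES < 7.0, 0.9, 0.1),    # wins by ~7 points
    "Q16": np.where(KOMI_VALUES < 11.0, 0.9, 0.1),   # wins by ~11 points
}
print(rank_moves(q_per_komi))   # -> ['Q16', 'D4']

Under a single fixed komi both toy moves would get the same 0.9 value and could not be ranked; the spread only appears once several komi are averaged.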
This could be combined with prioritized sampling biased towards a low reverse move count from the endgame, as I mentioned in an earlier post and as you propose for the first round of learning the win/loss of the final position.
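A rough sketch of that prioritized-sampling part, assuming each training position is tagged with its game and move index (the function name and temperature are hypothetical):

import numpy as np

# Hypothetical sketch of prioritized sampling: weight training positions by
# how close they are to the end of their game (low "reverse move count"),
# so win/loss labels near the final position dominate the first batches.

def sample_positions(positions, game_lengths, batch_size, temperature=10.0, seed=None):
    """positions: list of (game_index, move_index) pairs."""
    rng = np.random.default_rng(seed)
    # number of moves remaining before the end of that game
    remaining = np.array([game_lengths[g] - m for g, m in positions])
    weights = np.exp(-remaining / temperature)   # endgame positions weighted highest
    weights /= weights.sum()
    chosen = rng.choice(len(positions), size=batch_size, p=weights)
    return [positions[i] for i in chosen]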
Patrick

-------- Original message --------
From: computer-go-request at computer-go.org
Date: 26/10/2017 16:17 (GMT+01:00)
To: computer-go at computer-go.org
Subject: Computer-go Digest, Vol 93, Issue 34


------------------------------

Message: 2
Date: Thu, 26 Oct 2017 15:17:43 +0200
From: Gian-Carlo Pascutto <gcp at sjeng.org>
To: computer-go at computer-go.org
Subject: Re: [Computer-go] AlphaGo Zero
Message-ID: <8c872e71-4864-0a19-d3df-9fe1c48d22b8 at sjeng.org>
Content-Type: text/plain; charset=utf-8

On 25-10-17 16:00, Petr Baudis wrote:
> That makes sense.  I still hope that with a much more aggressive 
> training schedule we could train a reasonable Go player, perhaps at
> the expense of worse scaling at very high elos...  (At least I feel 
> optimistic after discovering a stupid bug in my code.)

By the way, a trivial observation: the initial network is random, so
there's no point in using it for playing the first batch of games. It
won't do anything useful until it has run a learning pass on a bunch of
"win/loss" scored games and it can at least tell who is the likely
winner in the final position (even if it mostly won't be able to make
territory at first).

This suggests that bootstrapping probably wants 500k starting games with
just random moves.

FWIW, it does not seem easy to get the value part of the network to
converge in the dual-res architecture, even when taking the appropriate
steps (1% weighting on error, strong regularizer).
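A rough illustrative sketch of one way to read "1% weighting on error, strong regularizer": an AGZ-style combined loss with the value term scaled to 0.01 plus an explicit L2 penalty. The coefficients and helper below are assumptions, not the actual training settings.

import torch
import torch.nn.functional as F

VALUE_WEIGHT = 0.01   # down-weight the value-head error
L2_COEFF = 1e-4       # strong L2 regularizer (could also be optimizer weight decay)

def combined_loss(policy_logits, value_pred, target_policy, target_value, model):
    # cross-entropy against the MCTS visit distribution
    policy_loss = -(target_policy * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # mean squared error against the game outcome in [-1, 1]
    value_loss = F.mse_loss(value_pred.squeeze(-1), target_value)
    # explicit L2 penalty over all parameters
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return policy_loss + VALUE_WEIGHT * value_loss + L2_COEFF * l2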

-- 
GCP


------------------------------

Message: 3
Date: Thu, 26 Oct 2017 15:55:23 +0200
From: Roel van Engelen <gosubaduk at gmail.com>
To: computer-go at computer-go.org
Subject: Re: [Computer-go] Source code (Was: Reducing network size?
	(Was: AlphaGo Zero))
Message-ID:
	<CA+RUuO+hqs_rKnoVEQCDzV8oO=E4B9_S46JuOqAgND8Jg+4+sg at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

@Gian-Carlo Pascutto

Since training uses a ridiculous amount of computing power, I wonder if it
would be useful to make certain changes for future research, like training
the value head with multiple komi values <https://arxiv.org/pdf/1705.10701.pdf>
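A minimal sketch of what such a multi-komi value head could look like (PyTorch-style; the layer sizes, komi range, and class name are assumptions, not the paper's exact architecture):

import torch
import torch.nn as nn

# Hypothetical multi-komi value head: instead of one scalar win estimate,
# output one win probability per candidate komi value.

KOMI_VALUES = [k + 0.5 for k in range(-7, 8)]   # 15 komi settings, -6.5 .. 7.5

class MultiKomiValueHead(nn.Module):
    def __init__(self, channels=256, board_size=19):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)
        self.fc1 = nn.Linear(board_size * board_size, 256)
        self.fc2 = nn.Linear(256, len(KOMI_VALUES))   # one logit per komi

    def forward(self, x):
        h = torch.relu(self.conv(x)).flatten(1)
        h = torch.relu(self.fc1(h))
        return torch.sigmoid(self.fc2(h))             # P(win | komi_i) for each komi

Each output would then be trained with a per-komi binary cross-entropy against whether the final score beats that komi.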



