[Computer-go] Training the value network (a possibly more efficient approach)
gcp at sjeng.org
Wed Jan 11 05:58:46 PST 2017
On 11-01-17 14:33, Kensuke Matsuzaki wrote:
> I couldn't get positive experiment results on Ray.
> Rn's network structures for V and W are similar and share parameters;
> only the final convolutional layers are different.
> I trained Rn's network to minimize MSE of V(s) + W(s).
> It uses only KGS and GoGoD data sets, no self play with RL policy.
How do you get V(s) for those datasets? Do you play out the endgame
with Monte Carlo playouts?
I think one problem with this approach is that errors in the data for
V(s) are directly correlated with errors in the MC playouts. So a large
benefit of "mixing" the two (otherwise independent) evaluations is lost.
This problem doesn't exist when using raw W/L data from those datasets,
or when using SL/RL playouts. (But note that using the full engine to
produce games *would* suffer from the same correlation. That might be
entirely offset by the higher quality of the data, though.)
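
To make the correlation argument concrete, here is a toy simulation. It is
only a sketch and has nothing to do with the actual Ray/Rn code. The
assumptions are mine: both evaluations have zero-mean Gaussian errors with
unit variance, they are mixed 50/50, and `rho` is the correlation between
their errors. The function name `mixed_error_variance` is made up for this
illustration. Under those assumptions the variance of the mixed error is
(1 + rho) / 2, so mixing halves the variance when the errors are
independent and gains almost nothing when they are strongly correlated.

```python
import math
import random
import statistics


def mixed_error_variance(rho, n=20000, seed=0):
    """Empirical error variance of a 50/50 mix of two evaluations.

    Toy model (assumption): each evaluation has a unit-variance Gaussian
    error, and the two errors have correlation `rho`.
    """
    rng = random.Random(seed)
    mixed = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        err_value_net = z1
        # Build a second error with unit variance and correlation rho to the first.
        err_playout = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        mixed.append(0.5 * (err_value_net + err_playout))
    return statistics.pvariance(mixed)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        # The empirical value should be close to (1 + rho) / 2.
        print(rho, round(mixed_error_variance(rho), 3), (1.0 + rho) / 2.0)
```

With rho near 0 (raw W/L labels, or SL/RL playouts) the mix is clearly
better than either evaluation alone; with rho near 1 (V(s) labels produced
by the same MC playouts) it is hardly better at all.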
> But I have no idea about how to use V(s) or v(s) in MCTS.
V(s) seems potentially useful for handicap games where W(s) is no longer
accurate. I don't see any benefit over W(s) for even games.