[computer-go] producing a good probability distribution over legal moves

Brian Slesinsky brian at slesinsky.org
Thu May 17 08:17:29 PDT 2007


I think there is something to this; it seems like it should be
possible to use a database of randomly selected positions from games
along with the best known followup, and use that as a faster way of
testing a program's strength than playing full games.  Such a database
would be valuable for all sorts of approaches.

The question is how you know whether the program has discovered a
better move than the best known move, or just chose another move
that's just as good?  Probably it would have to be generalized into
keeping a score for each followup, and again the question becomes how
much to trust it and how to update it.  Perhaps the scores could be
kept from running a large number of simulations and we could test how
rapidly the program converges on that using fewer resources.

A weakness of this approach is that sometimes the best move depends on
how you plan to follow it up; a program that plays the theoretically
best move but doesn't know how to follow it up is weaker than a
program that plays safer moves.

- Brian

On 5/16/07, George Dahl <george.dahl at gmail.com> wrote:
> I find Monte-Carlo Go a fascinating avenue of research, but what pains
> me is that a huge number of simulations are performed each game and at
> the end of the game the results are thrown out.  So what I was
> thinking is that perhaps the knowledge generated by the simulations
> could be collapsed in some way.
>
> Suppose that epsilon greedy versions of a reasonably strong MC Go
> program were to play a very large number of games against themselves.
> By epsilon greedy versions I mean that with probability epsilon a
> random move is played and with probability 1- epsilon the move the MC
> Player would normally play is played.  Each position in these games
> would be stored along with the Monte Calro/UCT evaluation for that
> position's desirability.  This would produce an arbitrarily large
> database of position/score pairs.  At this point a general function
> approximator / learning algorithm (such as a neural network) could be
> trained to map positions to scores.  If this was successful, it would
> produce something that could very quickly (even a large neural net
> evaluation or what have you would be much faster than doing a large
> number of MC playouts) map positions to scores.  Obviously the scores
> would not be perfect since the monte carlo program did not play
> anywhere near perfect Go.  But this static evaluator could then be
> plugged back into the monte carlo player and used to bias the random
> playouts.  Wouldn't it be useful to be able to quickly estimate the MC
> score without doing any playouts?
>
> Clearly this idea could be extended recursively with a lot of offline
> training.  What makes this formulation more valuable is that given
> enough time and effort someone familiar with machine learning should
> be able to produce a learning architecture that can actually learn the
> MC scores.  It would be a straightforward, if potentially quite
> difficult, supervised learning task with effectively unlimited data
> since more could be generated at will.  Such a learning architecture
> could be used in the manner I described above or thrown at the more
> general reinforcement learning problem.
>
>
> Does anyone have any thoughts on this idea?  Does anyone know of it
> being tried before?
>
> - George
> _______________________________________________
> computer-go mailing list
> computer-go at computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
>
>


More information about the computer-go mailing list