[computer-go] producing a good probability distribution over
legal moves
George Dahl
george.dahl at gmail.com
Sat May 19 13:23:40 PDT 2007
Why does this pose a problem? Presumably the monte carlo evaluator
will give the same position a similar score assuming it has enough
time. This would just cause a duplicate training pattern, or two
training patterns with identical input and slightly different output.
I guess I don't quite understand what the issue would be.
- George
On 5/17/07, Daniel Burgos <dburgos at gmail.com> wrote:
> But it is very difficult that a board position is repeated between games. I
> don't see how you will use the training pairs in the new games.
>
> 2007/5/17, George Dahl < george.dahl at gmail.com>:
> >
> > What I am actually proposing is collapsing the results of the playouts
> > offline and then having a function that maps board positions to
> > playout values without actually doing playouts. So I would use an MC
> > player to generate a lot of training pairs of the form (position,
> > score) where position would be some sort of vector representation of
> > the board position and score would be a single scalar value that
> > corresponded to the value the Monte Carlo program decided after many
> > simulations that the position had. So I would create something that
> > learned a value function for positions. Then, once training was
> > complete, this function could be evaluated very quickly and hopefully
> > give the same results that, say, 100k playouts would give. But since
> > it was so fast, the program could use the extra time to actually do
> > more playouts. The difference would be that the new playouts would be
> > biased by the value function. So the probability of picking a move
> > would be proportional to the value function evaluated on the resulting
> > position. This way I would be bootstrapping a general global position
> > evaluator and using it to improve monte carlo simulations.
> >
> > Imagine if you had a monte carlo program that took almost no time to
> > run. You would use it to do "heavy" playouts for another monte carlo
> > program to make it even stronger.
> >
> > The reason this might be easier than learning from a database of
> > professional games is that it is easy to generate scalar scores of
> > positions with a monte carlo program. Presumably it is also easier to
> > learn how to play like a mediocre monte carlo program than like a
> > professional player.
> >
> > - George
> >
> >
> >
> > On 5/17/07, Zach Keatts <shjanzey at gmail.com> wrote:
> > > What you would have after your training/evaluator phase is a hueristic
> > > knowlege of possibly better montecarlo trees to consider. This will
> > > definitely cut down on the search space, but could also alienate a
> strong
> > > search path. I have been thinking along these same line for some
> time. The
> > > problem then lies in where you can decide what trees would be worth
> looking
> > > at initially. What about a database of professional games? Take the
> > > "winning" games as examples of strong searches that ended in a win. The
> > > problem is even more complex because where in the "winning tree" do you
> tell
> > > montecarlo to start searching? Will you assign a higher probability to
> each
> > > move in those games (defining a known probabilistically stronger
> predicted
> > > result)?
> > >
> > > That is one approach. The other approach is the purely simulated
> approach
> > > where you run simulations and gradually allow your probability function
> to
> > > evolve based on the results. Although this is a purer approach I think
> the
> > > aforementioned strategy may yield some useful experimentation. The
> strong
> > > point is that it will take advantage of the human brain's innate pattern
> > > recognition and calculation skills. Since we have recorded games we
> have
> > > plenty of examples of this "thought process". For a 9Dan winning game,
> > > those trees surely are worth investigating...
> > >
> > >
> > > On 5/16/07, George Dahl <george.dahl at gmail.com> wrote:
> > > >
> > > > I find Monte-Carlo Go a fascinating avenue of research, but what pains
> > > > me is that a huge number of simulations are performed each game and at
> > > > the end of the game the results are thrown out. So what I was
> > > > thinking is that perhaps the knowledge generated by the simulations
> > > > could be collapsed in some way.
> > > >
> > > > Suppose that epsilon greedy versions of a reasonably strong MC Go
> > > > program were to play a very large number of games against themselves.
> > > > By epsilon greedy versions I mean that with probability epsilon a
> > > > random move is played and with probability 1- epsilon the move the MC
> > > > Player would normally play is played. Each position in these games
> > > > would be stored along with the Monte Calro/UCT evaluation for that
> > > > position's desirability. This would produce an arbitrarily large
> > > > database of position/score pairs. At this point a general function
> > > > approximator / learning algorithm (such as a neural network) could be
> > > > trained to map positions to scores. If this was successful, it would
> > > > produce something that could very quickly (even a large neural net
> > > > evaluation or what have you would be much faster than doing a large
> > > > number of MC playouts) map positions to scores. Obviously the scores
> > > > would not be perfect since the monte carlo program did not play
> > > > anywhere near perfect Go. But this static evaluator could then be
> > > > plugged back into the monte carlo player and used to bias the random
> > > > playouts. Wouldn't it be useful to be able to quickly estimate the MC
> > > > score without doing any playouts?
> > > >
> > > > Clearly this idea could be extended recursively with a lot of offline
> > > > training. What makes this formulation more valuable is that given
> > > > enough time and effort someone familiar with machine learning should
> > > > be able to produce a learning architecture that can actually learn the
> > > > MC scores. It would be a straightforward, if potentially quite
> > > > difficult, supervised learning task with effectively unlimited data
> > > > since more could be generated at will. Such a learning architecture
> > > > could be used in the manner I described above or thrown at the more
> > > > general reinforcement learning problem.
> > > >
> > > >
> > > > Does anyone have any thoughts on this idea? Does anyone know of it
> > > > being tried before?
> > > >
> > > > - George
> > > > _______________________________________________
> > > > computer-go mailing list
> > > > computer-go at computer-go.org
> > > >
> http://www.computer-go.org/mailman/listinfo/computer-go/
> > > >
> > >
> > >
> > > _______________________________________________
> > > computer-go mailing list
> > > computer-go at computer-go.org
> > >
> http://www.computer-go.org/mailman/listinfo/computer-go/
> > >
> > _______________________________________________
> > computer-go mailing list
> > computer-go at computer-go.org
> > http://www.computer-go.org/mailman/listinfo/computer-go/
> >
>
>
> _______________________________________________
> computer-go mailing list
> computer-go at computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
>
More information about the computer-go
mailing list