[computer-go] Monte Carlo combined with minimax search
Don Dailey
drd at mit.edu
Sat Jul 22 07:50:58 PDT 2006
The current version of Lazarus, I believe has evolved to something very
similar to Ant Colony Optimization. I'm of course struggling with the
same problems everyone doing this kind of thing is dealing with - how to
effectively mini-max values up the tree when low sampling rates give
unreliable scoring information.
I'm using something like "pheromone evaporation" to filter out the noise
of irrelevant moves but in a gradual way, as it's discovered that the
move is actually irrelevant. The behavior is that effectively all
moves get sampled initially, and then it gradually focuses on the best
move.
It appears to be working quite well despite the fact that I have not
been able to tune it very well yet and do not have a very fast computer.
There are 3 or 4 parameters that influence how strong the program plays
and I don't yet have a good sense of how to set them. But I did
construct a few problems and spent a little time experimenting with the
parameters.
I proved empirically that the program effectively mini-max's a tree
although this is not explicitly coded. I have a basic problem where the
best move looks like the worst move to the program initially, until key
moves are found deeper in the tree.
It's fairly impressive to watch it discover this move, then watch the
score climb rather quickly as it focuses on this move. Previous
versions of Lazarus behaved differently when discovering the right
move.
There is another test I do, would like to get peoples opinion on this:
For 9x9 Go, I notice that from the opening position, E5 is a heavy
favorite. Any of my scalable programs will ALWAYS play E5 if I wait
long enough. Even AnchorMan would play E5 and hold it given enough
simulations.
So my assumption is that E5 is the best move. I experimented by running
the opening position to see how long it takes to find E5 given various
settings (of course I do many trials and take the average.) Certain
parameter combinations make this happen much quicker than others. In
the runs I require E5 to "stick" as the best move because sometimes it
will oscillate between a few good moves.
I view this as a "seat of the pants" feel for parameters that I think
are relatively good. I then test with a couple of interesting problem
positions to see if they are sensitive to the parameters.
I am curious about how the other programmers (who have to struggle with
parameter settings) deal with this. Especially the ones doing
Monte/Carlo with tree's.
One other observation - pure alpha/beta (where the evaluation function
is a monte/carlo simulation) isn't all that bad, but it doesn't seem to
be quite as good as other ways of building tree's. It's seems like a
huge waste not to be able to get beta cutoff's, a huge tree saving
device.
There must be some way to salvage this situation but I don't see a way.
The techniques we are now talking about use the information that beta
cutoff's throw away.
- Don
On Sat, 2006-07-22 at 15:53 +0200, Chaslot G (MICC) wrote:
> > It sounds like there are two major details to address: how are
> values backed up the tree and how do you choose a move (balancing
> exploration and exploitation)?
>
>
>
> Hi Peter,
>
>
>
> In this context, I propose one algorithm for move-selection and for
> back-up operation. It can be found in this paper:
>
> http://www.cs.unimaas.nl/g.chaslot/mcscg.pdf
>
>
>
> Back up operation:
>
>
>
> Remi proposed a rule that makes some transition from the mean value to
> the max. In this rule the weight of the mean only depends on the
> number of simulations made.
>
> Indeed, it is clear that in the beginning the backed-up value should
> be the mean, and that after a certain time, it should be the max
> value.
>
> However, the number of simulations made is not always relevant to
> build the transition from one to the other.
>
> For instance, as soon as a move is proved to be better than all the
> other moves, the back-up rule should be equivalent to a max, whatever
> the number of simulations made.
>
> Another example is that if there are some very bad moves, they will
> low down the mean value. Then it would be better to compute a “mean”
> which considers only the good moves.
>
> The algorithm that I propose realizes a transition between the mean
> and the max, which takes into account the number of games played, and
> also a “confidence” value for each move. In particular, the rule I
> propose deals well with the two examples above.
>
>
>
> Move-Selection:
>
> The selection algorithm proposed seems to be more exploitative than
> Remi’s, but more explorative than UCT. I will do some comparisons
> soon, on the same kind of test bed proposed in the paper, but for a
> deeper depth.
>
> Given a selection algorithm, one can easily make it more or less
> exploitative, for instance by using a softmax on top of it (several
> other methods are possible). But until which point this method would
> enhance the level of the program?
>
>
>
> I am currently working on the integration of domain-dependant
> knowledge into the Monte-Carlo games (patterns mostly, and also some
> Atari heuristic like these proposed by Remi). After that I will do
> some experiments on increasing or decreasing the
> exploration-exploitation of different move-selection algorithms. I
> will also be working on improving the ”confidence value” used by the
> back up strategy.
>
>
>
> Best,
>
>
>
> Guillaume
>
>
>
>
>
> -----Original Message-----
> From: computer-go-bounces at computer-go.org
> [mailto:computer-go-bounces at computer-go.org] On Behalf Of Peter Drake
> Sent: Tuesday, July 18, 2006 8:55 PM
> To: computer-go
> Subject: Re: [computer-go] Monte Carlo combined with minimax search
>
>
>
> It sounds like there are two major details to address: how are values
> backed up the tree and how do you choose a move (balancing exploration
> and exploitation)?
>
> As Coulom notes, a reasonable answer to the first question is to take
> the average of the mean and the max of the node’s children. (Mean
> alone ignores the intelligent choice to be made, while max alone
> chooses the luckiest move.) If more and more runs are spent on the
> best move, this approaches max as confidence improves.
>
> The second problem is a variant of the k-armed bandit problem. Several
> relatively simple approaches exist, including softmax. There is a
> question of what value to give unexplored branches—perhaps the average
> of the explored branches?
>
> Peter Drake
>
>
> Assistant Professor of Computer Science
>
>
> Lewis & Clark College
>
>
> http://www.lclark.edu/~drake/
>
>
>
>
>
> _______________________________________________
> computer-go mailing list
> computer-go at computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
More information about the computer-go
mailing list