[computer-go] Improvement of UCT search algorithm

Łukasz Lew lukasz.lew at gmail.com
Sun Oct 8 06:37:42 PDT 2006


I would like to add that in my experiments with a small number of
playouts ~30000, without all as first heuristics, the common situation
was that a bot explored the tree more or less properly, but in the end
it finished with a wrong move, because last refutation found didn't
manage to make a significant change to the root decision.

How do You deal with this problem?
How do You choose the final move?

Lukasz Lew

On 10/5/06, Don Dailey <drd at mit.edu> wrote:
> Łukasz,
>
> I have used update rules like you propose and could not find a
> significant difference in strength.   That's doesn't mean it doesn't
> work, it just means that I failed to make the general idea work.
>
> One thing that I did do which seems to be a small improvement is a very
> similar idea, probably an almost equivalent idea - instead of updating
> according to your rule I degrade the samples by a tiny percentage.   I
> do this whenever I update a new node, I degrade all the siblings at the
> same time.  Essentially I multiply the samples and the results by a
> constant very close to 1.0 such as 0.9999999
>
> This has the same effect as your suggestion, it causes later simulations
> to have more impact than earlier ones, which is a desirable feature.
>
> But I'm not sure it matters.  I have noticed that when the program
> changes it's mind about which move to play,  the score of the parent
> node quickly converges to the score of the child move,  regardless of
> whether I'm using the degrade constant or not.   My hope is that the
> degrade constant just makes this happen a little faster and the net
> effect as it's passed up the tree is a search that is noticeably more
> responsive at finding good moves.
>
> I have found it's very difficult to test ideas like this.  I can fiddle
> around with constants and formula's till I'm blue, but there is no way
> to quickly determine if I've actually made an improvement.
>
> I run a few problem positions to see if I can get them to find the
> correct move more quickly - but the problems respond differently.  I can
> set things a certain way which makes one problem respond more quickly,
> but this hurts another problem.  With monte carlo you have to run the
> problems a large number of times in order to get reliable timings.
>
> Not only that, but I learned from my computer chess experience that
> problem solution times do not correlate well to playing strength.   Yes,
> there is a general correlation, but it is not strong enough to base
> playing strength decisions on.   And this effect is probably even more
> pronounced in GO than in chess,  making problem solution times only a
> vague estimate of playing strength.
>
> Having said that,  I have one interesting test that I do, which I take
> with a grain of salt, but I use as a first guess estimate.  I search
> from the opening position a few hundred times and average the time
> required to find the move "e5."    My assumption is that "e5" is the
> best move (the program always settles on it eventually if I let it think
> long enough.)
>
> The standard deviation is affected by different parameters too.  I
> haven't done a serious study yet, but there are some variations in
> parameters that find the key moves in problems (or 'e5' in the opening)
> very rapidly on occasion,  but also take very long on occasion.  I think
> it's likely that the best playing algorithms should minimize the
> standard deviation.   I am interested in looking to see if some
> parameter settings have a funny effect on the probability distribution,
> perhaps decreasing the mean solution time but increasing the worst case
> times.   Too much to consider and this can be extremely time-consuming.
> On top of all of that of course is that this is probably not highly
> correlated to playing strength.
>
> I have some empirical evidence that the parameters should change
> depending on the stage of the game too.  When will I get time to test
> that?
>
> If I had several hundred computers,  I could probably find 100 ELO
> points or more testing ideas in just a few days.
>
> - Don
>
>
>
> On Thu, 2006-10-05 at 14:04 +0200, Łukasz Lew wrote:
> > I guess MoGoxyz and Valkyria_UCTx are already doing it, but I would
> > like to propose one simple improvement for UCT.
> >
> > In each node UCT maintains (implicitly or explicitly) average of
> > playout results.
> > An equation for updating it explicitly could be:
> >
> > avg := (1 - a')*avg + a' * playout_result
> >
> > where a' is learning rate:
> >
> > a' = 1/n
> >
> > One may obtain better results if he fixes a' at some point instead of
> > allowing it to converge:
> >
> > 'a = max (epsilon, 1/n)
> >
> > Of course this way avg will not converge to any value, but it is the
> > way it should be because playout_results are not generated from
> > stationary distribution as in bandit problem.
> >
> > This is especially important when some good refutation is found late
> > in the search and n is large so the conventional averaging will take a
> > lot time to "notice" that the refutation is found.
> >
> > Of course there are better ways to deal with 'a, but firstly I would
> > like to hear Your opinion.
> >
> > BTW
> > Can we hear what are the enhancements of Valkyria_UCT6 / new Mogo?
> >
> > BTW2
> > CGOS pairing algorithm rarely pair strong programs against themselves,
> > especially when there is a bunch of very weak players urging to play
> > with them. Maybe we could shut down some of them (with rating < 800
> > and #games > 10000)?
> >
> > Lukasz Lew
> > _______________________________________________
> > computer-go mailing list
> > computer-go at computer-go.org
> > http://www.computer-go.org/mailman/listinfo/computer-go/
>
> _______________________________________________
> computer-go mailing list
> computer-go at computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
>


More information about the computer-go mailing list