[computer-go] Improvement of UCT search algorithm
Don Dailey
drd at mit.edu
Thu Oct 5 07:32:26 PDT 2006
Łukasz,
I have used update rules like you propose and could not find a
significant difference in strength. That's doesn't mean it doesn't
work, it just means that I failed to make the general idea work.
One thing that I did do which seems to be a small improvement is a very
similar idea, probably an almost equivalent idea - instead of updating
according to your rule I degrade the samples by a tiny percentage. I
do this whenever I update a new node, I degrade all the siblings at the
same time. Essentially I multiply the samples and the results by a
constant very close to 1.0 such as 0.9999999
This has the same effect as your suggestion, it causes later simulations
to have more impact than earlier ones, which is a desirable feature.
But I'm not sure it matters. I have noticed that when the program
changes it's mind about which move to play, the score of the parent
node quickly converges to the score of the child move, regardless of
whether I'm using the degrade constant or not. My hope is that the
degrade constant just makes this happen a little faster and the net
effect as it's passed up the tree is a search that is noticeably more
responsive at finding good moves.
I have found it's very difficult to test ideas like this. I can fiddle
around with constants and formula's till I'm blue, but there is no way
to quickly determine if I've actually made an improvement.
I run a few problem positions to see if I can get them to find the
correct move more quickly - but the problems respond differently. I can
set things a certain way which makes one problem respond more quickly,
but this hurts another problem. With monte carlo you have to run the
problems a large number of times in order to get reliable timings.
Not only that, but I learned from my computer chess experience that
problem solution times do not correlate well to playing strength. Yes,
there is a general correlation, but it is not strong enough to base
playing strength decisions on. And this effect is probably even more
pronounced in GO than in chess, making problem solution times only a
vague estimate of playing strength.
Having said that, I have one interesting test that I do, which I take
with a grain of salt, but I use as a first guess estimate. I search
from the opening position a few hundred times and average the time
required to find the move "e5." My assumption is that "e5" is the
best move (the program always settles on it eventually if I let it think
long enough.)
The standard deviation is affected by different parameters too. I
haven't done a serious study yet, but there are some variations in
parameters that find the key moves in problems (or 'e5' in the opening)
very rapidly on occasion, but also take very long on occasion. I think
it's likely that the best playing algorithms should minimize the
standard deviation. I am interested in looking to see if some
parameter settings have a funny effect on the probability distribution,
perhaps decreasing the mean solution time but increasing the worst case
times. Too much to consider and this can be extremely time-consuming.
On top of all of that of course is that this is probably not highly
correlated to playing strength.
I have some empirical evidence that the parameters should change
depending on the stage of the game too. When will I get time to test
that?
If I had several hundred computers, I could probably find 100 ELO
points or more testing ideas in just a few days.
- Don
On Thu, 2006-10-05 at 14:04 +0200, Łukasz Lew wrote:
> I guess MoGoxyz and Valkyria_UCTx are already doing it, but I would
> like to propose one simple improvement for UCT.
>
> In each node UCT maintains (implicitly or explicitly) average of
> playout results.
> An equation for updating it explicitly could be:
>
> avg := (1 - a')*avg + a' * playout_result
>
> where a' is learning rate:
>
> a' = 1/n
>
> One may obtain better results if he fixes a' at some point instead of
> allowing it to converge:
>
> 'a = max (epsilon, 1/n)
>
> Of course this way avg will not converge to any value, but it is the
> way it should be because playout_results are not generated from
> stationary distribution as in bandit problem.
>
> This is especially important when some good refutation is found late
> in the search and n is large so the conventional averaging will take a
> lot time to "notice" that the refutation is found.
>
> Of course there are better ways to deal with 'a, but firstly I would
> like to hear Your opinion.
>
> BTW
> Can we hear what are the enhancements of Valkyria_UCT6 / new Mogo?
>
> BTW2
> CGOS pairing algorithm rarely pair strong programs against themselves,
> especially when there is a bunch of very weak players urging to play
> with them. Maybe we could shut down some of them (with rating < 800
> and #games > 10000)?
>
> Lukasz Lew
> _______________________________________________
> computer-go mailing list
> computer-go at computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
More information about the computer-go
mailing list