[computer-go] Experiments with UCT
Rémi Coulom
Remi.Coulom at univ-lille3.fr
Wed Jul 26 01:19:10 PDT 2006
Yizao Wang wrote:
> The test we tried are different from Remi's. (His is interesting: '170 games are played against GNU Go 3.6 at level 10, from 85 different starting positions', does it say that there are games beginning from the middle of 'one game'?) We set MoGo play against gnugo3.6 with default level, each version of MoGo(or with different parameters) plays at least 100 games. Instead of giving the total time as a parameter, we set a certain time for each move when MoGo is thinking.
My starting positions have only one white stone and one black stone. I
used them only to ensure variety in the games. I used to run matches
from the empty board before, counting on the randomness of Crazy Stone
and GNU Go to avoid repeating games. All the games were different, but I
found that some sequences of moves were very frequent at the beginning.
I believe that starting from random positions ensures more variety and
may provides more significant data.
>
> We get also the results showing the improvement of level with more time given. Some results for MoGo 1.0.0, the version with which it played during the 16th KGS computer go tournament. (This version is very slow and obviously he is much weaker than CrazyStone using UCT.)
>
> 5sec around 30%
> 20sec around 40%
> 60sec around 50%
>
> However, with 120sec for each move, we have not seen any improvement. I do not know if Remi has tried some experiments with more times.
At 32 minutes per game, Crazy Stone takes about 2 minutes per move for
the first moves. I have also run mathes at 128 minutes per game before
Turin, and noticed that it improved over games in 32 minutes, but that
was not with UCT.
A point to take into consideration is that, with 100 games, uncertainty
margins are rather wide. Even with my 170 games it is not rare to obtain
an inferior result when doubling the thinking time. That is noise.
>
> Three conjectures are discussed:
> --120sec is not long enough to see another improvement of 60sec's behavior. Maybe... 5minutes?
> --Bugs in MoGo.
> --UCT's application to Go has some problems.
UCT is extremely selective, and that may be a problem. sqrt(log(t)),
which is the term that forces to select moves with a bad evaluation from
time to time, grows extremely slowly. The n-armed bandit algorithm UCT
is based on was designed for problems having stationary distributions,
which is not the case when searching recursively.
This being said, I found that the extreme selectivity of UCT works
rather nicely. It helps to search some lines of play very deeply.
By the way, I have though about the following idea: the log(t) term in
the square root is not even necessary to ensure convergence to the
optimal play. That's because we are using binary values and Go is
deterministic. A move is either a win or a loss. So in case the move
that is currently considered best is a loss, even if its estimated value
is so high that other moves are not searched at all, its value will
eventually go down, and a winning move will be searched, if there is
one. I found that this actually happens when observing games of Crazy
Stone. So, being extremely selective may be good.
It was not the case in my CG2006 paper, when I was using territory as
the evaluation. Using probability of winning instead of territory
completely changes the nature of the search. I believe that the
selectivity and backup operators I was using then were good for a
territory-based evaluation, but they are not any more with a
probability-of-winning-based evaluation, that allows a much higher
selectivity.
>
> The last one is interesting, since the algorithm of MoGo is slightly different from UCT at obtaining the final score(by random simulation), which might give a big bias that could be amplified during the update process in the tree from the deep level to the root.
Do you mean that you use a backup operator a bit like the one I
described in my CG2006 paper? My experiments with applying such an
operator have not been very sucessful so far, but I am sure it can be done.
I think an important aspect of designing a backup operator, that I had a
little overlooked at the time of my CG2006 paper, is that it is much
worse to over-estimate the value of a position than to under-estimate
it. Backing-up the average under-estimates the position, but it is not
so bad. Under-estimating the position means that the move that leads to
this position will be over-estimated, which means it will be searched
more, and the over-estimation will be eventually solved by searching
more. Over-estimating a position means that the move that leads to the
position is underestimated, so it may be searched less (almost not at
all with UCT), so the over-estimation may never be corrected, which may
lead to catastrophes.
Don's idea of forgetting early simulations seems good, and may work with
UCT to improve the backed up value. I will try it.
Rémi
More information about the computer-go
mailing list