[computer-go] An attempt to explain UCT for computer Go
Jeremy Mange
jeremy.mange at gmail.com
Thu Sep 21 11:06:06 PDT 2006
I saw this in Rémi Coulom's paper too; is there a reason you (and he) write:
-----Original Message-----
From: computer-go-bounces at computer-go.org
[mailto:computer-go-bounces at computer-go.org] On Behalf Of Magnus Persson
Sent: Thursday, September 21, 2006 1:52 PM
To: computer-go at computer-go.org
Subject: [computer-go] An attempt to explain UCT for computer Go
Quoting Arthur Cater <arthur.cater at ucd.ie>:
> What is "UCT-search", where can I find out about it?
You should also read this post to this list:
http://computer-go.org/pipermail/computer-go/2006-July/005737.html
This is a very short description of how Crazystone uses UCT. The paper
others
already mentioned presents UCT applied in a more general way without
references
to go.
The author Rémi Coulom discovered that you have to change a constant,
because in
game tree search the winrates measured at each node oscillate a lot,
which means
that one must explore alternatives much more than for applications
where you do
not have to take an opponent into account (who actively tries to lower the
winrate).
It might be a little hard to follow so here is an attempt to present UCT in
a
more programmer friendly fashion and less mathematical. Thus I do not follow
the exact terminology of those others, and I might in fact have
misunderstood
something, but nethertheless it works).
Basically the starting point is a program that plays random games and
evaluates
the position where no more moves can be played and it searches for moves
that
has the highest chance of winning. To do so it builds a search tree in
memory
by expanding it with one node for every simulation played. In each node the
Winrate is updated (Wins/Visits). A naive way of searching would be to
select
the node with the highest Winrate (a practical problem is to keep track of
black and white in alternating nodes amd assignes Wins and Losses
correctly).
This is what UCT does most of the time, but it also adds a number to the
Winrate that goes down everytime the node has been visited. The number goes
up
a little every time the parent was visited and this means that at some point
the WinRate + the value will become higher than for the best node how
will have
a value that goes down everytime it is visited.
So here is how I calculate the values to pick the next node/move in the tree
(moves are only random after the tree has been expanded with a new node.) If
there are children to a node that has not been visited then one of those are
picked at random and no UCTvalues has to be computed.
The UCTvalue for each candidate move is computed with
Winrate + sqrt((2*ln(ParentVisits))/(10*NodeVisits))
and the candidate move with highest UCTvalue is selected to be played next.
With UCT one can strike a good balance between searching the best move so
far
and exploring alternative moves. I think the innovation of UCT (to me
at least)
is the logarithm of the number of visits to the parent node for the
UCT-Value.
This value goes up quickly early but then it becomes flat, but never stops
rising. It goes to infinity although very very slowly. All moves will be
searched no matter how bad winrates they have, so that UCT-search given
close
to infinite time and memory will find any winning no matter how misleading
the
winrates are initially.
Note that I wrote "a winning move". It will not pick the "best" move
becuase the
algorithm does not backup score in a traditional way and there is no way of
defining what best here would be. It rather picks a move that has a high
winrate early in search and cannot be refuted as being a losing move. This
is
very evident on 7x7 where it often searches very deep in winning
positions, but
the winning sequence is often a little odd to me but still most of time good
enough to win with 0.5 points. The opponents losing move however are
searched
uniformly so the search approaches the ideal "1 x W x 1 x W x 1 x W x ..."
search tree.
My first searching MC-program searched in a completely opposite way. It
aggressively pruned bad moves in the tree and computed new Min/Max Winrates
based on what was left in the tree. As a result the Winrates at each node
oscillated so bad that sometimes the best move was pruned just because by
chance all other candidates happened to peak at the same time as the best
move
was at the bottom and my attempts to fix that was really awful.
UCT search is different. Nothing is forgotten so the winrates change slowly
as
soon as the number of simulations has grown large.
-Magnus
_______________________________________________
computer-go mailing list
computer-go at computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
More information about the computer-go
mailing list