[Computer-go] GPU Go engine

Alexander Kozlovsky alexander.kozlovsky at gmail.com
Tue Jan 8 04:05:26 PST 2013

On Tue, Jan 8, 2013 at 4:12 AM, Petr Baudis <pasky at ucw.cz> wrote:

>   Hi!

Thanks for response! Some more questions,
I marked them with @@@ below

> - size of board representation;

>   The bare minimum of course is two bits per point. Usually,
> an extra edge frame (or half-frame) is added so that you
> do not have to special-case border areas.
>   Then again, for efficient implementation, you will also
> need at least some representation of stone chains and
> chain membership information for each point.

Bit masks seems slow to me, I came up with a very effective
board representation, which fit all board information into three
arrays of integer indexes, plus several fields for current player
color, ko status, etc. I cannot tell full details right now, but
this arrays keeps some intricate linked lists, and allows
to store information about each board intersection, and also
list of stone chains together with count of real liberties
for each chain. This board representation allows very fast
modification during random playouts (because all information
is stored in linked lists), and is relatively small (256 bytes
for 9x9, 512 bytes for 13x13 and around 2200 byte for 19x19,
because 19x19 have more then 256 positions and hence
16-bit offsets are keeped inside arrays instead of 8-bit offsets).
Also, this representation allows pre-sorting of array of free
positions according to some RAVE information and then
random-select "useful" moves during playout more often
then "not useful".

I'll tell details of this representation If I don't manage to write
functional Go engine in close time. Probably compactness
of this structure is not so important when CPU is used,
but I like it may be updated very fast.

Then, you'll need to guard against superko by having

a hash code per point+color and hash map of encountered



Is it possible to completely ignore superko during playouts
and detect it during UTC tree descent only, or maybe,
detect only relatively short superko cycles during playouts?
I think that probability of long superko cycle during
random game are very small and statistically its effect
during long series of random playout will be insignificant.

> - size of patterns library used for playout;
>   A bare minimum for good computer player is 3x3 patterns
> around last move, two bits per point, i.e. 16 bits per pattern.
> (Of course, you can use more bits or slightly larger patterns,
> or use decision tree instead of pattern matching.) You can
> extend this to much larger patterns, more moves, more features
> and so on, but that's not so crucial. You'd have about

8-16 such patterns (up to isomorphism).

I suspect strength of random playout may greatly depend on
good patterns, and so my initial plan is in using 3x3 and 5x5
patterns where each pattern position can have one of eleven
Zobrist hash value: empty, border, ko, or enemy/friend stones
with 1, 2, 3, >3 liberties

I plan to squeeze 100000 patterns in 512KB the next way: patterns
are stored in hash table with bucket size 32 byte, each bucket
can contain up to eight 4-byte patterns. Each pattern have 40-bit
Zobrist hash (I hope it is enough). First 14 bit of Zobrist hash are
used to detect bucket in hash table, and then 32 byte of bucket
are readed and 8 stored patterns are checked. First 26 bits of each
pattern contains second part of 40-bit Zobrist hash, and the last
6 bit contain pattern description (is it useful pattern or antipattern,
is it a ko-threat, last-game move, move which is useful during
semeai, should opponent tenuki after this move, etc.)

> 2) What is a typical data size of structures used in modern
> > Monte-Carlo Go engines?
> > - size of UCT tree node;
>   You will need at least UCT and RAVE values (#of playouts
> and value or #of wins), move coordinates and say another
> extra byte for some flags, per child, and pointer to its

children list.

I have a plan to completely avoid pointers to child nodes.
Maybe I don't fully understand something, but if each UCT
tree node have unique Zobrist hash and all nodes reside
in gigantic hash table, then address of hash table bucket
with specific child node can be calculated on the fly by
calculating Zobrist hash of child position from parent Zobrist
hash. So, it is enough for each node to keep relatively small
header of node-specific information and array of UTCValues
of chid positions, where index in the array corresponds to
coordinates on the board. During tree descent I can select
max UTC value of child node from array, restore child move
coordinates from position in array, calculate Zobrist hash
of child position and jump to appropriate bucket of hash table
(using first part of calculated Zobrist hash as bucket number),
then search node inside bucket using second part of hash.
Array of UTC values will allow me to minimize touching of
main memory during descent by selecting single child with
maximum UTC value. So, I think is is possible to keep size
of UTC node less then 1KB for 19x19 board assuming each
UTC value in array of child UTC values fit in 2 bytes
(I don't sure is 2 bytes is enough for rounded UTC value,
but why not. Exact UTC value will be stored in child node itself).
This assume some calculations during tree descent, but
I think it is not critical, because I think most processor time
will be spend not in tree descent, but in random playouts.
While one warp will wait child UTC node from the main memory,
other warps (of total 8) will use the same streaming multiprocessor
for playouts.

> Each of playout will use single warp (32 GPU thread) and typically
> > will use only first thread of warp, but for several task, such as
> Zobrist hash calculation, pattern matching and relatively good
> next random playout move selection, all warp threads
> > will be used.
>   You should also be able to fully utilize all threads for tree node
> selection. On the other hand, I worry a little about memory latency
> when walking the tree.

Since my UTC node will have array of child's UTCValues,
I'll plan to use full warp of thread to fast array search.
Regarding parallel access to child nodes, I want to
touch main GPU memory as little as possible both because
it is slow (100x of shared memory as I hear) and because
I don't want L2 cache degradation, so I'll plan to touch
only one child node at each descent step.

As I write above, I don't sure full utilization of threads during
tree descent is important, because tree descent seems mostly
bounded by slow main memory access time rather then
calculation speed, and also seems relatively small comparing
to the time of heavy random playout. Also, I want to do not
not one, but several playouts after each tree descent
(I do not know whether this is a common practice), so
tree walking time seems even more minor compared to

The trouble here is the superko detection I think, you
would need some clever ways to cheat here. If you
> don't care about 9x9 strength but just about large boards

Again, is it really so important to detect long superko cycles
during playout on 9x9?

> > I suspect light playout for 13x13 can fit in 1 kilobyte of memory,
>   Zobrist hashes alone would occupy 2352 bytes: (14*14)*3*4. Maybe
> 2028 if you forego the edge frame (but you'll have to do expensive
> index recomputation).

I want to store Zobrist hash values not in 6 KB of shared memory
used for playout, but in 64 KB of GPU constant memory,
which have 8 KB cache per SM and to me looks very suitable
for Zobrist hash storage, because it seems optimized for both of
potential use cases:

1) When I random-selected 32 candidate moves for playout,
and the want to check pattern matching for each candidate,
each warp thread will simultaneously calculate Zobrist hash
for pattern corresponded to this thread's candidate move,
and at each step all threads will request the same hash
value from the constant memory (say "Zobrist-hash for
upper-left corner of pattern if this place is empty").

2) When calculating Zobrist hash of whole position, different
threads in warps will request hash values from constant memory
which are often aligned near each other

I don't examine full details of GPU constant memory yet,
but from what I know it will be very efficient for such task,
and precious 6KB of shared memory can be dedicated
solely for board representation and some local variables


By the way, I have read several times already about
keeping edges inside array together with positions to avoid
analyze of position index to determine if it is lie near board,
but I just don't understand this. Is it really so slow to check
if position lie near board? As it seems to me, many board
representation (such as bit masks) imply much more
performance hit comparing to edge check. At the least,
I can have a small shared lookup table, which will tell
for each index is it lie near the edge, and near which
edge exactly

>   And this is 13x13. I think engines that can't work on 19x19 aren't
> really very exciting nowadays, maybe that's just me.

My very efficient (as I hope) board representation takes
256 bytes for 9x9, 512 byte for 13x13, but whole enormous
2200 bytes for 19x19, so my first plan is to avoid 19x19
completely and only aim at domination for 9x9 and 13x13 ;)
(at least for the first time, to comfortably reside in 6KB of
available memory)

  You don't need much memory for ladder evaluation, I think;
> you need to store move stack only in case of branches,
> which are rare when checking ladders.


Even if branches are rare, I probably should implement branch
analysis at least 1-2 levels deep, or are you think branches are
so rare this is unnecessary? I pan to have some kind of stack
inside 6KB and just clone whole 512-byte board representation
(for 13x13) on the top of this "stack" when branch occured,
and then perform atari series to the end to see if ladder is successful.

> - Hash table with pattern library must fit in 512 kilobyte of GPU
> L2 cache (with total L2 cache size 768 kilobyte)
>   You will also want the cache to hold some popular parts of the tree,
> a "base" board with the current position, etc.

I'll have the rest 256 kilobytes of cache for this

> > I hope single UCT node will take around 512 byte for
> 13x13 board, so GPU memory may hold up to 2 million
> UCT nodes.

> I think this shouldn't be a severe strength limit; if you

implement some (not necessarily even very aggressive)

pruning and store your nodes in an efficient manner, you

shouldn't have trouble.

Glad to hear this, thanks much for the info

> If you ever get bored of GPU Go, I think the hardware future lies

in two directions:
>   (i) Xeon Phi (Larrabee). I'm so envious of any Go programmers

that get their hands on this beauty first.

I was very excite when I hear about Xeon Phi some time ago,
but I suspect it will be a bit pricey (at least at the first), and I
already have two GTX 580, so for a while I'll plan to play with it :)

Thanks again for very interesting comments!

> Computer-go mailing list
> Computer-go at dvandva.org
> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://computer-go.org/pipermail/computer-go/attachments/20130108/bac1233f/attachment.html>

More information about the Computer-go mailing list