[Computer-go] GPU Go engine (re-sent with fixed quotes!!!)

Petr Baudis pasky at ucw.cz
Tue Jan 8 08:48:24 PST 2013


  Hi!

On Tue, Jan 08, 2013 at 04:34:15PM +0400, Alexander Kozlovsky wrote:
> Is it possible to completely ignore superko during playouts
> and detect it during UCT tree descent only, or maybe
> detect only relatively short superko cycles during playouts?
> I think the probability of a long superko cycle is very small
> and statistically its effect over a long series of random
> playouts will be insignificant.

  Short superko cycles might be enough, but this will need careful
debugging. (Hidden) superkos matter surprisingly often in endgame
positions, at least on 9x9, so guarding against superko in some way
is essential for good performance.
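
  For illustration, here is a rough sketch of the short-cycle check
in C; all names are made up, and it assumes the playout code already
maintains an incrementally updated Zobrist hash of the position:

#include <stdint.h>
#include <stdbool.h>

#define HIST_LEN 8   /* how many recent positions to remember */

struct superko_hist {
    uint64_t hash[HIST_LEN];  /* ring buffer of recent position hashes */
    int next;                 /* index where the next hash will be stored */
};

/* Returns true if the candidate position hash repeats one of the last
 * HIST_LEN positions, i.e. the move would close a short superko cycle. */
static bool is_short_superko(const struct superko_hist *h, uint64_t pos_hash)
{
    for (int i = 0; i < HIST_LEN; i++)
        if (h->hash[i] == pos_hash)
            return true;
    return false;
}

static void superko_hist_push(struct superko_hist *h, uint64_t pos_hash)
{
    h->hash[h->next] = pos_hash;
    h->next = (h->next + 1) % HIST_LEN;
}

A ring buffer of the last few hashes is cheap to scan; how far back
you need to look is exactly the part that needs the careful debugging.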

>  > > 2) What is a typical data size of structures used in modern
>  > > Monte-Carlo Go engines?
>  > > - size of UCT tree node;
>  >
>  >   You will need at least UCT and RAVE values (# of playouts
>  > and value or # of wins), move coordinates and say another
>  > extra byte for some flags, per child, and a pointer to its
>  > children list.
> 
> I have a plan to completely avoid pointers to child nodes.
> Maybe I don't fully understand something, but if each UCT
> tree node has a unique Zobrist hash and all nodes reside
> in a gigantic hash table, then the address of the hash table
> bucket with a specific child node can be calculated on the fly
> by deriving the child position's Zobrist hash from the parent's
> Zobrist hash. So, it is enough for each node to keep a relatively
> small header of node-specific information and an array of UCT
> values of child positions,

  This is usually called a transposition table. It has its pros and
cons: you save on child pointers and can merge transposed positions,
but you inherit all the usual hash table worries, mainly the fixed
capacity. I don't think there is a clear answer as to which is better.
But you must have good entropy in your hashes, as full collisions are
catastrophic.
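
  To make this concrete, here is a rough sketch of pointer-free child
addressing; the node layout, table size and array names are purely
illustrative assumptions, not any particular engine's format:

#include <stdint.h>

#define BOARD_POINTS (19 * 19)
#define TABLE_BITS   24                    /* 2^24 buckets, fixed capacity */
#define TABLE_SIZE   (1u << TABLE_BITS)

extern uint64_t zobrist[2][BOARD_POINTS];  /* [color][point], good entropy */

struct tt_node {
    uint64_t key;        /* full hash, so most collisions can be detected */
    uint32_t playouts;
    uint32_t wins;
    /* ... RAVE statistics, flags, etc. ... */
};

extern struct tt_node table[TABLE_SIZE];

/* The child position hash is derived from the parent hash on the fly,
 * so no child pointer needs to be stored in the node.  (Captured stones
 * would have to be XORed out as well.) */
static inline uint64_t child_hash(uint64_t parent_hash, int color, int point)
{
    return parent_hash ^ zobrist[color][point];
}

static inline struct tt_node *tt_lookup(uint64_t hash)
{
    return &table[hash & (TABLE_SIZE - 1)];
}

A power-of-two table size keeps the bucket computation a single mask,
which matters when you do it on every descent step.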

> where the index in the array corresponds to
> coordinates on the board.

  This seems somewhat wasteful, though; by the middlegame you will
have significant overhead, and you will want a way to skip unused
positions during iteration, which means yet more overhead.
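
  For comparison, the two per-node layouts might look something like
this (field names and sizes are illustrative only):

#include <stdint.h>

/* (a) coordinate-indexed: trivial addressing, but by middlegame a large
 *     fraction of the 361 slots is dead weight, and iteration has to
 *     skip the unused entries somehow. */
struct node_by_coord {
    uint16_t value[361];          /* child value indexed by board point */
};

/* (b) compact child list: only expanded moves are stored, so iteration
 *     touches live entries only, at the cost of storing the move
 *     coordinate explicitly. */
struct child_entry {
    uint16_t move;                /* board point of this child */
    uint16_t value;               /* quantized value */
    uint32_t playouts;
};

struct node_compact {
    uint16_t nchildren;
    struct child_entry child[];   /* flexible array, nchildren entries */
};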

> So, I think it is possible to keep the size
> of a UCT node under 1 KB for a 19x19 board, assuming each
> UCT value in the array of child UCT values fits in 2 bytes
> (I'm not sure whether 2 bytes is enough for a UCT value, but why not).

  With longer thinking times, even single-precision floats aren't
enough. Be careful here; typically, many tree branches have very
similar values.
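
  A tiny illustration of both issues (the numbers are made up just to
show the effect):

#include <stdio.h>

int main(void)
{
    /* 16-bit fixed point cannot tell apart values closer than 1/65535. */
    float a = 0.512345f, b = 0.512352f;
    unsigned qa = (unsigned)(a * 65535.0f + 0.5f);
    unsigned qb = (unsigned)(b * 65535.0f + 0.5f);
    printf("%u %u\n", qa, qb);      /* both 33577: the siblings collapse */

    /* Single-precision accumulation stops registering results at 2^24. */
    float wins = 16777216.0f;       /* 2^24 results already summed */
    wins += 1.0f;                   /* this win is silently lost */
    printf("%.1f\n", wins);         /* still 16777216.0 */
    return 0;
}

One common remedy is to store win and playout counts as integers and
compute the ratio only when it is actually needed.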

> This assumes some calculations during tree descent, but
> I think it is not critical, because most processor time
> will be spent not in tree descent, but in random playouts.

  I find that a surprising amount of CPU time is spent in the tree
descent in Pachi (due to cache misses and mathematical operations),
but with an efficient representation and superscalar evaluation you
should be able to push this down.

> Also, I want to do not
> one, but several playouts after each tree descent
> (I do not know whether this is a common practice), so
> tree walking time seems even more minor compared to
> playouts.

  This (or rather a variant of it) is sometimes called leaf
parallelisation. It has not been found to work well so far, but
I encourage you to do your own experiments.
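
  In case it helps, a minimal sketch of that loop; tree_descend(),
playout() and update_path() are assumed helpers here, not any
particular engine's API:

#define PLAYOUTS_PER_DESCENT 4

struct board { unsigned char stone[361]; int to_move; };  /* placeholder */
struct node;                                              /* opaque tree node */

extern struct node *tree_descend(struct node *root, struct board *b);
extern int playout(struct board *b);    /* 1 if the playout is a win, else 0 */
extern void update_path(struct node *leaf, int wins, int playouts);

static void run_iteration(struct node *root, const struct board *root_board)
{
    struct board b = *root_board;                /* working copy for the descent */
    struct node *leaf = tree_descend(root, &b);  /* one descent... */

    int wins = 0;
    for (int i = 0; i < PLAYOUTS_PER_DESCENT; i++) {
        struct board pb = b;                     /* ...several playouts from its leaf */
        wins += playout(&pb);
    }
    update_path(leaf, wins, PLAYOUTS_PER_DESCENT);  /* back up the aggregate */
}

The usual objection is that the extra playouts all sample the same
leaf, so each of them adds less information than spending it on a
separate descent would.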

> I want to store Zobrist hash values not in the 6 KB of shared memory
> used for playouts, but in the 64 KB of GPU constant memory,
> which has an 8 KB cache per SM and to me looks very suitable
> for Zobrist hash storage, because it seems optimized for both of

  Huh. Isn't the constant memory space shared with the shared memory?
(Or worse, offloaded at will to main memory.)

> By the way, I have read several times already about
> keeping edges in the array together with positions, to avoid
> analyzing the position index to determine whether it lies near
> the board edge, but I just don't understand this. Is it really so
> slow to check whether a position lies near the edge? As it seems
> to me, many board representations (such as bit masks) imply a
> much bigger performance hit compared to an edge check. At the
> least, I can have a small shared lookup table, which will tell
> for each index whether it lies near the edge, and near which
> edge exactly.

  I'm not sure anyone really uses bitmasks when there's no serious
memory pressure. I don't. I just talk about bits of information;
it's up to you whether you round that up to the nearest 2^n, n>2. :-)

  Having an edge frame allows us to avoid branching in many common
for-each-neighbor loops, for example when decreasing/increasing
liberty counts, updating cached information like 3x3 pattern codes,
and so on. Branching is expensive. But again, I encourage you to
experiment. :)
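
  A minimal sketch of what such an edge frame looks like (names are
illustrative):

#include <stdint.h>
#include <string.h>

#define N       19
#define STRIDE  (N + 2)                 /* one guard column/row on each side */
#define NPOINTS (STRIDE * STRIDE)

enum color { EMPTY, BLACK, WHITE, EDGE };

struct fboard {
    uint8_t color[NPOINTS];
};

static const int neighbor_off[4] = { -STRIDE, -1, +1, +STRIDE };

static void fboard_init(struct fboard *b)
{
    memset(b->color, EDGE, sizeof(b->color));
    for (int y = 1; y <= N; y++)
        for (int x = 1; x <= N; x++)
            b->color[y * STRIDE + x] = EMPTY;
}

/* Example: count empty neighbors of a point with no edge branching;
 * the guard cells simply never match EMPTY. */
static int empty_neighbors(const struct fboard *b, int point)
{
    int n = 0;
    for (int i = 0; i < 4; i++)
        n += (b->color[point + neighbor_off[i]] == EMPTY);
    return n;
}

The same framed indexing also keeps cached 3x3 pattern codes
well-defined at the border, with EDGE acting as just another colour.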

				Petr "Pasky" Baudis


