[computer-go] Lockless hash table and other parallel search ideas

Rémi Coulom Remi.Coulom at univ-lille3.fr
Fri Mar 21 14:49:21 PDT 2008


Hi,

I have got a lockless hash table to work, and I thought I'd share the 
results.

A lockless hash table is very important, because the usual approach that 
consists in using a global lock during tree search and update does not 
scale well, especially on 9x9. But it is possible to create a completely 
lockless hash table data structure that works with multiple threads.

Here are some links that give some indications of how such a thing can 
be done:
http://video.google.com/videoplay?docid=2139967204534450862
http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html
http://www.cs.rug.nl/~wim/mechver/hashtable/index.html

Some of those links may look intimidating, but that is because the 
resize part of the algorithm is complicated. In my implementation, I do 
not resize the table, so it is very simple. Also, I update counter in 
each node with atomic increments and decrements (no need to lock).

Here is some preliminary experimental data for 9x9 up to 8 cores, 
running 840,000 playouts, from a tactical middle-game position:

(Cores / Playouts per second with spinlock / Playouts per second with 
lockless hash table)
1  22,477.9  22,447.9
2  37,769.8  43,076.9
3  55,888.2  60,825.5
4  61,448.4  79,470.2
5  64,665.1  95,346.2
6  62,407.1 110,092.0
7  66,508.3 130,638.0
8  59,196.6 146,341.0

BTW, using a pthread mutex is a lot worse than a spinlock, in my 
experience. I used the fair spinlock from the Linux kernel. But any 
implementation should work.

So, it is pretty cool. This was measured on only one run. Since it is 
not deterministic, performance may vary from one run to the other 
(especially since it does not always select the same best move). But it 
still clearly shows the superiority of the lockless hash table, and 
seems to indicate that it would still scale beyond 8 cores.

I believe I could improve further by reducing the number of atomic 
operations. Also, thinking about how to reduce atomic operations led me 
to imagine a scheme that may works as a distributed hash table over a 
network of PCs.

A simple scheme that would work on a single PC would consist in storing 
one counter per thread in the table. This way, it would not be necessary 
to use atomic operations to increment counters, and the cache coherency 
mechanism of the CPUs would handle sending data from core to core. The 
cost would be that in order to get the node counters, it would be 
necessary to add N values. Also, some information may arrive a little 
late to some threads (but I believe it is better to go run a playout 
rather than wait for data).

This scheme is a bit equivalent to using a separate hash table for each 
thread, and could be generalized to a distributed hash table over a 
network: each PC would have its own hash table, and each node of the 
tree would contain two counters: my_counter and other_counter. Every now 
and then, for instance when my_counter reaches a threshold, this PC 
would broadcast my_counter to the whole network, so that everybody 
updates other_counter.

I have not implemented this yet, but I will probably try it.

Right now, I will test the lockless hash table more, and will probably 
connect to 9x9 CGOS with that machine sometime during the week-end.

If you wish to implement your own lockless hash table, you should read 
Intel's documentation about its memory architecture. It can be found there:
http://www.intel.com/products/processor/manuals/index.htm
In particular, it is important to read "Architecture Memory Ordering 
White Paper", and about the lock prefix, the cmpxchg operation, sfence, 
lfence, and mfence.

The primitives I used in my algorithm are a store fence, atomic 
increment, atomic decrement, atomic compare and swap. If you understand 
what they do, you should be able to make your own lockless hash table.

Have fun,

Rémi


More information about the computer-go mailing list