[computer-go] Pay-as-you-go cluster?

Chrilly c.donninger at wavenet.at
Sun Oct 1 11:31:59 PDT 2006


Hydra is a distributed cluster consisting of 32 Dual-Nodes, which are 
connected by a Myrnet-network.  We use MPI (Message Passing Interface) for 
the paralleziation of the code. MPI is relative straightforward to use. I 
think the question is not shared memory or distributed cluster. The question 
is if the algorithm is scalable. In case of Hydra a shared memory 
64-processor cluster would not help much. The main problem is starvation. In 
Alpha-Beta one uses the so called Young-Brothers-Wait concept. The first 
move in each node must be computed before the siblings/brothers are 
distributed. Without Young-Brothers one solves the starvation problem, but 
gets a massive parallel overhead. Moves are searched in parallel which would 
have never been considered on a single processor. The Young-Brothers concept 
introduces a non-parallelizable part in the search. Amdahls law puts an 
upper bound to the parallel speedup. MPI makes it a little bit worse, 
because in Hydra the minimum height of a subtree must be 7 Plies. 
Distributing a smaller subtree does not pay. It takes longer to distribute 
and collect the result than to compute the subtree locally. Another inherent 
problem of computer chess is, that the number of nodes in e.g. a 10 Ply 
subtree can differ by a factor of 10**4.  It happens frequently that result 
of all the distributed siblings are long available, but 1 move is pending. 
In this case all the siblings have to wait for the result. This problem is 
somewhat solved by dynamic work-balancing. The processor with the remaining 
subtree can split up the tree to the waiting siblings or the processors 
working on the siblings can search in other parts of the tree for new work. 
But it is still a bottleneck.

As said already above: The question is not shared or distributed memory. The 
question is the structure of the algorithm.

For massive parallel systems there exists also no real shared memory. 
Massive parallel systems with shared memory are NUMA (Non-Uniform 
Memory-Access) Machines. The shared memory is on the physical level a 
network similar to the Hydra Cluster. The shared memory is an illusion for 
the the programmer. This makes things usually worse, because the programmer 
is not aware of the hidden costs. A memory access to a remote node can be by 
a factor of 10**3 more expensive than a accessing local memory. In case of 
MPI one knows that message passing is an expensive operation. E.g. In Hydra 
Message Passing is only done, when the CPU waits on the PCI-Bus on the 
result of the FPGA-chess card. In this case message passing is almost free. 
Only the latency puts an restriction on the height of the subtree.

Chrilly 



More information about the computer-go mailing list