[Computer-go] Fuego 04 native Windows
René van de Veerdonk
rene.vandeveerdonk at gmail.com
Wed Jun 16 17:09:17 PDT 2010
Thanks for your answer, it makes much more sense to me now.
We are using pipelining in different ways. When I referred to it for a
CPU-based single-threaded application, I was thinking about speculative
execution. If I understand it correctly, that does not exist in FPGAs, as
these are advertised as deterministic in their execution and process flow.
In the FPGA case, I imagine that pipelining refers to "unrolling the
program", and having different boards physically move across the chip from
module to module, as if they are on a production line, all in various states
of simulation (board #11 at module #101: black to move; board #12 at module #100:
white to move; etc.).
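
To make the "production line" picture concrete, here is a minimal software sketch of it. It is only my mental model, not a description of the actual FPGA design: the function name `run_pipeline`, the one-module-per-move layout, and the one-board-seeded-per-cycle schedule are all assumptions made for illustration.

```python
def run_pipeline(num_stages, num_boards):
    """Push boards through a fixed-length line of modules, one module per cycle.

    Returns a list of (board_id, cycle) pairs, where cycle is the cycle count
    at which the board left the last module. Hypothetical model only.
    """
    stages = [None] * num_stages  # stages[i] = id of the board at module i
    finished = []
    next_board = 0
    cycle = 0
    while len(finished) < num_boards:
        # Harvest the board sitting in the last module, if there is one.
        if stages[-1] is not None:
            finished.append((stages[-1], cycle))
        # Every remaining board moves one module down the line.
        stages = [None] + stages[:-1]
        # Seed a fresh board at the first module while any are left.
        if next_board < num_boards:
            stages[0] = next_board
            next_board += 1
        cycle += 1
    return finished


if __name__ == "__main__":
    for board_id, cycle in run_pipeline(num_stages=3, num_boards=5):
        print(f"board {board_id} finished at cycle {cycle}")
```

In this model the first board needs `num_stages` cycles to come out, and after that one board completes every cycle, which is why latency for a single game and throughput for many games can differ so much.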
How you have designed your program in detail would be an interesting read,
there are a lot of high-level design trade-offs that you must have dealt
with. These will be very different from how you would do it for a CPU-based
program. One difference that I imagine, for instance, is the length of the
simulation. A CPU-based program stops when the game ends (or you exceed some
limit, or you force an early decision, or ...), whereas for FPGA you may end
up with a fixed game-length (ready or not, i.e., no early out option) and
you may have to simulate pass moves until you reach the end of the
"production line" in case the game ended early (is this what you do?). In
any case, your impressive numbers suggest that this can be done very
efficiently. How you harness all this raw simulation power in a tree-search
is yet another research topic that is very interesting and almost
orthogonal. Do you think your approach could be mapped to a GPU as well? In
any case, I hope you will make a pre-print available to this list when the
time is there.
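
The fixed game-length idea above (no early out, pass moves until the end of the line) could look roughly like this in software. This is a sketch under loud assumptions: the "game" is a toy stand-in that just fills empty points at random and is not Go, and `fixed_length_playout` and `PASS` are hypothetical names, not anything from the FPGA design.

```python
import random

PASS = "pass"  # hypothetical marker for a pass move


def fixed_length_playout(num_points, length, rng):
    """Run a toy playout for exactly `length` steps.

    While empty points remain, a random one is played; once the toy game is
    over, the remaining steps are padded with PASS so that every playout
    occupies the same number of pipeline slots.
    """
    empty = list(range(num_points))
    moves = []
    for _ in range(length):
        if empty:
            # Pick and remove a random empty point (stand-in for a real move).
            moves.append(empty.pop(rng.randrange(len(empty))))
        else:
            # Game already over: keep the slot busy with a pass.
            moves.append(PASS)
    return moves


if __name__ == "__main__":
    print(fixed_length_playout(num_points=5, length=8, rng=random.Random(0)))
```

The cost of this scheme is the wasted slots after an early finish, so the fixed length would presumably be tuned against the typical game length.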
In another response in this thread, you mention that you are simulating 167
boards in parallel. Does that mean that you unrolled your program for 167
moves, moving a board between 167 separate modules every "cycle" and
seeding/harvesting one complete board per "cycle"? Or do you have multiple
(shorter) production lines in parallel? Or something else entirely?
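
A back-of-the-envelope check on these questions, using only figures quoted in this thread (125 MHz clock, 1500k games/sec, 167 boards in parallel). It assumes all three figures describe the same configuration and borrows the ~110 moves/playout from my own CPU implementation, which may not hold for the FPGA; the function names are mine.

```python
CLOCK_HZ = 125e6         # FPGA board clock quoted in the thread
GAMES_PER_SEC = 1.5e6    # quoted throughput (1500k games/sec)
BOARDS_IN_FLIGHT = 167   # quoted number of boards simulated in parallel
MOVES_PER_GAME = 110     # ASSUMPTION: taken from the CPU light-playout figure


def cycles_between_finished_games(clock_hz, games_per_sec):
    """Clock cycles that pass between two consecutive completed games."""
    return clock_hz / games_per_sec


def cycles_per_game(clock_hz, games_per_sec, boards_in_flight):
    """Latency of one game in cycles, via Little's law (in flight = rate * latency)."""
    return boards_in_flight / games_per_sec * clock_hz


def cycles_per_move(clock_hz, games_per_sec, boards_in_flight, moves_per_game):
    """Cycles a single board spends on one move, under the assumptions above."""
    return cycles_per_game(clock_hz, games_per_sec, boards_in_flight) / moves_per_game


if __name__ == "__main__":
    print(cycles_between_finished_games(CLOCK_HZ, GAMES_PER_SEC))   # about 83.3
    print(cycles_per_game(CLOCK_HZ, GAMES_PER_SEC, BOARDS_IN_FLIGHT))  # about 13917
    print(cycles_per_move(CLOCK_HZ, GAMES_PER_SEC, BOARDS_IN_FLIGHT, MOVES_PER_GAME))  # about 126.5
```

If these assumptions hold, a game is harvested roughly every 83 clock cycles rather than every cycle, and each board spends on the order of 125 cycles per move, which would point away from a strict one-board-per-cycle line.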
As you may have noticed, I am looking forward to your paper.
On Tue, Jun 15, 2010 at 7:03 PM, Fuming Wang <fumingw85 at gmail.com> wrote:
> Hi Rene,
> Our design is fully pipelined, so we are able to simulate multiple games
> simultaneously. The way in which simulations are run in FPGA and in CPU
> is quite different, so direct comparison is not easy. If we want to simulate
> just one game, FPGA implementation is not 10x faster, however, if we want
> thousands of games simulated for a single board position, then FPGA is 10x
> faster. So, we are getting 1500k GAMES/sec, but only in the second sense.
> The clock rate of our FPGA board is only 125 MHz, so with better board/chip,
> we will still have 10-100 times improvement over the current result.
> On Wed, Jun 16, 2010 at 1:28 AM, René van de Veerdonk <
> rene.vandeveerdonk at gmail.com> wrote:
>> Could you please explain your approach a little bit? From the numbers you
>> quote, this sounds extremely positive, but I have a hard time understanding
>> how you achieve them. Taking 100k playouts/sec for 9x9 on my 2.4 GHz laptop
>> for my single-threaded bitmap-based light-playout implementation as an
>> example, with 110 moves/playout, this results in a little less than 240
>> clock-cycles/move. When I quickly looked up the Cyclone III specification, I
>> saw that the clock-speed for this FPGA tops out around 240 MHz, yet you
>> achieve 15x the throughput, i.e., you are 150x more efficient. This means
>> 1.8 clock-cycles/move. Without being able to make use of pipe-lining inside
>> the CPU (someone measured ~2 assembly instructions/clock-cycle for my bitmap
>> approach), this leads me to questions. First, are you running a single
>> threaded application, or playing on multiple boards at once? Second, are you
>> just replaying moves, or also generating them on the fly (about half of the
>> time is spent there in my implementation, more if you include updating the
>> data-structure to make that possible)? Third, are we using the same
>> For instance, I would find it much easier to believe that you
>> achieved 1500k moves/second instead of 1500k playouts/sec (with each
>> playout being ~110 moves). 200 clock-cycles/move sounds do-able if you can
>> avoid branching, memory lookups, or miscellaneous calculations by creating
>> fine-level parallelism in your FPGA-code and specializing functions on a per
>> grid-point basis. In a CPU-based application, this results in code-bloat
>> that will become counter-productive at some stage, may not be feasible in
>> all instances, and is more difficult to maintain. For an FPGA-based
>> application, however, this sounds entirely possible (not knowing anything
>> about FPGAs).
>> René van de Veerdonk
>> On Sat, Jun 12, 2010 at 10:37 AM, Fuming Wang <fumingw85 at gmail.com> wrote:
>>> Cyclone III
>>> 120,000 logical elements
>>> cycle time is linear in the number of moves to finish a game, which is
>>> approximately linear in the square of the board size.
>>>> - What FPGA? Virtex-6? Spartan-6?
>>>> - What size is the core in LUT's?
>>>> - Is your cycle time linear in the board size or in the number of
>>>> squares (i.e. quadratic in board size)? Or something else?
>>>> Computer-go mailing list
>>>> Computer-go at dvandva.org