[Computer-go] Evaluating improvements differently

terry mcintyre terrymcintyre at yahoo.com
Thu Apr 7 13:27:45 PDT 2011

To address Don's points, here are a few ideas:

One, pick situations where the choice of good moves is limited. These will 
probably be "all or nothing" battles.  If you pick move A, you win. Otherwise, 
you lose, tough luck.

Two, test for proper followup. It's not enough to determine A, then forget about 
the battle and play somewhere in another corner. If this is an all-or-nothing 
battle, the program must continue to play the proper response to every threat, 
until the job is done.

For extra credit, test what happens if the opponent continues to thrash after 
the situation becomes hopeless - the program should not waste a move in that 
area if it is not needed, but should take sente elsewhere.

In short, the program must master not only move A, but B, C, D, E, ... Z - and 
one should test different move orders for the opponent, including feints in 
other parts of the board. 

Having mastered a few such test cases, now introduce a ko or seki to make the 
analysis a bit more complex. Rinse and repeat. 

Terry McIntyre <terrymcintyre at yahoo.com>

Unix/Linux Systems Administration
Taking time to do it right saves having to do it twice.

From: Don Dailey <dailey.don at gmail.com>
To: computer-go at dvandva.org
Sent: Thu, April 7, 2011 3:33:36 PM
Subject: Re: [Computer-go] Evaluating improvements differently

On Thu, Apr 7, 2011 at 2:27 PM, Álvaro Begué <alvaro.begue at gmail.com> wrote:

>I haven't spent any time in go programming recently, but a few months
>ago I thought of a method to evaluate proposed improvements that might
>be much better than playing a gazillion games. A search results in two
>things: A move and a probability of winning (or a score that can be
>mapped into a probability of winning, but let's ignore that issue for
>now). Evaluating whether the moves picked by a strategy are good is
>really hard, but evaluating whether the estimate of a probability of
>winning is a good estimate seems much easier.
>For instance, take a database of games played by strong players.
>Extract a few positions from each game. Run your engine for some fixed
>amount of time on each position, and measure how well it predicted the
>winner after each position (cross entropy is probably the correct
>measure to use). Do this before and after the proposed modification
>and compare.
>Of course one has to be careful to pick reasonably well-played games
>(games played by top engines with more time per move than you'll use
>to evaluate your engine seems good enough, and will result in a much
>cleaner database than collecting games played by humans) and to have a
>large enough and varied enough set of positions. Also, one should
>worry about over-fitting for those particular positions, but one could
>use another set of positions for confirmation. These problems all seem
>manageable to me.
>It is possible that certain improvements can really only be measured
>by playing games (time control comes to mind), but I would imagine
>that for a large class of things this procedure can give you useful
>results in much more reasonable times.
>Your thoughts are appreciated.

I think this should be tried,  but I also believe you will be bitterly 
disappointed.   The holy grail for me is to find some way to quickly test 
versions and I have found that NO test does that better than playing games.     
The problem is that a real game is full of complex variables that interact in 
strange and wonderful ways.     

The reason a program wins or loses games is not as simple as just saying it 
played a bad move,  or it failed to see this or that .   It's rather more like 
the stock market.    Have you noticed that at the end of the day someone reports 
on the stock market and always gives a reason why this or that stock (or the 
entire nasdaq) rose or fell?     They like to pretend they understand why but in 
reality it's just a wild guess.       It's that way with why you win and lose 
games.   The reasons are very complicated - you cannot just count the good moves 
and bad moves to get a global "score" nor can you construct a good problem suite 
that will tell you how good your program is (unless you are willing to accept 
about 2 or 3 Dan worth of error.)  

I am always interested in ideas like what you propose here and I think it should 
be experimented with - I'm just very skeptical.   Been there, done that.  

Years ago there were tests based on counting how many moves a player made that 
matched games of top grandmasters.  You would hide the answer,  pick a move, 
then see if it matched the move "Bobby Fischer" would play.    Then your skill 
would be computed based on how many of these you got "right."      However 
recent studies show that 2 equally stronger players (computers in this case) can 
vary significantly in choice of moves and in fact a similarity test was 
constructed that showed that different equally strong programs played much more 
differently than 2 versions of the SAME program where one version was older 
and significantly weaker.     So the choice of move had a lot more to do with 
style than it did strength.    That is the very opposite of most peoples 

A program can play very well but be weak because of the occasional horrible 
move,   or it can play very "average" without making major errors but still 
missing a lot of "brilliancies."    There is complex interaction here that 
determines the actual RESULTS it will get in competition. 

However I still like your idea and if you (or anyone else) tries it I am very 
interested in your experiences with it.    I am always hoping something better 
will come along ...

By the way,  there is also the notion of weighted problem sets.    I think your 
idea has similarities.   You try to solve problems but some problems get much 
different "scores" than others.    This is an attempt to improve naive problem 
testing.   Another idea that has been tried is to test as many programs as 
possible of widely varying strength on a zillion problems and then fit a scoring 
method that predicts strength based on this.  



>Computer-go mailing list
>Computer-go at dvandva.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://computer-go.org/pipermail/computer-go/attachments/20110407/e3c009a1/attachment.html>

More information about the Computer-go mailing list