[Computer-go] Evaluating improvements differently

Álvaro Begué alvaro.begue at gmail.com
Thu Apr 7 13:55:17 PDT 2011

Just to be clear, the whole point of what I said is that it's easier
to quantify the quality of the scores returned by the search than to
quantify the quality of the moves. I know people have used problem
sets before with little success, and what I am suggesting is
different, even though it might sound similar.


On Thu, Apr 7, 2011 at 4:48 PM, Don Dailey <dailey.don at gmail.com> wrote:
> On Thu, Apr 7, 2011 at 4:27 PM, terry mcintyre <terrymcintyre at yahoo.com>
> wrote:
>> To address Don's points, here are a few ideas:
>> One, pick situations where the choice of good moves is limited. These will
>> probably be "all or nothing" battles. If you pick move A, you win.
>> Otherwise, you lose, tough luck.
> The test will have some value,  but this does not simulate what actually
> happens in games,  only a subset of that.   It may be for example that in
> positions where there are 10  plausible moves is where a lot of the ELO
> comes from.   I don't really know, this is just an example.
> My own speculation on this is that by far the most relevant test is the
> kinds of moves you need to AVOID playing.   Go (and just about every other
> game) is more about avoiding stupid moves and errors than it is in finding
> some profound moves.   You don't "produce" wins,  you take them when there
> is opportunity and opportunity is provided by your opponent.
> Of course the program has to be skilled enough to find good moves when they
> exist, but the biggest problem in computer go is weak moves by the computer
> (where weak means giving the opponent opportunities.)
> The computer goes wrong by far the most where there is not a specific goal
> or target that it can understand - that's when it plays a "weak" move that a
> strong player will pounce on.
> In tennis,  I try to think in terms of getting the ball over the net so that
> it's the opponents turn to make a mistake.   This is an oversimplification,
> but it is essentially how it works.    Each point in tennis is lost by one
> of the players,  it's never "won" but of course we try to hit great shots so
> that the opponent has a much greater chance to "lose" the point.   However
> trying to hit those great shots is what causes you to lose a lot of points
> :-)
> Don
>> Two, test for proper followup. It's not enough to determine A, then forget
>> about the battle and play somewhere in another corner. If this is an
>> all-or-nothing battle, the program must continue to play the proper response
>> to every threat, until the job is done.
>> For extra credit, test what happens if the opponent continues to thrash
>> after the situation becomes hopeless - the program should not waste a move
>> in that area if it is not needed, but should take sente elsewhere.
>> In short, the program must master not only move A, but B, C, D, E, ... Z -
>> and one should test different move orders for the opponent, including feints
>> in other parts of the board.
>> Having mastered a few such test cases, now introduce a ko or seki to make
>> the analysis a bit more complex. Rinse and repeat.
>> Terry McIntyre <terrymcintyre at yahoo.com>
>> Unix/Linux Systems Administration
>> Taking time to do it right saves having to do it twice.
>> ________________________________
>> From: Don Dailey <dailey.don at gmail.com>
>> To: computer-go at dvandva.org
>> Sent: Thu, April 7, 2011 3:33:36 PM
>> Subject: Re: [Computer-go] Evaluating improvements differently
>> On Thu, Apr 7, 2011 at 2:27 PM, Álvaro Begué <alvaro.begue at gmail.com>
>> wrote:
>>> Hi,
>>> I haven't spent any time in go programming recently, but a few months
>>> ago I thought of a method to evaluate proposed improvements that might
>>> be much better than playing a gazillion games. A search results in two
>>> things: A move and a probability of winning (or a score that can be
>>> mapped into a probability of winning, but let's ignore that issue for
>>> now). Evaluating whether the moves picked by a strategy are good is
>>> really hard, but evaluating whether the estimate of a probability of
>>> winning is a good estimate seems much easier.
>>> For instance, take a database of games played by strong players.
>>> Extract a few positions from each game. Run your engine for some fixed
>>> amount of time on each position, and measure how well it predicted the
>>> winner after each position (cross entropy is probably the correct
>>> measure to use). Do this before and after the proposed modification
>>> and compare.
>>> Of course one has to be careful to pick reasonably well-played games
>>> (games played by top engines with more time per move than you'll use
>>> to evaluate your engine seems good enough, and will result in a much
>>> cleaner database than collecting games played by humans) and to have a
>>> large enough and varied enough set of positions. Also, one should
>>> worry about over-fitting for those particular positions, but one could
>>> use another set of positions for confirmation. These problems all seem
>>> manageable to me.
>>> It is possible that certain improvements can really only be measured
>>> by playing games (time control comes to mind), but I would imagine
>>> that for a large class of things this procedure can give you useful
>>> results in much more reasonable times.
>>> Your thoughts are appreciated.
>> I think this should be tried,  but I also believe you will be bitterly
>> disappointed.   The holy grail for me is to find some way to quickly test
>> versions and I have found that NO test does that better than playing games.
>>     The problem is that a real game is full of complex variables that
>> interact in strange and wonderful ways.
>> The reason a program wins or loses games is not as simple as just saying
>> it played a bad move,  or it failed to see this or that .   It's rather more
>> like the stock market.    Have you noticed that at the end of the day
>> someone reports on the stock market and always gives a reason why this or
>> that stock (or the entire nasdaq) rose or fell?     They like to pretend
>> they understand why but in reality it's just a wild guess.       It's that
>> way with why you win and lose games.   The reasons are very complicated -
>> you cannot just count the good moves and bad moves to get a global "score"
>> nor can you construct a good problem suite that will tell you how good your
>> program is (unless you are willing to accept about 2 or 3 Dan worth of
>> error.)
>> I am always interested in ideas like what you propose here and I think it
>> should be experimented with - I'm just very skeptical.   Been there, done
>> that.
>> Years ago there were tests based on counting how many moves a player made
>> that matched games of top grandmasters.  You would hide the answer,  pick a
>> move, then see if it matched the move "Bobby Fischer" would play.    Then
>> your skill would be computed based on how many of these you got "right."
>>  However recent studies show that 2 equally stronger players (computers in
>> this case) can vary significantly in choice of moves and in fact a
>> similarity test was constructed that showed that different equally strong
>> programs played much more differently than 2 versions of the SAME program
>> where one version was older and significantly weaker.     So the choice of
>> move had a lot more to do with style than it did strength.    That is the
>> very opposite of most peoples intuition.
>> A program can play very well but be weak because of the occasional
>> horrible move,   or it can play very "average" without making major errors
>> but still missing a lot of "brilliancies."    There is complex interaction
>> here that determines the actual RESULTS it will get in competition.
>> However I still like your idea and if you (or anyone else) tries it I am
>> very interested in your experiences with it.    I am always hoping something
>> better will come along ...
>> By the way,  there is also the notion of weighted problem sets.    I think
>> your idea has similarities.   You try to solve problems but some problems
>> get much different "scores" than others.    This is an attempt to improve
>> naive problem testing.   Another idea that has been tried is to test as many
>> programs as possible of widely varying strength on a zillion problems and
>> then fit a scoring method that predicts strength based on this.
>> Don
>>> Álvaro.
>>> _______________________________________________
>>> Computer-go mailing list
>>> Computer-go at dvandva.org
>>> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
>> _______________________________________________
>> Computer-go mailing list
>> Computer-go at dvandva.org
>> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
> _______________________________________________
> Computer-go mailing list
> Computer-go at dvandva.org
> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go

More information about the Computer-go mailing list