# [Computer-go] Assessing Improvements

Brian Sheppard sheppardco at aol.com
Sat Feb 19 19:14:24 PST 2011

> I'd like to hear people's thoughts on how to best & most quickly determine
> if it is in fact an improvement.

Before commencing a statistical study, I recommend very thoroughly
testing your change by checking that it does what you want it to do. Did you
really capture the notion that you mean? Did you neglect some low-liberty
situation? Or an odd case where two stones are part of the same string? Does
your code work under all symmetries?
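The symmetry question can be checked mechanically. Here is a minimal sketch, where `policy_keeps` is a hypothetical stand-in for your own policy's include/exclude decision; the idea is that a verdict should not change under any of the eight board symmetries:

```python
# Sketch of a symmetry check: a policy on a square board should give the
# same verdict at every dihedral-group image of a coordinate.
# `policy_keeps` below is a placeholder, not a real Go policy.

SIZE = 9  # 9x9 board, 0-indexed coordinates

def transforms(col, row):
    """All 8 symmetry images of a coordinate on a SIZE x SIZE board."""
    m = SIZE - 1
    return [(col, row), (m - col, row), (col, m - row), (m - col, m - row),
            (row, col), (m - row, col), (row, m - col), (m - row, m - col)]

def policy_keeps(col, row):
    # Placeholder: keep moves off the first line. Swap in your own policy.
    return 0 < col < SIZE - 1 and 0 < row < SIZE - 1

def symmetric(col, row):
    """True if the policy gives the same verdict at every symmetric image."""
    verdicts = {policy_keeps(c, r) for c, r in transforms(col, row)}
    return len(verdicts) == 1

print(all(symmetric(c, r) for c in range(SIZE) for r in range(SIZE)))
```

A real harness would transform the whole position, not just one coordinate, but the loop structure is the same.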

Another key test is how the change behaves in real game positions. You should
save your program's own games, of course. And you can download tons of games
from the Internet. Would your policy exclude (or upgrade/downgrade) the move
played? How does it behave on the moves played by winners? Does your policy
work equally well in opening/midgame/endgame? Or Black/White? Or
your/opponent moves? Or human/computer moves?
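One way to sketch this "would your policy keep the move actually played?" check, assuming you have game records replayed into move lists (here `policy_keeps` and the toy records are placeholders, not real SGF data):

```python
# Hypothetical sketch: score a move policy against recorded games by
# asking, for each move actually played, whether the policy keeps it.

def policy_keeps(position, move):
    # Placeholder policy: keep any move off the first line of a 9x9 board.
    col, row = move
    return 1 <= col <= 7 and 1 <= row <= 7

def agreement_rate(games):
    """Fraction of played moves that the policy would keep."""
    kept = total = 0
    for moves in games:
        position = []  # a real harness would replay the full position here
        for move in moves:
            kept += policy_keeps(position, move)
            total += 1
            position.append(move)
    return kept / total if total else 0.0

# Toy records: two short games as (col, row) move lists.
games = [[(4, 4), (2, 2), (6, 6)], [(4, 4), (0, 0)]]
print(agreement_rate(games))  # 4 of the 5 toy moves survive: 0.8
```

Splitting `games` by phase, color, or winner gives the opening/midgame/endgame and winner/loser breakdowns directly.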

Something that I would like to do, but have not gotten around to, is to
generate cases where the program makes different moves under the new policy.
Maybe something like: play games using one policy, then use the other policy
as critic, and submit differences of opinion to an oracle. Maybe the oracle
could be your own program when thinking for 10x as long, or maybe an
independent arbiter like Fuego or Pachi, or have GnuGo play it to the end
after both moves, or maybe a bunch of oracles.
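The critic idea reduces to harvesting positions where the two policies pick different moves; only those need to go to the expensive oracle. A minimal sketch, with both policies as hypothetical stand-ins:

```python
# Run two policies over the same positions and collect disagreements,
# which would then be submitted to a slower oracle (a 10x-time search,
# Fuego, Pachi, GnuGo rollouts, ...). The policies here are placeholders.

def old_policy(position):
    return min(position)  # placeholder: pick the "smallest" candidate

def new_policy(position):
    return max(position)  # placeholder: pick the "largest" candidate

def disagreements(positions):
    """Positions where the two policies differ, with both choices."""
    out = []
    for pos in positions:
        a, b = old_policy(pos), new_policy(pos)
        if a != b:
            out.append((pos, a, b))
    return out

positions = [[(2, 2)], [(2, 2), (6, 6)], [(4, 4), (3, 3)]]
print(len(disagreements(positions)))  # the last two positions differ: 2
```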

Statistical study is an obvious technique, but should be used in
combination.

- Self-play tends to exaggerate differences, because the players are so
similar to each other. Programs tend to drift into situations where they
disagree, and in this case they will disagree about exactly one thing. (This
doesn't mean that you shouldn't do self-play; just balance this against
other evidence.)

- Testing against any specific opponent will eventually lead to defeating
that opponent. Your testing method will bias your decisions about what to
work on.

- One variant can mishandle some early game situation, leading to losses
that would not occur in competitive games. Accordingly, you should do 9x9
testing using a book. E.g., Fuego has a reasonable manually selected book.
At a minimum, you should have your program randomly choose between opening
with D5 and E5 as black.

- Differences are likely to be small.
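Because the differences are small, it is worth knowing up front how many games a win-rate measurement can resolve. A quick normal-approximation sketch (standard error of a win rate p over n games is sqrt(p(1-p)/n)):

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Approximate 95% confidence half-width for win rate p over n games."""
    return z * math.sqrt(p * (1 - p) / n)

def games_needed(delta, p=0.5, z=1.96):
    """Games needed before the 95% half-width shrinks below delta."""
    return math.ceil((z / delta) ** 2 * p * (1 - p))

# Resolving a 2-percentage-point edge (52% vs 50%) takes roughly
# 2,400 games, which is why quick eyeball tests mislead.
print(games_needed(0.02))
```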

- It is quite likely that your variation would succeed, if it were combined
with other changes. E.g., your program handles some cases, but now runs into
more difficult situations.

- You might get different results at different search efforts. Adding
"knowledge" might look good at fast searches, but be meaningless at slow
searches. A lot depends on how much search effort is necessary to find this
issue on its own.

- You should know what your variation costs in execution speed. I wouldn't
make too much of this, but you should measure. By the time your program is
really strong, it will be so slow that you can add almost anything without
making it slower. :-)
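Measuring the cost can be as simple as comparing playouts per second before and after the change. A sketch, with `playout` as a stand-in for one Monte Carlo playout in your engine:

```python
import time

def playout(seed):
    # Placeholder busy-work standing in for one playout; the real cost
    # comes from your own engine's playout loop.
    total = 0
    for i in range(1000):
        total += (seed * 1103515245 + i) % 65536
    return total

def playouts_per_second(n=2000):
    """Time n playouts with a monotonic high-resolution clock."""
    start = time.perf_counter()
    for seed in range(n):
        playout(seed)
    elapsed = time.perf_counter() - start
    return n / elapsed

# Record this figure for the baseline and for the variant; the ratio is
# the real speed cost of the change.
print(playouts_per_second() > 0)
```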

I like CGOS testing, because it gives me data against diverse opponents at a
relatively slow playing speed.

Wait, did you say "quickly"? Well, then never mind...

Brian
