[computer-go] 50,576 pro/dan games without repetitions nor easily
detectable problems
Jacques Basaldúa
jacques at dybot.com
Wed Sep 20 03:47:15 PDT 2006
House, Jason J. wrote:
> Does anyone have a method for converting .go to .sgf?
It is *not* a .go file. As explained, it is a binary file.
The structure is described in my message. Given the remarks
about the size of each field, it is easy to translate it to
any language. In languages poorly suited to binary records,
you can always read it as an array of arrays
of 640 bytes each and move the data as required.
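As an illustration of the "array of arrays of 640 bytes" approach, here is a
minimal Python sketch. Only the 640-byte record size comes from the text
above. The function names are hypothetical, and the field layout inside each
record is not reproduced here, so records are left as raw bytes.

```python
from pathlib import Path

RECORD_SIZE = 640  # bytes per game record, as stated in the message


def split_records(data, record_size=RECORD_SIZE):
    """Split a raw byte string into fixed-size records."""
    if len(data) % record_size != 0:
        # Assumption: the file holds whole records only.
        raise ValueError("data length is not a multiple of the record size")
    return [data[i:i + record_size] for i in range(0, len(data), record_size)]


def read_records(path, record_size=RECORD_SIZE):
    """Read the whole database file and return a list of raw records."""
    return split_records(Path(path).read_bytes(), record_size)
```

Each element of the returned list is one 640-byte game record, from which the
individual fields can then be sliced according to the published structure.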
Hellwig Geisse wrote:
> I have a question:
> what was the exact method you used to compute the
> hash values?
This is a rather lousy custom hash involving multiplication
and bit logic. It looks good and could even be good, but
as it has not been studied, it could also have terrible properties
that make it very weak. I consider it good enough for the purpose.
Nor is it particularly fast to compute. It's not a great
invention, it's just the first thing I tried.
In pseudo-code:
HashLast = 0
Hash100 = 0
Hash200 = 0
for i = 1 to NMoves
    x = mov[i - 1].x
    y = mov[i - 1].y
    lo = 23*x*(i + 1) + x
    hi = 29*y*(i + 1) + y
    lo = (hi << 16) XOR lo
    lo = lo XOR HashLast
    ROTATE lo 1 bit to the right as a 32 bit register
    HashLast = lo
    if i == 100 then Hash100 = lo
    if i == 200 then Hash200 = lo
end for
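For those who prefer something runnable, the pseudo-code above can be sketched
in Python as follows. This is one reading of it, not a reference
implementation. The assumptions are mine: moves are given as (x, y) pairs of
small non-negative integers, and every intermediate value is truncated to 32
bits before the rotation. The names game_hashes and rotr32 are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rotr32(value, bits=1):
    """Rotate a 32-bit value to the right by the given number of bits."""
    value &= MASK32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def game_hashes(moves):
    """Return (Hash100, Hash200, HashLast, NMoves) for a list of (x, y) moves."""
    hash_last = hash100 = hash200 = 0
    for i, (x, y) in enumerate(moves, start=1):  # i runs from 1 to NMoves
        lo = 23 * x * (i + 1) + x
        hi = 29 * y * (i + 1) + y
        lo = ((hi << 16) ^ lo) & MASK32  # assumption: truncate to 32 bits
        lo ^= hash_last
        lo = rotr32(lo, 1)
        hash_last = lo
        if i == 100:
            hash100 = lo
        if i == 200:
            hash200 = lo
    return hash100, hash200, hash_last, len(moves)
```

Hash100 and Hash200 stay 0 for games shorter than 100 and 200 moves
respectively, exactly as in the pseudo-code.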
Why do we need a hash anyway?
It is for checking whether a game is already in the db.
What is certain is: same game => same hash.
The converse should be: the probability of same hash and different game
is < 1/2^64 (and < 1/2^96 if Hash200 <> 0), because the full 4 x 32 bit
key Hash100-Hash200-HashLast-NMoves is compared.
A bound like that can rarely be guaranteed, and it cannot be in this case
since the hash has not been studied, but note
that the only consequence of a failure is a false positive,
i.e. deciding that a game is already in the db and discarding it.
Nothing serious. If I manually corrected SGF warnings game by game, I could
admit many more games, but that small increase is not worth the effort.
The database is sorted by Hash100-Hash200-HashLast-NMoves.
Setting all hashes to 0 should improve file compression by about 600 KB
(50K games x 12 bytes of hash), because the hashes disturb compression.
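The duplicate check and the sort order described above could look roughly
like this in Python. It is only a sketch: the record layout (a dict with the
keys hash100, hash200, hash_last, n_moves and source) and the function name
dedup_and_sort are assumptions made for the example, not the actual database
format.

```python
def dedup_and_sort(records):
    """Drop records whose 4-part key was already seen, then sort by that key.

    The first record seen for a given key wins, so its source is kept.
    """
    def key_of(rec):
        return (rec["hash100"], rec["hash200"], rec["hash_last"], rec["n_moves"])

    seen = set()
    kept = []
    for rec in records:
        key = key_of(rec)
        if key in seen:
            continue  # treated as "game already in the db"
        seen.add(key)
        kept.append(rec)
    kept.sort(key=key_of)
    return kept
```

A hash collision between two different games would make the second one
disappear silently here, which is the false positive mentioned above.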
David Fotland wrote:
> Did you get the 40K IGS amateur games off the MFGO CD?
No, I did not, at least not knowingly. I don't know who collected
either 40K collection (they are very similar anyway). What I call MFOG
is what I found in Many Faces Of Go v 11. It is not a CD, it is a program
installed at a university leisure club (mainly a chess club).
I don't know who installed the software. It produced only 61+6 new entries.
The sources:
Many collections claim to be manually revised by dan players. That's probably
the reason they have illegal moves. For testing the legality of a move,
nothing beats a computer. Revising is a routine task and produces errors,
no matter how good the player is.
They may have been transcribed from magazines, or who knows. The AngYue collection
has 622 repeated games out of 6740 (about 9 %).
Of course, when a game is already in the db its source is not changed, so
the sources processed first have a higher chance of creating new entries.
srAngYue http://www.aygoschool.com/progames.html
sr40K_v1 http://planar.free.fr/42784games.tar.bz2
sr40K_v2 a similar 40K collection not available by http
srKGS http://www.u-go.net/gamerecords/
srMFOG Many Faces Of Go (an old version, probably)
srJIGO http://www.joot.com/jigo/games.shtml
srEuroAm http://sgf.mgoetze.net/
And, last but not least, Richard Brown wrote:
> A cursory glance at your data structure and web page indicates that
> the identity of the players has been discarded. Is that so?
Yes. There is no player information. Typical uses of a master
game collection are: openings, patterns for sorting heuristics, neural
network training and many statistically based methods.
What you propose sounds interesting, but I don't know if the information
is of good enough quality for that. If you are really interested, I could
send you the PB PW properties of a collection as they are. (Some, but few,
are even in Chinese.) If you can sort it, avoid repetitions, note misspelled
names, etc. and return a file like this: (Assume Abe_Kamejiro has ID 1,
An_ChoYoung has ID 2)
Abe Kamejiro; 1
Kamejiro, Abe; 1
Kamejiro; 1
KAMEJIRO; 1
Camejiro; 1
AbeKam; 1
An ChoYoung; 2
...
(Where the left part lists all the different ways the player's name was
found in what I sent you and the right part is the player's ID.) I know
it's a dirty job ;-)
We could create a *public* list of player names with numerical IDs
and add the ID to the database if the player is known, or 0 if unknown.
We don't need to do the whole database at once. Say I send you the
PB, PW of AngYue 5521x2 (only those that are in the db, obviously). If you
return the player list, it can be reused the next time with a new
collection, and only unknown players have to be sorted then.
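To make the proposed list format concrete, here is a small Python sketch of
how such a file could be loaded and used. It assumes one "name; ID" pair per
line, split at the last semicolon, as in the example above. The function
names parse_alias_lines and player_id are hypothetical.

```python
def parse_alias_lines(lines):
    """Build a {name as found in PB/PW: player ID} mapping from 'name; ID' lines."""
    aliases = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        name, sep, id_text = line.rpartition(";")
        if not sep:
            raise ValueError("missing ';' in line: " + repr(raw))
        aliases[name.strip()] = int(id_text)
    return aliases


def player_id(aliases, name):
    """Return the numeric ID for a PB/PW value, or 0 if the player is unknown."""
    return aliases.get(name.strip(), 0)
```

With a mapping like this, names that come back as 0 from a new collection are
exactly the ones that still need to be sorted by hand.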
Jacques.