[computer-go] A Chinese Puzzle

Richard Brown rbrown at uwsa.edu
Fri Sep 22 14:28:24 PDT 2006


I've begun the task of attempting to assign a unique integer
identifier to each known go player for whom it is possible to
unambiguously determine his or her identity.

I believe that such identifiers will be of some benefit to
go-programmers if they are shoe-horned into the data structure
described by Jacques Basaldúa.  (The identifiers, that is, not
the go-programmers!)  In particular I expect the identifiers
to be of benefit in permitting a machine-learned "fingerprint"
of a player's behavioral _style_ -- via a method of neural nets
and Bayesian reasoning such as that which was vaguely described
on the computer-go list, earlier.  (Stay tuned to this forum
for more about that; it progresses apace.)

Beyond the scope of the data structure and my own AI project,
I feel also that there is a place in the general go-playing
community for such standardization.  That alone is sufficient
reason for my undertaking the task, its potential usefulness
to programmers notwithstanding.

Standardizing Asian proper names for those who do not read
Chinese, Korean, or Japanese presents a unique challenge.
When the sources of those names are game records from all
over the globe, that challenge is magnified:  The myriad of
sources brings with it a myriad of representations.

One problem is that anyone whose name can be written using
Chinese Characters, which may include Japanese, Koreans, and
of course the Chinese people themselves, may find that his
or her name will be pronounced according to "local" rules,
rather than the way that person himself or herself pronounces
his or her own name!  Chinese, Japanese, and Koreans will all
have different pronunciations for the same written symbol.

But I think that it is important to note, for example, that no
matter how one might pronounce "高德納", he remains Donald Knuth.

Even for those who do read Chinese characters (hanzi), there
exist old-style hanzi, simplified hanzi, Singapore's obsolete
simplified hanzi, Korean hanja, and Japanese kanji (leaving
aside the issue of ancient obsolete forms).  Japanese kanji,
and Korean hanja, tend to be written using older, less simplified
forms than those in which Chinese is written today, as the Chinese
hanzi have undergone several modernizations, but kanji and hanja
too, in many cases, differ even from the older Chinese characters,
having been imported to Japan and Korea so long ago that the
forms diverged early on.  [Cf., e.g., the "wei" character in
the Chinese "weiqi" and the "i" character in the Japanese "igo",
which are essentially two different forms of the same "ideogram".
This latter term may no longer be "PC"; I'm not sure about it.
Just don't call them hieroglyphics, please!]

Then there is the problem of Romanization.  Several different
systems of transliteration exist, each with its own peculiar
spelling conventions.  For Chinese names, the two major schemes
(pinyin vs. the older Wade-Giles system) are sometimes a source
of confusion, as for example, in "Wei Ch'i" vs. "weiqi".

For proper names, the situation can be even worse.

In Japanese names, because there are by design multiple
"readings" for kanji, that is, _on_ readings vs. _kun_
readings -- and perhaps more than one of each! -- the
situation may seem hopelessly bewildering to one who is
not initiated into their use, and mistakes will be made.

Korean names in particular present a thorny problem.

Regarding Romanization, over time, there have been many
"official" transliteration systems.  While Korean itself
is generally written using phonetic characters (hangul),
some words -- in particular proper names, the meat of the
matter here -- are often written using hanja.

A certain Mr. Cho (in hangul 조훈현; in hanja 曺薰鉉) may
see his name Romanized as Cho HoonHyun, Cho Hunhyun, Cho
Hun-hyon, Jo Hun-Hyeon, or -- get this -- Soo Hun Yong!

It may even transpire that some, being understandably
confounded by such a situation, will confuse him with
a young _lady_ whose name is variously transcribed as
Cho HyeYeon, Cho Hye-yeon, or Cho Heyeon!  To top it
all off, Mr. Cho will find that the Chinese refer to
him as Cao Xunxuan (曹薰铉).

If you have really good eyes, or if you can enlarge the
image of the page you're reading, you will note a few
subtle differences in the former (Korean hanja) and the
latter (Chinese hanzi) versions of the parenthesized
names above, but they both refer to the same individual.

The Japanese kanji version of Cho's name (曺薫鉉) differs
yet again, from either of the other versions.  And the
differences are not merely a matter of font or typeface.
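
To make the point concrete, here is a minimal Python sketch that
compares the three character-based spellings of Cho's name, code
point by code point.  The strings are copied from the text above;
the helper names are invented purely for illustration.

```python
# Compare the hanja, hanzi, and kanji spellings of the same name.
# differing_positions and code_points are hypothetical helpers,
# not part of any library.

HANJA = "曺薰鉉"  # Korean hanja, as given in the text
HANZI = "曹薰铉"  # Chinese hanzi, as given in the text
KANJI = "曺薫鉉"  # Japanese kanji, as given in the text


def differing_positions(a: str, b: str) -> list[int]:
    """Return the indexes at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def code_points(s: str) -> list[str]:
    """Render each character as a U+XXXX code point label."""
    return [f"U+{ord(ch):04X}" for ch in s]


if __name__ == "__main__":
    for label, name in (("hanja", HANJA), ("hanzi", HANZI), ("kanji", KANJI)):
        print(label, name, " ".join(code_points(name)))
    print("hanja vs hanzi differ at", differing_positions(HANJA, HANZI))
    print("hanja vs kanji differ at", differing_positions(HANJA, KANJI))
```

Because the variants are distinct code points rather than distinct
glyphs for one code point, a plain string comparison treats them as
three unrelated names.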

Consider Google:  Unlike what happens when you Google in a
Western language, where the _font_ does not matter at all,
when you Google for each of the four variants above, you'll
get different results, just as you would if you Googled for
"Cho HoonHyun" vs. "Jo Hun-Hyeon".  Kids, try this at home!

And yet, all of these representations refer to the same individual.

In addition to this, multiple electronic sources for game
records may have -- nay, invariably have -- dissimilar
character-encodings for Asian characters.  Before Unicode
came along, Asian characters were encoded for computers in
quite a variety of systems, which are, not surprisingly,
completely incompatible with each other.
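
As a small illustration, the Python sketch below encodes the hangul
spelling of Cho's name in one legacy encoding and in UTF-8, and then
tries to read the legacy bytes back under several guesses.  The codec
names are standard Python ones; the helper name and the list of
candidate encodings are my own arbitrary choices.

```python
# Encode one name two ways, then try to decode the legacy bytes
# under several candidate encodings.  try_decodings is a hypothetical
# helper; the candidate list is an arbitrary choice for illustration.

NAME = "조훈현"  # hangul spelling from the text

CANDIDATES = ("utf-8", "euc_kr", "shift_jis", "gb2312", "big5")


def try_decodings(raw: bytes, codecs=CANDIDATES) -> dict:
    """Map each codec name to the decoded text, or None on failure."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


if __name__ == "__main__":
    legacy = NAME.encode("euc_kr")  # a pre-Unicode Korean encoding
    modern = NAME.encode("utf-8")
    print("euc_kr bytes:", legacy.hex(" "))
    print("utf-8 bytes: ", modern.hex(" "))
    for codec, text in try_decodings(legacy).items():
        print(f"{codec:10} -> {text!r}")
```

Only the guess that matches the original encoding recovers the name;
the others either fail outright or yield unrelated characters, which
is why a game record of unknown origin can be so hard to read.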

Given this state of affairs, one might be tempted to go
along with Fred Astaire, who sang,

       You say EEither and i say EYEither;
       You say NEEither and i say NYEither.
       EEither, EYEither, NEEither, NYEither;
       Let's call the whole thing off!

       You say toMAYto, i say toMAHto
       You eat poTAYto and i eat poTAHto
       ToMAYto, toMAHto, poTAYto, poTAHto,
       Let's call the whole thing off!

And yet, something can be done about this "Chinese puzzle".

Namely, someone who can parse these writing systems and
who knows about the culture of go -- that is to say,
someone who knows something of the biographies of these
famous and important go-players, and who can identify
them and tell them apart -- may establish a standard.

I find myself in that unique position; Jacques' project
has provided the opportunity for me to make it a reality.

With the multiplicity of variant representations, spellings,
and outright typos or misspellings, of these names in the
.sgf sources, it may seem a daunting, even impossible, task.

And of course the games from go-servers present another
problem, namely:  Who the heck are all those people?

I'm _sure_ I don't know.

But in the cases where I _do_ know exactly who someone is,
the task is one for which I find, happily, that I'm already
armed with the necessary and sufficient tools to get 'er done.

The database at gobase.org is particularly useful because it
already lists many of the variants [e.g., as for Cho above],
but being able to Google Asian web sites for Chinese, Korean,
and Japanese names in their native languages -- or in all
three languages -- is helpful too.  Certain other tools,
such as on-line character dictionaries, and software like
the NJStar suite, also serve to ease the matter considerably.

A collateral byproduct of the effort would be a table, such
that, given a string like "PB[Yi Se-tol]" from an .sgf file,
one could search the table for "Yi Se-tol" or "Li SeDol" to
discover that these are among the many aliases for Lee Sedol.

One could search for "李世乭" or "이세돌" to discover the same fact.

Similarly one could search for the Japanese kanji or Chinese
hanzi used to represent Lee (if that _is_ his real name!)
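
A rough idea of how such a table might be used, as a minimal Python
sketch.  The integer identifiers, the choice of "canonical" spellings,
and all function names are assumptions made up for this example; only
the aliases themselves come from the text above.

```python
import re
import unicodedata

# Hypothetical identifiers; the numbers are invented for illustration.
PLAYERS = {
    1: "Lee Sedol",
    2: "Cho Hunhyun",
}

# Aliases taken from the text; a real table would be far longer.
ALIASES = {
    1: ["Lee Sedol", "Yi Se-tol", "Li SeDol", "李世乭", "이세돌"],
    2: ["Cho HoonHyun", "Cho Hunhyun", "Cho Hun-hyon", "Jo Hun-Hyeon",
        "조훈현", "曺薰鉉", "曹薰铉", "曺薫鉉", "Cao Xunxuan"],
}


def normalize(name: str) -> str:
    """Fold case and drop spaces and hyphens so trivial variants match."""
    name = unicodedata.normalize("NFC", name).casefold()
    return re.sub(r"[\s\-]+", "", name)


# Build the lookup index: normalized alias -> player identifier.
INDEX = {normalize(a): pid for pid, names in ALIASES.items() for a in names}

# Matches PB[...] or PW[...] values, allowing backslash-escaped characters.
PLAYER_PROP = re.compile(r"(PB|PW)\[((?:\\.|[^\]\\])*)\]")


def lookup(name: str):
    """Return the identifier for a known alias, else None."""
    return INDEX.get(normalize(name))


def players_in_sgf(sgf: str) -> dict:
    """Map 'PB'/'PW' to identifiers (or None) for one SGF record."""
    return {prop: lookup(value) for prop, value in PLAYER_PROP.findall(sgf)}


if __name__ == "__main__":
    record = "(;GM[1]PB[Yi Se-tol]PW[Cho Hun-hyon])"
    for prop, pid in players_in_sgf(record).items():
        print(prop, pid, PLAYERS.get(pid, "unknown"))
```

The normalization step only smooths over case, spacing, and hyphens.
Genuinely different spellings still need their own rows in the table,
and a name that is not listed simply comes back as unknown rather than
being guessed at.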

[As an aside, it's the same family name as that of Syngman
Rhee, South Korea's first president, whose own name is at
times given as "이승만" or "리승만", and which is spelt as
"Yi Seungman" or "Ri Seungman" under Revised Romanization,
and as "I Sŭngman" or "Ri Sŭngman" under McCune-Reischauer
Romanization.  (Another variant of McCune-Reischauer is used
as the official system in North Korea.)  Sheesh!]

Or, one could even search the table for "ľ´å¼ÎÄÐ" to find,
for example, that the .sgf file of unknown origin that one
has in hand holds the record of a game played by Kimura Yoshio.
(Where Kimura is the family name, and Yoshio the given name.)

[My apologies and condolences to those who are still using
7-bit mail-reading programs, as they will be unable to see
the characters above.]

--
Rich

