[Computer-go] A Linear Classifier Outperforms UCT on 9x9 Go
Brian Sheppard
sheppardco at aol.com
Wed Jun 29 14:53:38 PDT 2011
Note that using binary coding for the input layer reduces the complexity for
this case. Training is still proportional to Board Size, though.
From: computer-go-bounces at dvandva.org
[mailto:computer-go-bounces at dvandva.org] On Behalf Of Brian Sheppard
Sent: Wednesday, June 29, 2011 5:40 PM
To: computer-go at dvandva.org
Subject: Re: [Computer-go] A Linear Classifier Outperforms UCT on 9x9 Go
>Such a table has 47 million entries in 19x19 go and over half a million in
9x9 go. It requires an enormous amount of data to fill such a table with
enough samples to be statistically meaningful.
That's true, but there are two other aspects to the math.
1) Classifiers are expensive to train
Note that the paper compares on fixed trials rather than fixed time.
A network with N outputs, H hidden nodes and K inputs has H * (N + K)
weights. Here, K is the board size (9x9 = 81), and N varied from 10 to 40,
IIRC. Orego is training those weights on every position in the playout (that
is, roughly K moves in the worst case). Each update involves dozens of
floating point calculations.
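To put rough numbers on that cost (K and N come from the text; the hidden-node count H is my own guess, since the thread doesn't give it):

```python
# Back-of-the-envelope cost of training the classifier during playouts.
K = 81            # inputs: one per point on a 9x9 board
N = 40            # outputs, upper end of the range mentioned
H = 20            # hidden nodes -- hypothetical, not stated in the thread

weights = H * (N + K)          # weights touched by each update
updates_per_playout = K        # roughly one update per position, worst case
work_per_playout = weights * updates_per_playout

print(weights, work_per_playout)   # 2420 196020
```

Even with a modest H, that is on the order of a hundred thousand floating point operations per playout, which is the point of the comparison with Pebbles below.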
A typical Pebbles playout does *no* work that is proportional to the number
of squares on the board. In the tree search, only the RAVE calculations are
proportional to the number of empty squares. There is no operation (not even
the UCB, if you replace SQRT and LOG with table lookups) that requires
dozens of floating point calculations.
Basically: RAVE uses (a lot) less computation and gets a result that is
better. (But maybe not much better; hard to tell because there is no
comparison to Orego+RAVE.)
2) A progressive bias strategy table conditioned on a couple of moves could
be a practical way to augment UCT
This is not instead of patterns, but in addition to them. Like
progressiveBias[M] += K * ConditionalResultsMemory[Penultimate][Last][M],
where K is a smallish number.
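A minimal sketch of that bias term, with all names and the value of K hypothetical:

```python
# Progressive bias from a two-move conditional results table (sketch).
# conditional_results[(penultimate, last)][m] holds an averaged playout
# result for move m after that two-move context; K scales its influence.
from collections import defaultdict

K = 0.1  # "a smallish number"
conditional_results = defaultdict(lambda: defaultdict(float))

def add_bias(progressive_bias, penultimate, last, legal_moves):
    context = conditional_results[(penultimate, last)]
    for m in legal_moves:
        progressive_bias[m] += K * context[m]
```

Unsampled entries stay at zero and contribute nothing, which matches the "most are zero, which is fine" point below.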
For 19x19 you have 47 million entries, but most are zero. That is, they are
never sampled and therefore don't affect the bias result. Which is fine.
Consider that Peter has proven that LGRF2 works, and it stores only a
probability distribution consisting of a single move.
You also collect a huge amount of data. For example, 361 observations per
playout, tens of thousands of playouts per turn, and probably several turns.
And it's O(1) to do an update. For example, incorporate Peter's observation
about forgetting by updating according to (1 - alpha) *
ConditionalResultsMemory[Penultimate][Last][M] + alpha * Result.
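The forgetting update above is just an exponential moving average; a sketch, with alpha chosen arbitrarily:

```python
# O(1) update with forgetting: an exponential moving average of results.
# Result is 1.0 for a win, 0.0 for a loss; alpha is hypothetical.
from collections import defaultdict

def update(memory, penultimate, last, m, result, alpha=0.05):
    old = memory[(penultimate, last, m)]
    memory[(penultimate, last, m)] = (1 - alpha) * old + alpha * result

memory = defaultdict(float)
update(memory, 'P', 'L', 'M', 1.0)   # one winning sample
```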
MCTS doesn't need much to make an improvement. I am finding all the time
that small tweaks pick up a 1% gain over a panel of opponents. You just need
small biases in favor of better outcomes.
From: computer-go-bounces at dvandva.org
[mailto:computer-go-bounces at dvandva.org] On Behalf Of Don Dailey
Sent: Wednesday, June 29, 2011 2:40 PM
To: computer-go at dvandva.org
Subject: Re: [Computer-go] A Linear Classifier Outperforms UCT on 9x9 Go
On Wed, Jun 29, 2011 at 1:17 PM, Brian Sheppard <sheppardco at aol.com> wrote:
Why is a classifier better than having a lookup table indexed by
OurLastMove, OppLastMove, ProposedNextMove that returns the Wins / Trials
experienced when ProposedNextMove is played after the sequence OurLastMove,
OppLastMove?
Such a table has 47 million entries in 19x19 go and over half a million in
9x9 go. It requires an enormous amount of data to fill such a table with
enough samples to be statistically meaningful.
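For what it's worth, the entry counts follow from letting each of the three index moves be any intersection or a pass:

```python
# Size of a (OurLastMove, OppLastMove, ProposedNextMove) lookup table:
# each index ranges over every intersection plus pass.
entries_19 = (19 * 19 + 1) ** 3   # 362**3
entries_9  = (9 * 9 + 1) ** 3     # 82**3
print(entries_19, entries_9)      # 47437928 551368
```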
A classifier tries to be much smarter by generalizing, either by rules, by
NN, by naive Bayes, or by some other method.
And I think this classifier is a Neural Network that is built on the fly.
It could be built in advance, but I think the power of it is that it's built
from scratch before each move so it's much more relevant for the exact
position that is being searched. I suspect it would be much less useful if
it were generated in advance based on thousands of games.
It sounds crazy to me that it works at all as it has no real knowledge of
the position. However if you were given 10 or 20 moves without being
able to see the starting position, you probably could deduce a lot of
interesting things about the starting position. The most obvious is that
you know some of the points which were not occupied. And if you assume
the moves are reasonable, then you can also deduce certain things that are
probably true or false.
However a NN can probably see patterns in this that we would not even notice
or consider. So I would not mind seeing the experiment expanded to longer
move sequences.
Are the training cases for your classifier selected from only the UCT nodes,
or also from playout nodes?
Is the output of your classifier used to initialize the Wins / Trials values
for legal moves in new UCT nodes? Is that done by assuming a fixed number of
trials (how many?) and setting Wins = ClassifierOutput * Trials?
Is that the only use of the classifier in the system?
From: computer-go-bounces at dvandva.org
[mailto:computer-go-bounces at dvandva.org] On Behalf Of Peter Drake
Sent: Wednesday, June 29, 2011 11:20 AM
To: computer-go at dvandva.org
Subject: Re: [Computer-go] A Linear Classifier Outperforms UCT on 9x9 Go
On Jun 28, 2011, at 9:39 PM, Imran Hendley wrote:
Hi, long-time lurker and occasional poster here,
Thank you for the paper. I hope you don't mind me asking a few very basic
questions, since I am having trouble understanding exactly what you are
doing.
Let's say we are using a linear classifier. Then our output (the predicted
move) should look like:
argmax_i (y[i]), where y[i] = w1[i] . m1 + w2[i] . m2 + b
Where each w[i] is a weight vector for location i on the board, the m's are
the (column) input vectors (which I assume are 1 at the move location and
zero elsewhere), and b is the bias term.
There is a separate bias for each move, so b in your formula should be b[i].
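With a per-move bias b[i] and one-hot inputs, the prediction reduces to table lookups plus an argmax. A minimal NumPy sketch (the array shapes and names are my own, not from the paper):

```python
import numpy as np

def predict(w1, w2, b, last, penultimate):
    """Pick the move with the highest linear score.

    w1, w2: (moves, moves) weight matrices; b: (moves,) per-move biases.
    With one-hot inputs, w1 . m1 is just column `last` of w1, and
    similarly for w2, so no matrix multiply is needed.
    """
    y = w1[:, last] + w2[:, penultimate] + b
    return int(np.argmax(y))
```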
To train our classifier online, we want to do something like: (1) Generate a
prediction for a training example. (2) Calculate the error. (3) Update the
feature weights. (4) Repeat.
If I understand, online training happens during the course of one game, as
we are playing. Moreover, we are using our classifier to generate moves to
select in the first phase of our simulation, as a replacement for MCTS, and
before playouts.
Correct.
Now this is where I have to start guessing the details. Are our training
examples playouts, and is our error function just 0 if the playout wins, and
1 if it loses?
The "correct output" is 1 if the playout wins, 0 if it loses. The error is
the difference between the correct output and the actual output.
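Under that reading, the online update is essentially the delta rule. A sketch (learning rate and all names are my own assumptions):

```python
import numpy as np

def train_step(w1, w2, b, penultimate, last, played, won, eta=0.01):
    """One online update: target is 1.0 for a playout win, 0.0 for a loss.

    Only the weights touched by the one-hot inputs need adjusting,
    so the update is a handful of scalar operations.
    """
    y = w1[played, last] + w2[played, penultimate] + b[played]
    error = (1.0 if won else 0.0) - y
    w1[played, last] += eta * error
    w2[played, penultimate] += eta * error
    b[played] += eta * error
    return error
```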
And as we run more playouts, the classifier will update its weights and
select a different sequence of moves in the first phase of our simulation
(analogous to selecting different paths down the search tree based on node
scores in MCTS)? And when we use up our allotted time for one turn we just
return the next move (from the current position) that our classifier
predicts, based on its current weights?
We tried this, but the classifier fluctuates quite a bit. (This is, we
think, a desirable property, as it keeps up exploration.) Instead, we choose
as the actual move the move through which the most playouts were played.
The paper says we fix the number of moves we select with the classifier
before running playouts (unlike starting from the root and expanding in
MCTS). This is where things start getting really fuzzy for me. Do we
propagate the results of a playout back up this sequence? i.e. if we get a
win, do we perform updates of our classifier for each two-move sequence in
the full sequence?
Yes. The classifier therefore learns from the entire playout, not just from
the moves it generated itself. (This is vaguely analogous to RAVE.)
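A sketch of that: slide a two-move window over the whole playout and apply the update rule to every (penultimate, last, played) triple. The helper signature is my own:

```python
def train_on_playout(moves, won, train_step):
    """Train on every two-move context in a finished playout.

    moves: the full move sequence of the playout (classifier-selected
    prefix plus playout-policy moves); train_step: any update rule
    taking (penultimate, last, played, won).
    """
    for i in range(2, len(moves)):
        train_step(moves[i - 2], moves[i - 1], moves[i], won)
```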
I would really like to get to the deeper questions about interpreting what
is really going on, but I first need to make sure I am on the right page
here. Sincere apologies for the stupid questions. I really hope my
understanding didn't get derailed so early on that most of my questions in
this message are gibberish. But I did want to show that I actually made a
concerted effort to understand the paper before asking what on earth it is
all about!
No problem -- we look forward to any insights you can offer!
Peter Drake
http://www.lclark.edu/~drake/
_______________________________________________
Computer-go mailing list
Computer-go at dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go