[Computer-go] dealing with multiple local optima

Álvaro Begué alvaro.begue at gmail.com
Fri Feb 24 11:21:40 PST 2017

I should point out that Reinforcement Learning is a relatively unimportant
part of AlphaGo, according to the paper. They only used it to turn the
move-prediction network into a stronger player (presumably increasing the
weights of the layer before SoftMax would do most of the job, by making the
highest-probability move the vast majority of the time). This stronger
player was then used to label positions by the result of self-play, and
these labelled positions were then used to learn a "value network" (what we
call an "evaluation function" in computer chess). The final AlphaGo playing
program doesn't actually use the RI network at all.

> Is there any success case of a chess, go, or any kind of complex
strategic game playing algorithm, where it gained expert strength without
domain knowledge such as expert game examples?

Yes! In 2015 I made a Spanish checkers program that used a neural network
as evaluation function (121 binary inputs, two hidden layers with 32 units
each and ReLU activation functions and one tanh output unit). I started by
generating 6-men tablebases, because the endgames are hard to get right
with shallow searches, and it is critical to know them to learn what
"enough advantage to win" is. I bootstrapped the learning process by
playing games with random evaluation function and depth-6 search. Then I
used these games to train the neural network, then played games with the
neural network, at around 0.1s/move (that's about 100K nodes searched).
Rinse, repeat (3 times, I think). The first time around, the program didn't
really know if it was better to have more pieces than the opponent or
fewer. By the 3rd iteration it had at least as much positional knowledge of
the game as I do, and I spent a lot of time in my teenage years playing
this game.

One important detail: The first few moves of the games were played using
UCT as a sort of opening book, to create a lot of variety. I should also
point out that the last move in the UCT tree is pretty much random. I
believe both AlphaGo and Giraffe used random moves when generating
databases to learn from.

I sent the program to my friend José Manuel Morán, which is probably the
strongest player in Spain (there are several comparably strong players in
Portugal, where the game is more popular). After a few days his report was
"It cannot be beaten."

Perhaps alpha-beta on a modern computer is too successful for this game to
really be able to tell if the evaluation function was very good, but it was
certainly not bad, and I didn't use any human knowledge anywhere in the
process of producing it.


On Fri, Feb 24, 2017 at 3:39 AM, Minjae Kim <xiver77 at gmail.com> wrote:

> I've recently viewed the paper of AlphaGo, which has done gradient-based
> reinforcement learning to get stronger. The learning was successful enough
> to beat a human master, but in this case, supervised learning with a large
> database of master level human games was preceded the reinforcement
> learning. For a complex enough game as go, one can expect that the search
> space for the policy function would not be smooth at all. So supposedly
> supervised learning was necessary to guide the policy function to a good
> starting point before reinforcement. Without such, applying reinforcement
> learning directly to a random policy can easily make the policy stuck at a
> bad local optimum. I could have a miunderstanding at this point; correct me
> if so, but to continue on: if it is hard to have "the good starting point"
> such as a trained policy from human expert game records, what is a way to
> devise one. I've had a look on NEAT and HyperNEAT, which are evolutionary
> methods. Do these evolutionary algorithms scale well on complex strategic
> decision processes and not just on simple linear decisions such as food
> gathering and danger avoidance? In case not, what alternatives are known?
> Is there any success case of a chess, go, or any kind of complex strategic
> game playing algorithm, where it gained expert strength without domain
> knowledge such as expert game examples?
> _______________________________________________
> Computer-go mailing list
> Computer-go at computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://computer-go.org/pipermail/computer-go/attachments/20170224/fd2458da/attachment.html>

More information about the Computer-go mailing list