# [Computer-go] Multi-armed bandit problem theory

Brian Sheppard sheppardco at aol.com
Wed Oct 26 16:19:56 PDT 2011

I tried to find those papers once, too, and I also failed. It seems to
predate Internet publishing. Some key results were proved in the 1970s.

I think the following ideas sketch a proof:

- sample each arm infinitely often
- such that the fraction of effort going to sub-optimal arms
decreases to zero
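These two conditions can be made concrete with a schedule such as eps_t = 1/t (my own choice for illustration, not from the thread): the expected number of exploratory pulls per arm grows like a harmonic sum, so it diverges, while the exploratory share of total effort vanishes. A minimal sketch:

```python
def exploration_profile(horizon, k=2):
    """Illustrate the two conditions under an assumed schedule eps_t = 1/t:
    with probability eps_t explore a uniformly random arm, else exploit.

    Expected exploratory pulls per arm = H(horizon)/k, which grows like
    ln(horizon)/k without bound (each arm is sampled infinitely often),
    while the exploratory fraction H(horizon)/horizon shrinks to zero."""
    harmonic = sum(1.0 / t for t in range(1, horizon + 1))
    pulls_per_arm = harmonic / k               # unbounded as horizon grows
    exploratory_fraction = harmonic / horizon  # vanishes as horizon grows
    return pulls_per_arm, exploratory_fraction
```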

Now look at the form of Hoeffding's inequality:
http://en.wikipedia.org/wiki/Hoeffding's_inequality. The UCB term is like
the functional inverse of the right-hand side. That is why it works for
bandit problems.
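To sketch that inversion (assuming the standard UCB1 form, which the thread does not spell out): Hoeffding bounds the probability that an empirical mean of n samples deviates from the true mean by eps by exp(-2*n*eps^2); setting that bound to t^-4 and solving for eps gives the familiar sqrt(2 ln t / n) exploration bonus.

```python
import math

def ucb1_index(mean, n_arm, t):
    """UCB1 index for one arm: empirical mean plus exploration bonus.

    Hoeffding: P(deviation >= eps) <= exp(-2 * n_arm * eps**2).
    Setting the right-hand side to t**-4 and solving for eps gives
    eps = sqrt(2 * ln(t) / n_arm), the UCB1 confidence radius."""
    return mean + math.sqrt(2.0 * math.log(t) / n_arm)
```

Note how the bonus shrinks as an arm accumulates samples but grows (slowly) with total time t, which is exactly what forces every arm to be revisited infinitely often.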

To go from bandit theory to UCT, you have to show that recursive application
of the same process converges, which is the proof that you see in the Auer
paper.

Note that there are solutions to the bandit problem that would not be
solutions for MCTS, because they do not converge recursively. E.g., you can
make an epsilon-greedy policy that works for a bandit problem ("works" ==
zero asymptotic regret), but would not converge to optimal in an MCTS
setting.
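A minimal illustration of such a policy (parameters and names are my own, not from the thread): eps_t-greedy with eps_t = min(1, c/t) satisfies the two bandit conditions above, since exploration never stops but its fraction of effort decays to zero.

```python
import random

def epsilon_greedy_bandit(true_means, horizon, c=5.0, seed=0):
    """eps_t-greedy on a stationary Bernoulli bandit (hypothetical sketch).

    eps_t = min(1, c/t) decays to zero, so each arm is still pulled
    infinitely often while the exploratory fraction of effort vanishes.
    Returns per-arm pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        eps = min(1.0, c / t)
        if rng.random() < eps:
            arm = rng.randrange(k)           # explore uniformly at random
        else:
            means = [sums[i] / counts[i] if counts[i] else float("inf")
                     for i in range(k)]
            arm = means.index(max(means))    # exploit current best estimate
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

The trouble in an MCTS setting is that the value backed up to a parent node is the average over all pulls, exploratory ones included, so greedy-in-the-limit arm selection alone does not make the recursive averages converge to the minimax value.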

Hope this helps.

Brian

-----Original Message-----
From: computer-go-bounces at dvandva.org
[mailto:computer-go-bounces at dvandva.org] On Behalf Of Petr Baudis
Sent: Wednesday, October 26, 2011 8:51 AM
To: computer-go at dvandva.org
Subject: Re: [Computer-go] Multi-armed bandit problem theory

On Wed, Oct 26, 2011 at 01:56:03PM +0200, "Ingo Althöfer" wrote:
> Not a direct answer, but some bit of information:
> Bandit theory was started in the early 1950s by Herbert Robbins
> (the same Robbins from the 1985 paper). However, he did
> not prove best possible bounds in the seminal paper.

Yes, I actually have a copy of that paper, but it doesn't seem to help
me better understand the later results.