[computer-go] analysis of UCT and other bandit algorithms for
tree search
Jason House
jason.james.house at gmail.com
Tue Apr 24 10:30:44 PDT 2007
I haven't gotten to read through your paper as carefully as I'd like yet,
but I do have a few observations that may be of benefit to other readers on
the list... Mostly observation of assumptions used in the paper.
1. A max tree is used instead of minimax
2. Rewards can be more than just 1 or 0 (win or loss).
3. The goal is to find the optimal leaf rather than the optimal subtree.
I'm not trying to say anything about the usefulness of the result, but
rather helping people decide how interested they'd be in the paper. There
are likely counters to the points above. For example, the addresses #1 by
saying "the minimax problem is a direct generalization of the results
presented"
On 4/24/07, Remi Munos <remi.munos at inria.fr> wrote:
>
> Hi,
>
> I have been encouraged to post a message to this list because we wrote a
> paper
> on the analysis of UCT and other bandit-based methods for tree search, and
> this might be useful for computer-go.
>
> The paper "Bandit Algorithms for Tree Search" is available as an INRIA
> technical report at: http://hal.inria.fr/inria-00136198
>
> In short, we start by explaining why UCT may perform very poorly in some
> cases
> because of excessive optimism. The regret of UCT is asymptotically O(log
> n)
> but with a initial period that may be prohibitively long.
>
> In order to avoid this problem, we need to modify the confidence sequence
> so
> that to increase exploration when it is necessary. We propose an algorithm
> (bandit algorithm in smooth trees) that takes into account possible
> smoothness in the tree, namely that the "bias" of the rewards along an
> optimal branch is decreasing with the depth of the considered nodes in the
> branch, where the "bias" of a node is defined as the difference between
> the
> value of the node and the expectation of the rewards obtained from that
> node
> (ie. the difference between the max, or min-max, value and a Monte-Carlo
> estimate of its value). Like UCT, the choice of an action depends on upper
> confidence bounds, but here, the bound of each node is computed in terms
> of
> the bounds of its children, as well as a direct Chernoff-Hoeffding bound
> based on the average rewards and the smoothness of the tree. We show that
> the
> algorithm spend a O(log(n)) time on suboptimal branches, thus essentially
> exploring the optimal branch. The prior smoothness coefficient of the tree
> is
> a parameter of the algorithm. If set to 0, this algorithm is nothing else
> than UCT. If set to infinity, this algo is a flat-UCB performed on the
> leaves. If a correct parameter is set, it performs better than both
> extremes.
> Of course the setting of a correct value of the parameter for the specific
> game of go is left to the experts, which I am not, unfortunately!
>
> Since several go programs are using UCT ideas, we thought this modified
> version may be of interest. The paper consider the max-max tree case but
> extension to min-max search is straightforward.
>
> Best, Remi Munos
>
> _______________________________________________
> computer-go mailing list
> computer-go at computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://computer-go.org/pipermail/computer-go/attachments/20070424/46426c03/attachment.htm
More information about the computer-go
mailing list