# Multi-armed bandits with discount factor near one: the Bernoulli case

**F. P. Kelly**

Annals of Statistics, 9 (1981) 987--1001.

Available from JSTOR.

## Abstract

Each of n arms generates an infinite sequence of Bernoulli random variables.
The parameters of the sequences are themselves random variables, and are
independent with a common distribution satisfying a mild regularity
condition. At each stage we must choose an arm to observe (or pull) based on
past observations, and our aim is to maximize the expected discounted sum of
the observations. In this paper it is shown that, as the discount factor
approaches one, the optimal policy tends to the rule of least failures,
defined as follows: pull the arm that has incurred the fewest failures; if
this does not determine an arm uniquely, select from among the arms with the
fewest failures one with the largest number of successes.
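The least failures rule described above is simple enough to sketch directly. The following is a minimal illustration, not code from the paper; the representation of an arm as a (successes, failures) pair of observed outcome counts is an assumption made here for concreteness.

```python
def least_failures_arm(arms):
    """Index of the arm to pull under the least failures rule:
    fewest observed failures, ties broken by most observed successes.

    arms: list of (successes, failures) pairs, one per arm.
    (This pair representation is a hypothetical choice for illustration.)
    """
    # Sort key: primary = failure count (ascending),
    # secondary = success count (descending, via negation).
    return min(range(len(arms)), key=lambda i: (arms[i][1], -arms[i][0]))


# Example: arm 2 has zero failures, so it is pulled first.
print(least_failures_arm([(3, 1), (5, 1), (0, 0)]))  # -> 2

# Tie on failures between arms 0 and 1: arm 1 has more successes.
print(least_failures_arm([(3, 1), (5, 1)]))  # -> 1
```

Note that the rule is a play-the-winner-type policy: a success leaves the failure count unchanged, so the same arm remains the selection after a success.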

Keywords: Bernoulli bandit process, Markov decision process, multi-armed
bandit, limit rule, play-the-winner rule, least failures rule, discount
optimality
