The Multi-Armed Bandit Problem

Problem Setup

The multi-armed bandit problem can be described as a Markov decision process, a tuple $\langle \mathcal{X}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, with only one state.

  • $\mathcal{X} = \{x\}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix
  • $\mathcal{R}$ is a reward function; the immediate reward is given by $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$
  • $\gamma \in [0, 1]$ is a discount factor; the problem is called undiscounted when $\gamma = 1$.
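
Because there is only one state, the MDP collapses to a set of arms, each with its own reward distribution. As a minimal sketch, assuming Bernoulli rewards purely for illustration (the setup above does not require them):

```python
import random

class BernoulliBandit:
    """A one-state MDP: pulling arm `arm` yields reward 1 with probability
    success_probs[arm], and 0 otherwise."""

    def __init__(self, success_probs):
        self.success_probs = success_probs  # one success probability per arm

    def pull(self, arm):
        """Select action `arm` and return the immediate reward."""
        return 1.0 if random.random() < self.success_probs[arm] else 0.0

# Example: a 3-armed bandit; the third arm has the highest expected reward.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(2)
```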

$\pi : \mathcal{X} \to \mathcal{A}$ is the policy; $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the action-value function.

$$
\begin{align}
Q^\pi(x, a)
&= \mathbb{E} [\sum_{t=0}^{\infty} \gamma^t R_{t+1}\vert x, a] \\
&= \mathbb{E} [\sum_{t=0}^{\infty} R_{t+1}\vert x, a] & \text{let}\ \gamma = 1
\end{align}
$$

The value function $V^\pi : \mathcal{X} \to \mathbb{R}$ is (when $\gamma=1$):

$$
\begin{align}
V^\pi(x) = \mathbb{E} [\sum_{t=0}^{\infty} R_{t+1}\vert x] \quad x \in \mathcal{X}
\end{align}
$$

The optimal value function and the optimal action-value function are related by:

$$
\begin{align}
V^*(x) = \sup_{a \in \mathcal{A}} Q^*(x, a) \quad x \in \mathcal{X}
\end{align}
$$

We can minimize the total regret as a proxy for maximizing the cumulative reward.
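
A common definition from the bandit literature, using the immediate reward function $r$ from the setup above and writing $a_t$ for the action selected at step $t$, is the total regret after $T$ steps:

$$
\begin{align}
L_T = \mathbb{E} \left[ \sum_{t=1}^{T} \left( \max_{a \in \mathcal{A}} r(x, a) - r(x, a_t) \right) \right]
\end{align}
$$

Since $\max_{a \in \mathcal{A}} r(x, a)$ does not depend on the policy, minimizing $L_T$ is equivalent to maximizing the expected cumulative reward $\mathbb{E} \left[ \sum_{t=1}^{T} r(x, a_t) \right]$.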

Monte-Carlo Method

The idea is that one can estimate the value of each action by computing a sample mean of the rewards observed when that action was selected:
$$
\hat Q_{t}(x,a)=\frac {1} {N_t(x,a)} \sum_{\tau=0}^{t} r_{\tau} \, \unicode{x1D7D9}_{a_\tau = a}
$$
where $N_t(x,a) = \sum_{\tau=0}^{t} \unicode{x1D7D9}_{a_\tau = a}$ is the number of times action $a$ has been selected up to time $t$.

From this one can easily derive the incremental update of the so-called every-visit Monte-Carlo method:
$$
\hat Q_{t+1}(x,a)=\hat Q_t(x,a)+\frac {1} {N_{t+1}(x,a)} \left(r_{t+1} - \hat Q_t(x,a)\right) \unicode{x1D7D9}_{a_{t+1} = a}
$$
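
As a minimal sketch of the incremental estimator, assuming Bernoulli arms and a simple round-robin choice of actions purely for illustration (neither is prescribed by the update rule):

```python
import random

def monte_carlo_estimates(success_probs, num_steps=10_000):
    """Estimate Q(x, a) for every arm with the incremental every-visit update."""
    num_arms = len(success_probs)
    q_hat = [0.0] * num_arms  # current estimates \hat{Q}_t(x, a)
    counts = [0] * num_arms   # visit counts N_t(x, a)

    for t in range(num_steps):
        a = t % num_arms  # round-robin action choice, for illustration only
        r = 1.0 if random.random() < success_probs[a] else 0.0
        counts[a] += 1
        # \hat{Q}_{t+1} = \hat{Q}_t + (r_{t+1} - \hat{Q}_t) / N_{t+1}
        q_hat[a] += (r - q_hat[a]) / counts[a]

    return q_hat

print(monte_carlo_estimates([0.2, 0.5, 0.7]))  # estimates approach [0.2, 0.5, 0.7]
```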

Greedy Algorithm

$\epsilon$-greedy (exploration) strategy

  • Take the greedy action $a_{t+1} = \underset{a \in \mathcal{A}}{\arg\max}\ \hat{Q}_t(x, a)$ with probability $1-\epsilon$
  • Draw the next action $a_{t+1}$ uniformly at random with probability $\epsilon$, as sketched below
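
As a sketch of the selection rule, assuming the estimates $\hat{Q}_t(x, \cdot)$ are stored in a plain list indexed by arm:

```python
import random

def epsilon_greedy(q_hat, epsilon=0.1):
    """With probability epsilon explore uniformly at random; otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_hat))                # explore: uniform over arms
    return max(range(len(q_hat)), key=lambda a: q_hat[a])  # exploit: argmax_a Q(x, a)

next_action = epsilon_greedy([0.18, 0.52, 0.71], epsilon=0.1)
```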

Boltzmann exploration

  • Draw the next action $a_{t+1}$ from the Boltzmann (softmax) distribution $\pi_t(a) = \frac {\exp(\beta \hat{Q}_t(x, a))} {\sum_{a' \in \mathcal{A}} \exp(\beta \hat{Q}_t(x, a'))}$, where $\beta > 0$ is an inverse-temperature parameter, as sketched below
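
As a sketch of Boltzmann (softmax) sampling; the value of $\beta$ below is arbitrary, and subtracting the maximum estimate only improves numerical stability without changing the distribution:

```python
import math
import random

def boltzmann(q_hat, beta=2.0):
    """Sample an arm with probability proportional to exp(beta * Q(x, a))."""
    m = max(q_hat)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_hat]
    return random.choices(range(len(q_hat)), weights=weights, k=1)[0]

next_action = boltzmann([0.18, 0.52, 0.71], beta=2.0)
```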

References