The Multi-Armed Bandit Problem
Problem Setup
The multi-armed bandit problem can be described as a Markov decision process, a tuple $\langle \mathcal{X}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, with only one state.
- $\mathcal{X} = {x}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix
- $\mathcal{R}$ is a reward function, the immediate reward function $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$
- $\gamma$ is a discount factor $\gamma \in [0, 1]$. It is called undiscounted if $\gamma=1$.
$\pi : \mathcal{X} \to \mathcal{A}$ is the policy; $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the action-value function.
$$
\begin{align}
Q^\pi(x, a)
&= \mathbb{E} [\sum_{t=0}^{\infty} \gamma^t R_{t+1}\vert x, a] \newline
&= \mathbb{E} [\sum_{t=0}^{\infty} R_{t+1}\vert x, a] & \text{let}\ \gamma = 1
\end{align}
$$
The value function $V^\pi : \mathcal{X} \to \mathbb{R}$ is (when $\gamma=1$):
$$
\begin{align}
V^\pi(x) = \mathbb{E} [\sum_{t=0}^{\infty} R_{t+1}\vert x] \quad x \in \mathcal{X}
\end{align}
$$
The optimal value and action value function is:
$$
\begin{align}
V^*(x) = \sup_{a \in \mathcal{A}} Q^*(x, a) \quad x \in \mathcal{X}
\end{align}
$$
We can minimize the total regret as a proxy to maximize the cumulative reward:
Monte-Carlo Method
The idea is that one can estimate the value of a state by computing sample means
$$
\hat Q_{t}(x,a)=\frac {1} {N_t(x,a)} \sum_{\tau=0}^{t} r_{\tau} \unicode{x1D7D9}_{a_t = a}
$$
which can easily derive the so-called every-visit Monte-Carlo Method:
$$
\hat Q_{t+1}(x,a)=\hat Q_t(x,a)+\frac {1} {N_{t+1}(x,a)}(r_{t+1} - \hat Q_t(x,a)) \unicode{x1D7D9}_{a_t = a}
$$
Greedy Algorithm
$\epsilon$-greedy (exploration) strategy
- Let next action $a_{t+1} = \underset{a \in \mathcal{A}}{\arg\max} \hat{Q}_t(x, a)$ with probability $1-\epsilon$
- Draw next action $a_{t+1}$ randomly with probability $\epsilon$
Boltzmann exploration
- Draw next action $a_{t+1}$ from the multinomial distribution $\frac {exp(\beta \hat{Q}_t(x, a))} {\sum exp(\beta \hat{Q}_t(x, a))}$