Problem Setup

The multi-armed bandit problem can be described as a Markov decision process with only one state, given by the tuple $\langle \mathcal{X}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where

- $\mathcal{X} = \{x\}$ is a finite set of states (here containing the single state $x$)
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is the state-transition probability function
- $\mathcal{R}$ is the reward function
- $\gamma$ is the discount factor
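
As a concrete illustration of this one-state formulation, here is a minimal sketch assuming Bernoulli-distributed rewards; the class name `BernoulliBandit`, the method `pull`, and the specific arm probabilities are illustrative assumptions, not taken from the source.

```python
import random

class BernoulliBandit:
    """A k-armed bandit viewed as a one-state MDP: every action (arm pull)
    leaves the agent in the same single state x and yields a stochastic reward."""

    def __init__(self, success_probs):
        # One reward probability per arm; the action set A is {0, ..., k-1}.
        self.success_probs = list(success_probs)

    @property
    def num_arms(self):
        return len(self.success_probs)

    def pull(self, action):
        # R: reward of 1 with probability success_probs[action], else 0.
        # P: the "next state" is always the same single state x,
        # so no state variable needs to be tracked at all.
        return 1 if random.random() < self.success_probs[action] else 0


if __name__ == "__main__":
    # Hypothetical 3-armed bandit; arm 2 has the highest expected reward.
    bandit = BernoulliBandit([0.1, 0.5, 0.8])
    rewards = [bandit.pull(2) for _ in range(1000)]
    print(f"mean reward of arm 2: {sum(rewards) / len(rewards):.3f}")  # roughly 0.8
```

Because there is only one state, the transition function $\mathcal{P}$ is trivial and the environment reduces to a mapping from actions to reward distributions, which is exactly what the sketch encodes.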