The Multi-Armed Bandit Problem

Problem Setup

The multi-armed bandit problem can be described as a Markov decision process, a tuple $\langle \mathcal{X}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, with only one state.

- $\mathcal{X} = \{x\}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}