Abstract: We consider a simple reinforcement learning framework known as episodic Markov Decision Processes (MDPs), a repeated game in which a learner navigates an environment and must learn the underlying transition and reward functions with the goal of maximizing its cumulative reward.
We investigate the effect of corruption on both the transition and the reward functions. It has been shown that corrupting the transitions is more detrimental to the learner than corrupting the rewards: a linear amount of corruption of the transitions can force linear regret, whereas sublinear regret remains achievable even under a linear amount of corruption of the rewards (also known as the adversarial regime).
In this work, our goal is to disentangle these two notions of corruption and to propose methods that adapt optimally to each corruption measure.