Offline Reinforcement Learning

The goal of offline reinforcement learning (offline RL) is to learn near-optimal policies from previously collected data [1]. The following picture shows the difference among online RL, off-policy RL, and offline RL.

In many practical settings, the large amount of online interaction between agent and environment that RL normally requires is impractical, either because data collection is expensive (e.g., in robotics, educational agents, or healthcare) or because it is dangerous (e.g., in autonomous driving or healthcare). This is why offline RL is attracting growing attention.

The Difficulties of Offline RL

The most obvious challenge with offline RL is that, because the learning algorithm must rely entirely on the static dataset, there is no possibility of improving exploration: exploration is outside the scope of the algorithm.
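The reliance on a static dataset can be illustrated with a minimal sketch. The 3-state chain, the logged transitions, and the function name `offline_q_learning` below are all hypothetical and chosen only for illustration. The sketch runs tabular Q-learning that replays a fixed list of transitions and never interacts with an environment.

```python
import random


def offline_q_learning(dataset, n_states, n_actions,
                       gamma=0.9, lr=0.5, epochs=200, seed=0):
    """Tabular Q-learning that only replays a fixed list of transitions.

    Each transition is (state, action, reward, next_state, done).
    No environment is ever queried, so no new data can be gathered.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(epochs):
        # Visit the logged transitions in a random order each epoch.
        for s, a, r, s_next, done in rng.sample(dataset, len(dataset)):
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += lr * (target - q[s][a])
    return q


# Hypothetical dataset: the behavior policy always chose action 0.
dataset = [
    (0, 0, 0.0, 1, False),
    (1, 0, 0.0, 2, False),
    (2, 0, 1.0, 2, True),
]

q = offline_q_learning(dataset, n_states=3, n_actions=2)

# State-action pairs that never appear in the dataset.
visited = {(s, a) for s, a, *_ in dataset}
unseen = [(s, a) for s in range(3) for a in range(2) if (s, a) not in visited]

for s, a in unseen:
    print(f"Q[{s}][{a}] was never updated: {q[s][a]}")
```

In this sketch the entries for action 1 keep their initial value however long training runs, because the dataset contains no evidence about them. An online agent could try action 1 and find out; the offline learner cannot.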

Another challenge is that offline RL is fundamentally about making and answering counterfactual queries. We expect the learned policy to perform better than the behavior seen in the dataset, so it must execute a sequence of actions that differs in some way from that behavior. The fundamental difficulty with counterfactual queries is distributional shift: the learned policy is trained under one distribution but evaluated under a different one.
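A toy example may make distributional shift more concrete. Everything below is an assumption for illustration: a one-step problem with a scalar action, a hypothetical reward function `true_reward`, and a linear reward model fitted by least squares. The logged actions cover only part of the action range, so the greedy choice queries the model where it has seen no data.

```python
def true_reward(a):
    """Hypothetical ground-truth reward, peaked at a = 0.5."""
    return 1.0 - (a - 0.5) ** 2


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The behavior policy only ever tried actions in [0.0, 0.4].
logged_actions = [0.0, 0.1, 0.2, 0.3, 0.4]
logged_rewards = [true_reward(a) for a in logged_actions]

slope, intercept = fit_line(logged_actions, logged_rewards)


def predicted_reward(a):
    """Reward model trained on the logged (in-distribution) actions only."""
    return slope * a + intercept


# The learned policy is greedy over the full action range [0.0, 1.0].
candidates = [i / 10 for i in range(11)]
greedy_action = max(candidates, key=predicted_reward)
best_logged_action = max(logged_actions, key=true_reward)

print("greedy action:", greedy_action)
print("predicted reward:", predicted_reward(greedy_action))
print("true reward:", true_reward(greedy_action))
print("true reward of best logged action:", true_reward(best_logged_action))
```

The model is accurate near the logged actions, where reward rises with the action, and it extrapolates that trend far beyond the data. The greedy policy therefore picks an action outside the dataset, and its predicted reward there is much higher than its true reward. In this toy setup the result is also worse than the best action already in the dataset.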

References

[1] Levine, Sergey, et al. "Offline reinforcement learning: Tutorial, review, and perspectives on open problems." arXiv preprint arXiv:2005.01643 (2020).

[2] Schulman, J., Abbeel, P., and Chen, X. "Equivalence between policy gradients and soft Q-learning." arXiv preprint arXiv:1704.06440 (2017).

[3] Liu, Q. and Wang, D. "Stein variational gradient descent: A general purpose Bayesian inference algorithm." Advances in Neural Information Processing Systems, pp. 2370–2378 (2016).