Soft Reinforcement Learning: A Shallow Overview
What does "Soft" mean?
Soft Q-learning [1] is classical Q-learning with an entropy-regularization term. "Soft" is relative to the conventional approach, which uses a "hard" max in the Bellman equation. Compared with the "hard" approach, the "soft" one can provide several benefits [2]:
- Better Exploration
- Fine-tuning Maximum Entropy Policies
- Compositionality
- Robustness
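Traditional Bellman Equation
In an MDP, the optimal action-value function of a state-action pair $(s_t, a_t)$ satisfies the Bellman equation, which takes a "hard" max over the next action:
$$Q^*(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}} \left[ \max_{a'} Q^*(s_{t+1}, a') \right]$$
The optimal policy can be derived by:
$$\pi^*(s_t) = \arg\max_{a} Q^*(s_t, a)$$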
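The hard-max backup and the greedy policy extraction described above can be sketched on a small tabular MDP. This is only an illustration: the two-state transition tensor `P`, the reward table `R`, and the function names are hypothetical and do not come from [1] or [2].

```python
import numpy as np

def hard_q_iteration(P, R, gamma=0.9, n_iters=500):
    """Iterate Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * max_a' Q(s', a')."""
    q = np.zeros_like(R, dtype=float)
    for _ in range(n_iters):
        v = q.max(axis=1)            # hard max over next actions, shape (S,)
        q = R + gamma * (P @ v)      # (S, A, S) @ (S,) -> (S, A)
    return q

def greedy_policy(q):
    """Deterministic policy: pick the action with the largest Q-value in each state."""
    return q.argmax(axis=1)

# Hypothetical MDP: action 0 always leads to state 0, action 1 always leads to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

q_hard = hard_q_iteration(P, R)
print(q_hard)
print(greedy_policy(q_hard))
```

The greedy policy puts all probability on a single action per state, which is the behaviour the "soft" formulation relaxes.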
Maximum Entropy Policies
Entropy and Max Entropy Principle
Entropy is an old concept from physics that describes randomness. For a policy, the greater the entropy, the more random the actions it produces. The discrete form of entropy is:
$$H(X) = -\sum_{x} p(x) \log p(x)$$
Similarly, the entropy term for a policy has this form:
$$\mathcal{H}(\pi(\cdot \mid s_t)) = -\sum_{a} \pi(a \mid s_t) \log \pi(a \mid s_t)$$
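As a quick numerical illustration, the entropy of a discrete policy at one state can be computed directly from its action probabilities. The helper name `policy_entropy` and the example distributions are made up for this sketch; zero-probability actions are skipped, following the usual convention that 0 log 0 = 0.

```python
import numpy as np

def policy_entropy(probs):
    """Entropy -sum_a pi(a|s) * log pi(a|s) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]       # treat 0 * log(0) as 0
    return float(-(nonzero * np.log(nonzero)).sum())

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally random over four actions
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic

print(policy_entropy(uniform))
print(policy_entropy(peaked))
```

The uniform policy attains the largest possible entropy for four actions, log 4, while the peaked policy is close to zero.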
The entropy term helps the policy explore by assigning more probability to rarely chosen actions. To some extent, this can keep the agent from getting stuck in a local optimum and help it reach the global optimum. The idea of learning such maximum entropy models has its origin in statistical modeling, in which the goal is to find the probability distribution that has the highest entropy while still satisfying the observed statistics [1]. The principle of maximum entropy states that the probability distribution with the highest entropy is the one that best represents the current state of knowledge in the context of precisely stated prior data [4].
Stochastic Policy and Maximum Entropy Policies
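The conventional RL approach is to specify a unimodal policy distribution, centered at the maximal Q-value and extending to the neighbouring actions to provide noise for exploration. Usually, there is a sampling process that employs a Gaussian distribution. A maximum entropy policy instead maximizes the expected reward together with the entropy of the policy, weighted by a temperature $\alpha$ [1]:
$$\pi^*_{\mathrm{MaxEnt}} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right]$$
where $\rho_\pi$ is the state-action distribution induced by $\pi$. As shown in the paper, the optimal policy for this objective is given by
$$\pi^*_{\mathrm{MaxEnt}}(a_t \mid s_t) \propto \exp\left( \frac{1}{\alpha} Q^*_{\mathrm{soft}}(s_t, a_t) \right)$$
where the soft Q-function is
$$Q^*_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t) + \mathbb{E} \left[ \sum_{l=1}^{\infty} \gamma^{l} \left( r(s_{t+l}, a_{t+l}) + \alpha \, \mathcal{H}(\pi^*_{\mathrm{MaxEnt}}(\cdot \mid s_{t+l})) \right) \right]$$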
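For a discrete action set, an energy-based policy of this kind reduces to a softmax over Q-values. The sketch below assumes a temperature parameter `alpha` and a made-up Q-value vector; `boltzmann_policy` is an illustrative name, not an API from the paper.

```python
import numpy as np

def boltzmann_policy(q_values, alpha):
    """Action probabilities proportional to exp(Q(s, a) / alpha)."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits = logits - logits.max()   # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q_values = [1.0, 2.0, 0.5]           # hypothetical Q-values for three actions
for alpha in (10.0, 1.0, 0.1):
    print(alpha, boltzmann_policy(q_values, alpha).round(3))
```

A large temperature pushes the policy toward uniform, while a small temperature concentrates probability on the action with the highest Q-value, recovering the hard-max behaviour in the limit.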
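Soft Bellman Equation and Soft Q-Learning
The soft Bellman equation can be obtained by replacing the hard max over next actions with a soft value function at temperature $\alpha$:
$$Q^*_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}} \left[ V^*_{\mathrm{soft}}(s_{t+1}) \right]$$
where
$$V^*_{\mathrm{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q^*_{\mathrm{soft}}(s_t, a') \right) da'$$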
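In a small discrete MDP the integral over actions becomes a sum, so the soft Bellman backup can be iterated exactly. The following is a minimal sketch under that assumption; the two-state MDP (`P`, `R`) and the function names are hypothetical.

```python
import numpy as np

def soft_value(q, alpha):
    """Soft value V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    z = q / alpha
    z_max = z.max(axis=-1, keepdims=True)
    return alpha * (z_max.squeeze(-1) + np.log(np.exp(z - z_max).sum(axis=-1)))

def soft_q_iteration(P, R, gamma=0.9, alpha=1.0, n_iters=500):
    """Iterate the soft backup Q(s, a) = R(s, a) + gamma * E_s'[V_soft(s')]."""
    q = np.zeros_like(R, dtype=float)
    for _ in range(n_iters):
        v = soft_value(q, alpha)     # soft maximum over next actions, shape (S,)
        q = R + gamma * (P @ v)      # (S, A, S) @ (S,) -> (S, A)
    return q

# Hypothetical MDP: action 0 always leads to state 0, action 1 always leads to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

q_soft = soft_q_iteration(P, R, alpha=1.0)
q_nearly_hard = soft_q_iteration(P, R, alpha=0.01)
print(q_soft)
print(q_nearly_hard)
```

With a small temperature the result approaches the Q-values of ordinary hard-max value iteration, which shows how the soft backup generalizes the traditional one.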
Two Challenges in Continuous Domains
The first challenge is that exact dynamic programming is infeasible, since the soft Bellman equation needs to hold for every state and action, and the softmax involves integrating over the entire action space. The solution is to employ expressive neural network function approximators.
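To give a feel for the integral over actions, the soft value at a single state can be estimated by sampling. The sketch below uses importance sampling with a uniform proposal over a bounded one-dimensional action range; the quadratic `q_fn`, the bounds, and the sample count are assumptions made for illustration, not settings from [1].

```python
import numpy as np

def soft_value_importance_sampling(q_fn, alpha, low, high, n_samples, rng):
    """Estimate alpha * log of the integral of exp(q_fn(a) / alpha) over [low, high]."""
    actions = rng.uniform(low, high, size=n_samples)
    # Importance weight exp(Q / alpha) / q(a), with uniform density q(a) = 1 / (high - low).
    log_weights = q_fn(actions) / alpha + np.log(high - low)
    m = log_weights.max()            # shift for numerical stability
    return alpha * (m + np.log(np.mean(np.exp(log_weights - m))))

rng = np.random.default_rng(0)
q_fn = lambda a: -a ** 2             # hypothetical Q-function at a fixed state
estimate = soft_value_importance_sampling(q_fn, alpha=1.0, low=-5.0, high=5.0,
                                          n_samples=200_000, rng=rng)
print(estimate)
```

For this particular `q_fn` the integral is close to the Gaussian integral, so the estimate can be compared against 0.5 * log(pi); in higher dimensions a uniform proposal becomes very inefficient, which is part of why the problem is hard.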
The second challenge is that the optimal policy is defined by an intractable energy-based distribution, which is difficult to sample from. To address this, approximate inference techniques such as MCMC can be employed. To accelerate inference, amortized Stein variational gradient descent [3] can be used to train an inference network that generates approximate samples.
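The particle update behind Stein variational gradient descent can be sketched in a few lines. Note that this is plain (non-amortized) SVGD on a set of one-dimensional action particles, not the amortized variant that trains a sampling network; the quadratic Q-function, fixed RBF bandwidth, step size, and particle count are all assumptions chosen for illustration.

```python
import numpy as np

def svgd_step(particles, grad_log_p, step_size=0.05, bandwidth=1.0):
    """One SVGD update for 1-D particles with RBF kernel k(x, y) = exp(-(x - y)^2 / bandwidth)."""
    diff = particles[:, None] - particles[None, :]    # diff[j, i] = x_j - x_i
    kernel = np.exp(-diff ** 2 / bandwidth)           # kernel[j, i] = k(x_j, x_i)
    grad_kernel = -2.0 * diff / bandwidth * kernel    # derivative of k(x_j, x_i) w.r.t. x_j
    # Attractive term follows the target's score; repulsive term keeps particles spread out.
    phi = (kernel * grad_log_p(particles)[:, None] + grad_kernel).mean(axis=0)
    return particles + step_size * phi

# Target policy pi(a) proportional to exp(Q(a) / alpha) with hypothetical Q(a) = -(a - 2)^2.
alpha = 0.5
grad_log_p = lambda a: -2.0 * (a - 2.0) / alpha

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=50)
for _ in range(1000):
    particles = svgd_step(particles, grad_log_p)
print(particles.mean(), particles.std())
```

Only the gradient of Q with respect to the action is needed, so the normalizing constant of the energy-based distribution never has to be computed.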
References
[1] Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
[2] https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/
[3] Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2370–2378, 2016.
[4] https://towardsdatascience.com/entropy-regularization-in-reinforcement-learning-a6fa6d7598df