Soft Reinforcement Learning: A Shallow Overview

What does "Soft" mean?

Soft Q-learning [1] is classical Q-learning with an added entropy-regularization term. "Soft" is relative to the conventional approach, which uses a "hard" max in the Bellman equation. Compared with the "hard" approach, the "soft" one offers several benefits [2]:

Better Exploration
Fine-tuning Maximum Entropy Policies
Compositionality
Robustness

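Traditional Bellman Equation

In MDPs, the action-value function of a state-action pair $(s_t, a_t)$ under policy $\pi$, denoted as $Q^\pi(s_t, a_t)$, is the expected return when starting from $s_t$, taking action $a_t$, and following $\pi$ thereafter. With reward $r_t = r(s_t, a_t)$, discount factor $\gamma$, and transition dynamics $p_s$, the action-value Bellman equation and the Bellman optimality equation are defined as:

$$Q^\pi(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s,\, a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right]$$

$$Q^*(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\left[ \max_{a'} Q^*(s_{t+1}, a') \right]$$

The optimal policy can be derived by:

$$\pi^*(s_t) = \arg\max_{a} Q^*(s_t, a)$$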
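
As a rough illustration of the hard-max backup, the following sketch runs tabular Q-value iteration and then reads off the greedy policy. The two-state MDP, the function names, and the pure-Python list layout (`P[s][a][s2]`, `R[s][a]`) are assumptions made for this example only; they do not come from the cited paper.

```python
def q_value_iteration(P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Tabular Q-value iteration using the hard-max Bellman optimality backup.

    P[s][a][s2] is a transition probability and R[s][a] a reward
    (hypothetical inputs for this sketch).
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iters):
        V = [max(row) for row in Q]  # hard max over next actions
        Q_new = [
            [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        delta = max(abs(Q_new[s][a] - Q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        Q = Q_new
        if delta < tol:
            break
    return Q


def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a)."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Toy MDP (an assumption for illustration): action 0 always leads to state 0,
# action 1 always leads to state 1; only action 1 taken in state 1 pays reward 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0],
     [0.0, 1.0]]

Q_star = q_value_iteration(P, R)
print(Q_star)                 # approximately [[8.1, 9.0], [8.1, 10.0]]
print(greedy_policy(Q_star))  # [1, 1]
```

The greedy policy is deterministic: it puts all probability on a single action per state, which is exactly what the "soft" formulation relaxes.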

Maximum Entropy Policies

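Entropy and Max Entropy Principle

Entropy is an old concept from physics that is used to describe the randomness of a system. For a policy, the greater the entropy, the more random the actions it selects. The discrete form of entropy is:

$$H(X) = -\sum_{x} p(x) \log p(x)$$

Similarly, the entropy term for a policy $\pi$ at state $s_t$ has this form (for continuous actions the sum becomes an integral over the action space):

$$H(\pi(\cdot \mid s_t)) = -\sum_{a} \pi(a \mid s_t) \log \pi(a \mid s_t)$$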
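
To make the link between entropy and randomness concrete, here is a small sketch that evaluates the discrete entropy of three action distributions. The function name and the example probabilities are made up for illustration, and the natural logarithm is assumed.

```python
import math


def entropy(probs):
    """Shannon entropy H = -sum_i p_i * log(p_i), with 0 * log(0) treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]       # fully random over four actions
peaked = [0.97, 0.01, 0.01, 0.01]        # almost always picks the first action
deterministic = [1.0, 0.0, 0.0, 0.0]     # always picks the first action

print(entropy(uniform))        # log(4), about 1.386
print(entropy(peaked))         # about 0.168
print(entropy(deterministic))  # 0.0
```

The uniform policy attains the largest entropy and the deterministic one the smallest, which is why an entropy bonus pushes a policy away from committing to a single action too early.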

The entropy term helps the policy explore by assigning more probability to rarely chosen actions. This can, to some extent, keep the agent from getting stuck in a local optimum and help it reach the global optimum. The idea of learning such maximum entropy models has its origin in statistical modeling, in which the goal is to find the probability distribution that has the highest entropy while still satisfying the observed statistics [1]. The principle of maximum entropy states that the probability distribution with the highest entropy is the one that best represents the current state of knowledge in the context of precisely stated prior data [4].

Stochastic Policy and Maximum Entropy Policies

The conventional RL approach is to specify a unimodal policy distribution, centered at the maximal Q-value and extending to the neighbouring actions to provide noise for exploration. Usually, actions are sampled from a Gaussian distribution, e.g. $\pi(a_t \mid s_t) = \mathcal{N}(\mu(s_t), \sigma^2)$. This makes the policy ignore regions of the action space with lower Q-values and thus reduces its ability to explore. Is there a way to solve this problem? Yes! An obvious solution is to ensure the agent explores all promising states while prioritizing the more promising ones. One way to formalize this idea is to define the policy directly in terms of exponentiated Q-values: $\pi(a_t \mid s_t) \propto \exp Q(s_t, a_t)$. This density has the form of the Boltzmann distribution. What's more, a policy defined through this energy-based form is an optimal solution for the maximum-entropy RL objective

$$\pi^*_{\mathrm{MaxEnt}} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right]$$

where $\rho_\pi$ is the state-action distribution induced by $\pi$ and the temperature $\alpha$ sets the weight of the entropy term relative to the reward.

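As shown in the paper [1], the optimal policy for this objective is given by

$$\pi^*_{\mathrm{MaxEnt}}(a_t \mid s_t) = \exp\left( \frac{1}{\alpha} \left( Q^*_{\mathrm{soft}}(s_t, a_t) - V^*_{\mathrm{soft}}(s_t) \right) \right)$$

where the soft Q-function and the soft value function are defined by:

$$Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \mathbb{E}_{(s_{t+1}, \ldots) \sim \rho_\pi}\left[ \sum_{l=1}^{\infty} \gamma^l \left( r_{t+l} + \alpha H(\pi^*_{\mathrm{MaxEnt}}(\cdot \mid s_{t+l})) \right) \right]$$

$$V^*_{\mathrm{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q^*_{\mathrm{soft}}(s_t, a') \right) da'$$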
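
The relationship between the soft value function and the energy-based policy can be sketched for a finite action set, where the integral over actions becomes a sum. This discretization, the temperature parameter `alpha`, and the Q-values below are assumptions for illustration rather than the paper's continuous-action setup.

```python
import math


def soft_value(q_values, alpha=1.0):
    """Discrete analogue of V_soft(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    m = max(q_values)  # factor out the max so exp() cannot overflow
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def max_ent_policy(q_values, alpha=1.0):
    """pi(a|s) = exp((Q(s, a) - V_soft(s)) / alpha); V_soft acts as the log-normalizer."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


# Hypothetical Q-values at one state with two good actions (indices 1 and 4).
q_values = [0.0, 2.0, 0.5, 0.0, 1.9, 0.0]

for alpha in (0.01, 1.0, 100.0):
    probs = max_ent_policy(q_values, alpha)
    print(alpha, round(soft_value(q_values, alpha), 3), [round(p, 3) for p in probs])
```

With a small temperature the soft value approaches the hard max and the policy becomes nearly greedy; with a large temperature the policy approaches uniform. At `alpha=1.0` both good actions keep substantial probability, which a unimodal policy centered on the single best action would not give.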


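Soft Bellman Equation and Soft Q-Learning

The soft Bellman equation can be obtained as

$$Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\left[ V^*_{\mathrm{soft}}(s_{t+1}) \right]$$

where $V^*_{\mathrm{soft}}(s_{t+1}) = \alpha \log \int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q^*_{\mathrm{soft}}(s_{t+1}, a') \right) da'$. The soft Bellman backup is a contraction, which means that applying it repeatedly converges to the optimal soft Q-function. We can therefore adapt conventional algorithms such as Q-learning to the soft setting by replacing the hard max with this soft maximum.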
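
A minimal sketch of soft Q-iteration on a tabular problem shows the backup and its contraction behaviour. It assumes a finite action set (so the integral becomes a log-sum-exp), and the toy MDP and all function names are hypothetical. This is the exact dynamic-programming form, not the sampled, function-approximation version used in the paper.

```python
import math


def soft_value(q_row, alpha):
    """alpha * log sum_a exp(Q(s, a) / alpha): a soft maximum over a finite action set."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))


def soft_backup(Q, P, R, alpha, gamma):
    """One application of the soft Bellman backup to a tabular Q."""
    V = [soft_value(row, alpha) for row in Q]
    return [
        [R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
         for a in range(len(R[s]))]
        for s in range(len(R))
    ]


def sup_distance(Q1, Q2):
    """Largest absolute entrywise difference between two Q-tables."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))


def soft_q_iteration(P, R, alpha=1.0, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Repeat the soft backup until the Q-table stops changing."""
    Q = [[0.0] * len(row) for row in R]
    for _ in range(max_iters):
        Q_new = soft_backup(Q, P, R, alpha, gamma)
        converged = sup_distance(Q_new, Q) < tol
        Q = Q_new
        if converged:
            break
    return Q


# Toy MDP (an assumption for illustration): action 0 always leads to state 0,
# action 1 always leads to state 1; only action 1 taken in state 1 pays reward 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0],
     [0.0, 1.0]]
gamma = 0.9

Q_soft = soft_q_iteration(P, R, alpha=1.0, gamma=gamma)
Q_near_hard = soft_q_iteration(P, R, alpha=0.01, gamma=gamma)
print(Q_soft)
print(Q_near_hard)  # close to the hard-max solution [[8.1, 9.0], [8.1, 10.0]]
```

Because each backup shrinks the sup-norm distance between any two Q-tables by at least a factor of `gamma`, the iteration converges from any starting table. The soft values at `alpha=1.0` are larger than the near-hard ones because they also count the discounted entropy bonus.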

Two Challenges in Continuous Domains

The first challenge is that exact dynamic programming is infeasible, since the soft Bellman equation needs to hold for every state and action, and the softmax involves integrating over the entire action space. The solution is to employ expressive neural network function approximators.

The second challenge is that the optimal policy is defined by an intractable energy-based distribution, which is difficult to sample from. To address this problem, one can employ approximate inference techniques, such as MCMC. To accelerate inference, amortized Stein variational gradient descent [3] can be used to train an inference network that generates approximate samples.
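
To give a feel for the MCMC option, the sketch below draws actions from a density proportional to the exponentiated Q-value using plain random-walk Metropolis. This is a generic sampler swapped in for illustration, not the amortized Stein variational method of [3]; the bimodal `q_function`, the step size, and the temperature `alpha` are all assumptions.

```python
import math
import random


def q_function(a):
    """Hypothetical Q(s, a) at a fixed state: two equally good actions at -1 and +1."""
    return -4.0 * min((a - 1.0) ** 2, (a + 1.0) ** 2)


def metropolis_actions(q, alpha=1.0, n_samples=20_000, burn_in=1_000, step=1.0, seed=0):
    """Random-walk Metropolis sampler for pi(a) proportional to exp(q(a) / alpha)."""
    rng = random.Random(seed)
    a = 0.0
    samples = []
    for i in range(burn_in + n_samples):
        proposal = a + rng.gauss(0.0, step)
        # Log of the unnormalized density ratio; the normalizer cancels out.
        log_ratio = (q(proposal) - q(a)) / alpha
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            a = proposal  # accept; otherwise keep the current action
        if i >= burn_in:
            samples.append(a)
    return samples


samples = metropolis_actions(q_function)
frac_positive = sum(a > 0 for a in samples) / len(samples)
mean_abs = sum(abs(a) for a in samples) / len(samples)
print(frac_positive, mean_abs)
```

The sampler only needs Q-value differences, so the intractable normalizing constant never has to be computed. The cost is that every action requires a chain of sequential steps, which is the motivation for training a network that produces approximate samples in a single forward pass.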

References

[1] Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

[2] https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/

[3] Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2370–2378, 2016.

[4] https://towardsdatascience.com/entropy-regularization-in-reinforcement-learning-a6fa6d7598df