Soft Reinforcement Learning: More Details

Soft Q-learning learns policies for continuous states and actions. The soft Bellman backup is a contraction, so soft Q-iteration converges to the optimal soft Q-function, from which the optimal policy follows. Here, we look into the proof.

Soft Q-iteration

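Theorem (soft Q-iteration). Let $Q_{\text{soft}}(\cdot, \cdot)$ and $V_{\text{soft}}(\cdot)$ be bounded and assume that $\int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(\cdot, a')\right) da' < \infty$ and that $Q^*_{\text{soft}} < \infty$ exists. Then the fixed-point iteration

$$Q_{\text{soft}}(s_t, a_t) \leftarrow r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\left[V_{\text{soft}}(s_{t+1})\right], \quad \forall s_t, a_t$$

$$V_{\text{soft}}(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(s_t, a')\right) da', \quad \forall s_t$$

converges to $Q^*_{\text{soft}}$ and $V^*_{\text{soft}}$, respectively.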
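With a finite action set, the soft value reduces to a temperature-scaled log-sum-exp of the Q-values, so the fixed-point iteration can be run directly on a table. The following is a minimal sketch under that assumption; the two-state MDP (`REWARDS`, `TRANSITIONS`) and the function names are hypothetical and only serve as an illustration.

```python
import math


def soft_value(q_row, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))


def soft_q_iteration(rewards, transitions, gamma=0.9, alpha=1.0,
                     tol=1e-10, max_iters=10_000):
    """Tabular soft Q-iteration.

    rewards[s][a] is r(s, a); transitions[s][a][s2] is p(s2 | s, a).
    Returns the Q table and the soft values derived from it.
    """
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iters):
        v = [soft_value(row, alpha) for row in q]
        # Q(s, a) <- r(s, a) + gamma * E_{s'}[V(s')]
        new_q = [
            [rewards[s][a]
             + gamma * sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        delta = max(abs(new_q[s][a] - q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        q = new_q
        if delta < tol:
            break
    return q, [soft_value(row, alpha) for row in q]


# Hypothetical two-state, two-action MDP used only for illustration.
REWARDS = [[1.0, 0.0], [0.0, 2.0]]
TRANSITIONS = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]

if __name__ == "__main__":
    q_star, v_star = soft_q_iteration(REWARDS, TRANSITIONS)
    print(q_star, v_star)
```

The continuous-action case replaces the sum with an integral, which is what makes sampling and value estimation hard in practice.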

"Soft" Bellman Operator

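The soft value iteration operator $\mathcal{T}$ is defined as

$$\mathcal{T} Q(s, a) \triangleq r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_s}\left[\alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q(s', a')\right) da'\right]$$

This is a contraction mapping. The result is proved in [1]: with the norm $\|Q_1 - Q_2\| \triangleq \max_{s,a} |Q_1(s, a) - Q_2(s, a)|$, we get $\|\mathcal{T} Q_1 - \mathcal{T} Q_2\| \le \gamma \|Q_1 - Q_2\|$. Since the soft Bellman backup is a contraction, the optimal value function is the fixed point of the Bellman backup, and it can be found by optimizing for a Q-function whose soft Bellman error is minimized at all states and actions.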
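The contraction property can be checked numerically in the tabular case. The sketch below assumes a finite action set (so the soft value is a log-sum-exp), builds a random hypothetical MDP, applies the soft backup to two random Q tables, and reports by how much their max-norm distance shrinks. All names here are illustrative.

```python
import math
import random


def soft_backup(q, rewards, transitions, gamma, alpha):
    """One application of the soft Bellman backup to a Q table."""
    def soft_value(row):
        m = max(row)
        return m + alpha * math.log(sum(math.exp((x - m) / alpha) for x in row))

    v = [soft_value(row) for row in q]
    return [
        [rewards[s][a]
         + gamma * sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
         for a in range(len(q[s]))]
        for s in range(len(q))
    ]


def max_norm(q1, q2):
    """max over (s, a) of |Q1(s, a) - Q2(s, a)|."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


def contraction_ratio(n_states=4, n_actions=3, gamma=0.9, alpha=0.5, seed=0):
    """Return ||T Q1 - T Q2|| / ||Q1 - Q2|| for random tables."""
    rng = random.Random(seed)
    rewards = [[rng.uniform(-1, 1) for _ in range(n_actions)]
               for _ in range(n_states)]
    transitions = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            per_action.append([w / total for w in weights])
        transitions.append(per_action)
    q1 = [[rng.uniform(-5, 5) for _ in range(n_actions)] for _ in range(n_states)]
    q2 = [[rng.uniform(-5, 5) for _ in range(n_actions)] for _ in range(n_states)]
    before = max_norm(q1, q2)
    after = max_norm(soft_backup(q1, rewards, transitions, gamma, alpha),
                     soft_backup(q2, rewards, transitions, gamma, alpha))
    return after / before


if __name__ == "__main__":
    print([round(contraction_ratio(seed=s), 3) for s in range(5)])
```

Every ratio should be at most `gamma`, which is the property that makes repeated backups converge to a single fixed point.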

Sampling from the Soft Q-function

Soft Q-learning needs to sample from the policy, $\pi(a_t \mid s_t) \propto \exp\left(\frac{1}{\alpha} Q^{\theta}_{\text{soft}}(s_t, a_t)\right)$, both to take on-policy actions and to generate action samples for estimating the soft value function. Because the policy has such a general form, direct sampling is intractable. The paper instead samples approximately, using Stein variational gradient descent (SVGD) [3]. First, they learn a state-conditioned stochastic neural network $a_t = f^{\phi}(\xi; s_t)$, parametrized by $\phi$, that maps noise samples $\xi$ drawn from a distribution into unbiased action samples from the target EBM corresponding to $Q^{\theta}_{\text{soft}}$. Second, writing the induced distribution of the actions as $\pi^{\phi}(a_t \mid s_t)$, the goal is to find parameters $\phi$ so that the induced distribution approximates the energy-based distribution in terms of the KL divergence. We will list the details of SVGD in another blog post. The paper also relates its algorithm to policy-gradient methods; see [2] for details (we may cover this in another post).

Soft Actor-Critic Demystified ([4])

SAC is defined for RL tasks involving continuous actions. The biggest feature of SAC is that it uses a modified RL objective function. Instead of only seeking to maximize the lifetime rewards, SAC seeks to also maximize the entropy of the policy. A high entropy in the policy will explicitly encourage exploration, to ensure that it does not collapse into repeatedly selecting a particular action that could exploit some inconsistency in the approximated Q-function.
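To make the entropy bonus concrete, here is a small sketch for a discrete action set. It assumes a softmax policy over hypothetical Q-values with a temperature `alpha` (a symbol introduced here for illustration) and shows that a higher temperature spreads probability over more actions, while the entropy-regularized one-step objective is maximized by that softmax policy.

```python
import math


def softmax_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha), computed stably."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_objective(probs, q_values, alpha):
    """One-step entropy-regularized objective: E_pi[Q] + alpha * H(pi)."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)


if __name__ == "__main__":
    qs = [1.0, 2.0, 0.5]  # hypothetical Q-values for three actions
    for alpha in (0.1, 1.0, 5.0):
        pi = softmax_policy(qs, alpha)
        print(alpha, [round(p, 3) for p in pi], round(entropy(pi), 3))
```

At a low temperature the policy is close to greedy and its entropy is near zero; at a high temperature it approaches uniform, which is the exploration effect described above.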

SAC architecture

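The optimization objective:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[r(s_t, a_t) + \alpha\, \mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right]$$

where the temperature $\alpha$ weights the entropy term against the reward.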

SAC makes use of three networks:

a state value function $V_{\psi}(s_t)$ parameterized by $\psi$
a soft Q-function $Q_{\theta}(s_t, a_t)$ parameterized by $\theta$
and a policy function $\pi_{\phi}(a_t \mid s_t)$ parameterized by $\phi$

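We train the value network by minimizing the following error (with the temperature absorbed into the reward scale), where $\mathcal{D}$ is the replay buffer:

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\frac{1}{2}\left(V_{\psi}(s_t) - \mathbb{E}_{a_t \sim \pi_{\phi}}\left[Q_{\theta}(s_t, a_t) - \log \pi_{\phi}(a_t \mid s_t)\right]\right)^2\right]$$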

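The derivative of the above objective is approximated by the following estimator, which is used to update the parameters of the V function:

$$\hat{\nabla}_{\psi} J_V(\psi) = \nabla_{\psi} V_{\psi}(s_t)\left(V_{\psi}(s_t) - Q_{\theta}(s_t, a_t) + \log \pi_{\phi}(a_t \mid s_t)\right)$$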
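A minimal sketch of the value update for a single sample, assuming the SAC value target $Q_{\theta}(s_t, a_t) - \log \pi_{\phi}(a_t \mid s_t)$ with the temperature fixed at 1. The network output is treated as a plain number, so the returned gradient is with respect to the prediction $V_{\psi}(s_t)$ itself; a real implementation would chain it through $\nabla_{\psi} V_{\psi}(s_t)$ with an autodiff library. The function names are hypothetical.

```python
def value_target(q_value, log_prob):
    """Soft value target: Q(s, a) - log pi(a | s), with a sampled from pi."""
    return q_value - log_prob


def value_loss_and_grad(v_pred, q_value, log_prob):
    """Return (0.5 * residual^2, d loss / d v_pred) for one sample."""
    residual = v_pred - value_target(q_value, log_prob)
    return 0.5 * residual ** 2, residual


def batch_value_loss(v_preds, q_values, log_probs):
    """Average the per-sample loss over a replay-buffer minibatch."""
    losses = [value_loss_and_grad(v, q, lp)[0]
              for v, q, lp in zip(v_preds, q_values, log_probs)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # V predicts 1.0, Q is 2.0, log-probability of the sampled action is -0.5.
    print(value_loss_and_grad(1.0, 2.0, -0.5))
```

Because $\log \pi$ is negative for unlikely actions, the target rewards states where the policy is still spread out, which is how the entropy term enters the value function.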

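We train the Q network by minimizing the following error:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^2\right]$$

where

$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\bar{\psi}}(s_{t+1})\right]$$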

For all state-action pairs in the experience replay buffer, we want to minimize the squared difference between the prediction of our Q function and the immediate (one time-step) reward plus the discounted expected value of the next state. $V_{\bar{\psi}}$ is the target value function.

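We use the following approximation of the derivative of the above objective to update the parameters of the Q function:

$$\hat{\nabla}_{\theta} J_Q(\theta) = \nabla_{\theta} Q_{\theta}(s_t, a_t)\left(Q_{\theta}(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1})\right)$$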
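The Q update can be sketched for a single transition as follows. This assumes the one-step target described above (reward plus discounted target value of the next state) and again returns the gradient with respect to the prediction rather than the network weights. The `polyak_update` helper is an assumption added for illustration: it shows one common way to maintain a target network as an exponential moving average of the online weights, with a hypothetical rate `tau`.

```python
def q_target(reward, next_value_target, gamma=0.99):
    """One-step target: r(s, a) + gamma * V_target(s')."""
    return reward + gamma * next_value_target


def q_loss_and_grad(q_pred, reward, next_value_target, gamma=0.99):
    """Return (0.5 * residual^2, d loss / d q_pred) for one transition.

    The target is held fixed, so no gradient flows through V_target.
    """
    residual = q_pred - q_target(reward, next_value_target, gamma)
    return 0.5 * residual ** 2, residual


def polyak_update(target_params, online_params, tau=0.005):
    """Move target weights a fraction tau toward the online weights."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    print(q_loss_and_grad(q_pred=3.0, reward=1.0, next_value_target=2.0, gamma=0.5))
    print(polyak_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

Using a slowly moving target keeps the regression target from shifting with every gradient step on the value network.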

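We train the policy network by minimizing the following error:

$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[D_{\mathrm{KL}}\left(\pi_{\phi}(\cdot \mid s_t) \,\Big\|\, \frac{\exp\left(Q_{\theta}(s_t, \cdot)\right)}{Z_{\theta}(s_t)}\right)\right]$$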

To minimize this objective, the authors use the reparameterization trick. It makes sampling from the policy a differentiable operation, so that errors can be backpropagated through the sampled action. The policy is now parameterized as follows:

$$a_t = f_{\phi}(\epsilon_t; s_t)$$

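The epsilon term $\epsilon_t$ is a noise vector sampled from a Gaussian distribution. Now we can express the objective function as follows:

$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[\log \pi_{\phi}\left(f_{\phi}(\epsilon_t; s_t) \mid s_t\right) - Q_{\theta}\left(s_t, f_{\phi}(\epsilon_t; s_t)\right)\right]$$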

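The normalizing function $Z_{\theta}(s_t)$ is dropped since it does not depend on the parameter $\phi$. An unbiased estimator for the gradient of the above objective is given as follows:

$$\hat{\nabla}_{\phi} J_{\pi}(\phi) = \nabla_{\phi} \log \pi_{\phi}(a_t \mid s_t) + \left(\nabla_{a_t} \log \pi_{\phi}(a_t \mid s_t) - \nabla_{a_t} Q(s_t, a_t)\right) \nabla_{\phi} f_{\phi}(\epsilon_t; s_t)$$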
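The reparameterized gradient can be worked through by hand in a one-dimensional toy setting. The sketch below assumes a state-independent Gaussian policy with parameters `mu` and `log_std`, so that $a = \mu + \sigma \epsilon$, and a hypothetical quadratic critic $Q(a) = -(a - c)^2$; neither comes from the source. Each sample combines the explicit derivative of $\log \pi$ with the pathwise term through the action, and the Monte Carlo average can be compared with the closed-form gradient, which is $2(\mu - c)$ for `mu` and $2\sigma^2 - 1$ for `log_std` in this toy case.

```python
import math
import random


def dq_da(a, c=2.0):
    """Derivative of the hypothetical critic Q(a) = -(a - c)^2."""
    return -2.0 * (a - c)


def policy_grad_sample(mu, log_std, eps, c=2.0):
    """Single-sample gradient of log pi(a) - Q(a) w.r.t. (mu, log_std)."""
    std = math.exp(log_std)
    a = mu + std * eps                       # reparameterized action f(eps)
    # Explicit derivatives of log pi(a) with the action held fixed.
    dlogp_dmu = (a - mu) / std ** 2
    dlogp_dlogstd = ((a - mu) / std) ** 2 - 1.0
    # Derivative of log pi(a) with respect to the action.
    dlogp_da = -(a - mu) / std ** 2
    # Derivatives of the action with respect to the parameters.
    da_dmu = 1.0
    da_dlogstd = std * eps
    pathwise = dlogp_da - dq_da(a, c)
    return (dlogp_dmu + pathwise * da_dmu,
            dlogp_dlogstd + pathwise * da_dlogstd)


def estimate_policy_grad(mu, log_std, n_samples=20_000, seed=0, c=2.0):
    """Average the single-sample estimator over Gaussian noise draws."""
    rng = random.Random(seed)
    g_mu = g_log_std = 0.0
    for _ in range(n_samples):
        gm, gl = policy_grad_sample(mu, log_std, rng.gauss(0.0, 1.0), c)
        g_mu += gm
        g_log_std += gl
    return g_mu / n_samples, g_log_std / n_samples


if __name__ == "__main__":
    print(estimate_policy_grad(mu=0.0, log_std=math.log(0.5)))
```

In the `mu` direction the explicit and pathwise derivatives of $\log \pi$ cancel, leaving only the critic's gradient, which is why the estimator pushes the mean toward actions with higher Q.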

References

[1] Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conf. on Uncertainty in Artificial Intelligence, 2016.

[2] Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

[3] Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2370–2378, 2016.

[4] https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665