Soft Reinforcement Learning: More Details
Soft Q-learning learns policies over continuous states and actions. Its backup operator is a contraction, so repeated application converges to the optimal soft Q-function and hence to the optimal policy. Here we look at the proof.
Soft Q-iteration
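Theorem (soft Q-iteration). Let $Q_{\text{soft}}(\cdot,\cdot)$ and $V_{\text{soft}}(\cdot)$ be bounded, and assume that $\int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(\cdot, a')\right) da' < \infty$ and that $Q^*_{\text{soft}} < \infty$ exists. Then the fixed-point iteration
$$Q_{\text{soft}}(s_t, a_t) \leftarrow r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\left[V_{\text{soft}}(s_{t+1})\right], \quad \forall s_t, a_t$$
$$V_{\text{soft}}(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(s_t, a')\right) da', \quad \forall s_t$$
converges to $Q^*_{\text{soft}}$ and $V^*_{\text{soft}}$, respectively.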
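The soft Q-iteration backup alternates two updates: Q(s, a) is set to r + γ E[V(s')], and V(s) is set to α log ∫ exp(Q(s, a')/α) da'. The sketch below runs this iteration on a tabular problem, assuming a finite action set so that the integral becomes a sum. The two-state MDP (`REWARDS`, `TRANSITIONS`) and the function names are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical toy MDP: 2 states, 2 actions. Action 0 leads to state 0, action 1 to state 1.
REWARDS = [[1.0, 0.0], [0.0, 2.0]]
TRANSITIONS = [[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 0.0], [0.0, 1.0]]]


def soft_value(q_row, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), shifted by the max for stability."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))


def soft_q_iteration(rewards, transitions, gamma=0.9, alpha=1.0, tol=1e-10, max_iters=10_000):
    """Repeat the soft backup until the largest change in Q falls below tol."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for n_iters in range(1, max_iters + 1):
        v = [soft_value(q[s], alpha) for s in range(n_states)]
        new_q = [
            [rewards[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        delta = max(abs(new_q[s][a] - q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        q = new_q
        if delta < tol:
            break
    return q, n_iters


q_star, iterations_used = soft_q_iteration(REWARDS, TRANSITIONS)
```

Because each backup shrinks the error by at most a factor of γ, the loop stops well before `max_iters` for γ = 0.9.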
"Soft" Bellman Operator
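Soft value iteration operator:
$$\mathcal{T} Q(s, a) \triangleq r(s, a) + \gamma\, \mathbb{E}_{s' \sim p_s}\left[\alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q(s', a')\right) da'\right]$$
This is a contraction mapping. The result is proved in [1]: for any two bounded Q-functions $Q_1$ and $Q_2$,
$$\|\mathcal{T} Q_1 - \mathcal{T} Q_2\|_\infty \le \gamma\, \|Q_1 - Q_2\|_\infty, \qquad \|Q_1 - Q_2\|_\infty = \max_{s, a} |Q_1(s, a) - Q_2(s, a)|$$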
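The contraction property says that one application of the soft backup brings any two Q-functions closer in the sup norm by at least a factor of γ. A quick numerical check, assuming a finite state and action space and a randomly generated MDP (all names and sizes here are hypothetical), might look like this:

```python
import math
import random


def soft_backup(q, rewards, transitions, gamma=0.9, alpha=1.0):
    """Apply the soft backup once: r(s, a) + gamma * E_{s'}[alpha * log sum_a' exp(Q(s', a') / alpha)]."""
    v = []
    for row in q:
        m = max(row)
        v.append(m + alpha * math.log(sum(math.exp((x - m) / alpha) for x in row)))
    return [
        [rewards[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
         for a in range(len(q[s]))]
        for s in range(len(q))
    ]


def sup_distance(q1, q2):
    """Largest absolute entry-wise difference between two Q tables."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


def random_table(rng, n_states, n_actions, scale=5.0):
    return [[rng.uniform(-scale, scale) for _ in range(n_actions)] for _ in range(n_states)]


def random_transitions(rng, n_states, n_actions):
    """Random next-state distributions, one per (state, action) pair."""
    table = []
    for _ in range(n_states):
        row = []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            row.append([w / total for w in weights])
        table.append(row)
    return table


def contraction_ratios(n_trials=100, n_states=3, n_actions=2, gamma=0.9, seed=0):
    """Return ||T Q1 - T Q2|| / ||Q1 - Q2|| for random pairs of Q tables."""
    rng = random.Random(seed)
    rewards = random_table(rng, n_states, n_actions, scale=1.0)
    transitions = random_transitions(rng, n_states, n_actions)
    ratios = []
    for _ in range(n_trials):
        q1 = random_table(rng, n_states, n_actions)
        q2 = random_table(rng, n_states, n_actions)
        before = sup_distance(q1, q2)
        after = sup_distance(soft_backup(q1, rewards, transitions, gamma),
                             soft_backup(q2, rewards, transitions, gamma))
        ratios.append(after / before)
    return ratios
```

Every ratio returned by `contraction_ratios()` should be at most γ. This is a sanity check of the inequality on sampled inputs, not a substitute for the proof.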
Sampling from the Soft Q-function
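Soft Q-learning needs to sample from the policy,
$$\pi(a_t \mid s_t) \propto \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(s_t, a_t)\right)$$
which is an energy-based distribution with no closed-form sampler for continuous actions. Stein variational gradient descent [3] can be used to train a network that produces approximate samples.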
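For intuition, the policy in soft Q-learning puts probability proportional to exp(Q(s, a)/α) on each action. With a finite action set this is a softmax and can be sampled exactly. The sketch below assumes that discrete simplification, which sidesteps the difficulty of the continuous case. The function names are hypothetical.

```python
import math
import random


def boltzmann_policy(q_values, alpha):
    """Return action probabilities proportional to exp(Q / alpha)."""
    m = max(q_values)  # subtract the max so the exponentials cannot overflow
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def sample_action(q_values, alpha, rng):
    """Draw one action index from the Boltzmann policy."""
    probs = boltzmann_policy(q_values, alpha)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


rng = random.Random(0)
q_values = [1.0, 2.0, 3.0]
greedy_like = boltzmann_policy(q_values, alpha=0.01)    # nearly all mass on the best action
uniform_like = boltzmann_policy(q_values, alpha=100.0)  # nearly uniform
action = sample_action(q_values, alpha=1.0, rng=rng)
```

A small α makes the policy close to greedy, while a large α pushes it toward uniform exploration.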
Soft Actor-Critic Demystified ([4])
SAC is defined for RL tasks with continuous actions. Its defining feature is a modified RL objective: instead of only maximizing lifetime reward, SAC also maximizes the entropy of the policy. High policy entropy explicitly encourages exploration and keeps the policy from collapsing into repeatedly selecting one action that exploits an inconsistency in the approximated Q-function.
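To make the entropy bonus concrete, the sketch below scores a short trajectory by summing, at each step, the reward plus a temperature α times the entropy of the policy at that state. It assumes discrete action distributions for simplicity. The value of `alpha` and the example policies are made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_return(rewards, policies, alpha=0.2):
    """Sum over time steps of r_t + alpha * H(pi(. | s_t))."""
    return sum(r + alpha * entropy(pi) for r, pi in zip(rewards, policies))


rewards = [1.0, 1.0]
exploring = [[0.5, 0.5], [0.5, 0.5]]   # uniform policy at both steps
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # deterministic policy at both steps

score_exploring = entropy_regularized_return(rewards, exploring)
score_collapsed = entropy_regularized_return(rewards, collapsed)
```

With equal rewards, the exploring policy scores higher, which is the pressure that discourages premature collapse onto a single action.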
SAC architecture
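The optimization objective:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\, \mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right]$$
where $\mathcal{H}$ is the entropy and the temperature $\alpha$ sets how much the entropy term counts relative to the reward.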
SAC makes use of three networks:
- a state value function $V_\psi(s_t)$ parameterized by $\psi$
- a soft Q-function $Q_\theta(s_t, a_t)$ parameterized by $\theta$
- and a policy function $\pi_\phi(a_t \mid s_t)$ parameterized by $\phi$
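We train the value network by minimizing the following error, where $\mathcal{D}$ is the experience replay buffer (the entropy temperature is taken as 1 here, i.e. absorbed into the reward scale):
$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\frac{1}{2}\left(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\right]\right)^2\right]$$
To update the parameters of the V function, we use the following approximation of the gradient of this objective, with actions sampled from the current policy rather than from the replay buffer:
$$\hat{\nabla}_\psi J_V(\psi) = \nabla_\psi V_\psi(s_t)\left(V_\psi(s_t) - Q_\theta(s_t, a_t) + \log \pi_\phi(a_t \mid s_t)\right)$$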
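The value update regresses V(s) toward Q(s, a) minus log π(a|s), with the action a drawn from the current policy. The sketch below works on plain numbers instead of network outputs, so it shows only the target, the loss, and the residual that multiplies the gradient of V. The function names and inputs are hypothetical.

```python
def value_loss_and_residual(v_pred, q_pred, log_prob):
    """Per-sample soft value regression.

    target   = Q(s, a) - log pi(a | s), with a sampled from the current policy
    loss     = 0.5 * (V(s) - target) ** 2
    residual = V(s) - target, the scalar that multiplies grad_psi V(s)
    """
    target = q_pred - log_prob
    residual = v_pred - target
    return 0.5 * residual ** 2, residual


def mean_value_loss(batch):
    """Average the per-sample loss over (v_pred, q_pred, log_prob) triples."""
    losses = [value_loss_and_residual(v, q, lp)[0] for v, q, lp in batch]
    return sum(losses) / len(losses)


loss, residual = value_loss_and_residual(v_pred=1.0, q_pred=2.0, log_prob=-0.5)
batch_loss = mean_value_loss([(1.0, 2.0, -0.5), (2.5, 2.0, -0.5)])
```

In a real implementation the three inputs would come from the value, Q, and policy networks, and an autodiff library would supply the gradient of V.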
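We train the Q network by minimizing the following error:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^2\right]$$
where
$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\bar{\psi}}(s_{t+1})\right]$$
and $V_{\bar{\psi}}$ is a target value network whose parameters $\bar{\psi}$ are a slowly updated copy of $\psi$.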
For every state-action pair in the experience replay buffer, we want to minimize the squared difference between the prediction of our Q function and the immediate (one time-step) reward plus the discounted expected value of the next state.
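To update the parameters of the Q function, we use the following approximation of the gradient of the above objective:
$$\hat{\nabla}_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t)\left(Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1})\right)$$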
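The Q update is a one-step regression toward the reward plus the discounted value of the next state. As a sketch on plain numbers, assuming the next-state value comes from a separate target value network and ignoring episode termination, it could be written as follows (the names are hypothetical):

```python
def soft_q_loss_and_residual(q_pred, reward, v_next_target, gamma=0.99):
    """Per-sample soft Q regression.

    target   = r + gamma * V_target(s')
    loss     = 0.5 * (Q(s, a) - target) ** 2
    residual = Q(s, a) - target, the scalar that multiplies grad_theta Q(s, a)
    """
    target = reward + gamma * v_next_target
    residual = q_pred - target
    return 0.5 * residual ** 2, residual


q_loss, q_residual = soft_q_loss_and_residual(q_pred=3.0, reward=1.0, v_next_target=2.0)
```

Handling terminal transitions, for example by masking the bootstrap term, is a common addition that this sketch leaves out.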
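We train the policy network by minimizing the following error:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[D_{\mathrm{KL}}\left(\pi_\phi(\cdot \mid s_t) \,\Big\|\, \frac{\exp\left(Q_\theta(s_t, \cdot)\right)}{Z_\theta(s_t)}\right)\right]$$
That is, the policy is pulled toward the distribution obtained by exponentiating the Q-function and normalizing it with $Z_\theta(s_t)$.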
To minimize this objective, the authors use the reparameterization trick. The trick makes sampling from the policy a differentiable operation, so errors can be backpropagated through the sampled action. The policy is now parameterized as follows:
$$a_t = f_\phi(\epsilon_t; s_t)$$
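The epsilon term $\epsilon_t$ is a noise vector sampled from a Gaussian distribution $\mathcal{N}$. Now we can express the objective function as follows:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[\log \pi_\phi\left(f_\phi(\epsilon_t; s_t) \mid s_t\right) - Q_\theta\left(s_t, f_\phi(\epsilon_t; s_t)\right)\right]$$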
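The normalizing function $Z_\theta(s_t)$, which turns $\exp\left(Q_\theta(s_t, \cdot)\right)$ into a proper distribution, does not depend on the policy parameters $\phi$, so it is dropped from the objective.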
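The reparameterization trick described above can be illustrated with a one-dimensional Gaussian policy. Writing the action as mean + std * eps, with eps drawn from a standard normal, makes the action a deterministic and differentiable function of the policy parameters once the noise is fixed. The sketch below uses a stand-in objective, E[a²], chosen only because its exact gradient with respect to the mean (2 * mean) is known. It is not the SAC policy loss, and the function names are hypothetical.

```python
import random


def reparameterized_action(mean, std, eps):
    """a = f(eps; mean, std): differentiable in mean and std for fixed noise eps."""
    return mean + std * eps


def pathwise_grad_mean(mean, std, n_samples=20_000, seed=0):
    """Monte Carlo estimate of d/d(mean) E[a^2] using reparameterized samples.

    Since a = mean + std * eps, d(a^2)/d(mean) = 2 * a, so we average 2 * a
    over noise draws. The exact answer is 2 * mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        a = reparameterized_action(mean, std, eps)
        total += 2.0 * a
    return total / n_samples


grad_estimate = pathwise_grad_mean(mean=1.5, std=0.5)
```

In practice an autodiff framework computes the derivative through the sampled action automatically. The manual derivative here only makes the mechanism visible.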
References
[1] Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conf. on Uncertainty in Artificial Intelligence, 2016.
[2] Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
[3] Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2370–2378, 2016.
[4] https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665