A Brief Summary on Inverse Reinforcement Learning
Inverse Reinforcement Learning
In Inverse Reinforcement Learning (IRL) setting, the reward function is not given explictly but the expert demonstrations are exposed to us. IRL seeks to recover a reward function from (near-)expert demonstration. In [3], the task of learning from an expert is called apprenticeship learning (also leaning by watching, imitation leaning, or learning from demonstration).
Algorithms for Inverse Reinforcement Learning
[3] propose a linear programming method to solve the problem by formulating origin question as a linearly, constrained optimization problem.
Theory Let a finite state space
It's easy to see that the condition
The key idea in this paper is to find a
reward function
The objective is:
For large state space, the linear programming formulation is
. This will be solved by Monte-Carlo simulation and LP method.
Apprenticeship Learning via Inverse Reinforcement Learning
Feature expectations is expressed as
Using this notation, the value of a policy may be written
The problem is the following: we seeks to
find a policy whose performance is close to that of the expert's, on the
unknown reward function
One important step of the algorithm is:
This is equivalent to finding the maximum margin hyperplane
separating two sets of points. So, an SVM solver can be used to find
Max Margin Planning and Learn to Search
[4] propose maximum margin planning method to do imitation learning. It frames the IL problem as a maximum margin structured prediction problem over a space of policies.
The input to the algorithm is a set of training instances
Maximum Margin Planning constrained optimization formulation:
In fact, inverse reinforcement learning is an ill-defined problem: there are many optimal policies that can explain a set of demonstrations, and many rewards that can explain an optimal policy. Max-Ent IRL is to handles the former ambiguity.
Maximum Entropy Principle
Entropy is an old concept in physicals that is used to describe the randomness in environments. The greater the entropy, the more random the actions the policy gives. The discrete form of entropy is:
Similarity, the entropy term for policy has this form:
The entropy term of policy can help the policy to increase the expoloration ability, by adding more possibilities to some rare actions. That could avoid the agent get stuck into local optimum and attain global optimum, to some extend. The idea of learning such maximum entropy model has its origin in statistical modeling, in which the goal is to find the probability distribution that has the highest entropy while still satisfying the observed statistics [1]. The principle of maximum entropy states that the probability distribution with the highest entropy, is the one that best represents the current state of knowledge in the context of precisely stated prior data [2].
Apprenticeship Learning Using Linear Programming
[5] One contribution, if one uses the linear programming approach for finding an MDP's optimal policy as a subroutine, then one can modify [6] algorithm so that it outputs a stationary policy instead of a mixed policy. Second contribution is the formulation of the apprenticeship learning problem as a linear program.
We say a policy
for all
A Game-Theoretic Approach to Apprenticeship Learning
[6] propose a method to produce a policy that may substantially better than the expert's based on a game-theoretic view of the problem. They pose the problem as learning to play a two-player zero-sum game in which the apprentice chooses a policy, and the environment chooses a reward function. The goal of the apprentice is to maximize performance relative to the expert, even though the reward function may be adversarially selected by the environment with respect to this goal.
References
[1] @article{haarnoja2017reinforcement, title={Reinforcement learning with deep energy-based policies}, author={Haarnoja, Tuomas and Tang, Haoran and Abbeel, Pieter and Levine, Sergey}, journal={arXiv preprint arXiv:1702.08165}, year={2017} }
[2] https://towardsdatascience.com/entropy-regularization-in-reinforcement-learning-a6fa6d7598df
[3] @inproceedings{abbeel2004apprenticeship, title={Apprenticeship learning via inverse reinforcement learning}, author={Abbeel, Pieter and Ng, Andrew Y}, booktitle={Proceedings of the twenty-first international conference on Machine learning}, pages={1}, year={2004} }
[4] @inproceedings{ratliff2006maximum, title={Maximum margin planning}, author={Ratliff, Nathan D and Bagnell, J Andrew and Zinkevich, Martin A}, booktitle={Proceedings of the 23rd international conference on Machine learning}, pages={729--736}, year={2006} }
[5] @inproceedings{syed2008apprenticeship, title={Apprenticeship learning using linear programming}, author={Syed, Umar and Bowling, Michael and Schapire, Robert E}, booktitle={Proceedings of the 25th international conference on Machine learning}, pages={1032--1039}, year={2008} }
[6] @inproceedings{syed2008game, title={A game-theoretic approach to apprenticeship learning}, author={Syed, Umar and Schapire, Robert E}, booktitle={Advances in neural information processing systems}, pages={1449--1456}, year={2008} }