A Brief Summary on Inverse Reinforcement Learning

Inverse Reinforcement Learning

In Inverse Reinforcement Learning (IRL) setting, the reward function is not given explictly but the expert demonstrations are exposed to us. IRL seeks to recover a reward function from (near-)expert demonstration. In [3], the task of learning from an expert is called apprenticeship learning (also leaning by watching, imitation leaning, or learning from demonstration).

Algorithms for Inverse Reinforcement Learning

[3] propose a linear programming method to solve the problem by formulating origin question as a linearly, constrained optimization problem.

Theory Let a finite state space , a set of actions , transition probability matrices , and a discount factor be given. Then the policy given by is optimal if and only if, for all , the reward satisfies:

It's easy to see that the condition is necessary and sufficient for to be the unique optimal policy. It represents the gap of expected value on each state between acting optimally and acting suboptimally.

The key idea in this paper is to find a reward function such that it maximize the gap between expert policy and any other policies. To do so, on each state, we want to maximize the gap of expected value of acting optimally and the best expected value of taking suboptimal actions. In other words, we seek to maximize the sum of the differences between the quality of the optimal action and the quality of the next-best action:

The objective is:

For large state space, the linear programming formulation is

. This will be solved by Monte-Carlo simulation and LP method.

Apprenticeship Learning via Inverse Reinforcement Learning

Feature expectations is expressed as

Using this notation, the value of a policy may be written . Given the trajectories generated by the expert, we denote the empirical estimate for by

The problem is the following: we seeks to find a policy whose performance is close to that of the expert's, on the unknown reward function . To accomplish this, we will find a policy such that .

One important step of the algorithm is: .

This is equivalent to finding the maximum margin hyperplane separating two sets of points. So, an SVM solver can be used to find . A simpler algorithm is projection method, as descripted in paper.

Max Margin Planning and Learn to Search

[4] propose maximum margin planning method to do imitation learning. It frames the IL problem as a maximum margin structured prediction problem over a space of policies.

The input to the algorithm is a set of training instances . - : State space - : Action space - : transition probabilities - : d-dimensional feature vectors in the form of - : the desired trajectory - , here, . - Intuitively, a loss vector is placed over state-action pairs that defines for each state-action pair how much the learner pays for failing to match the behavior of an example policy as the cumulative loss of the learned policy through this MDP.

Maximum Margin Planning constrained optimization formulation:

In fact, inverse reinforcement learning is an ill-defined problem: there are many optimal policies that can explain a set of demonstrations, and many rewards that can explain an optimal policy. Max-Ent IRL is to handles the former ambiguity.

Maximum Entropy Principle

Entropy is an old concept in physicals that is used to describe the randomness in environments. The greater the entropy, the more random the actions the policy gives. The discrete form of entropy is:

Similarity, the entropy term for policy has this form:

The entropy term of policy can help the policy to increase the expoloration ability, by adding more possibilities to some rare actions. That could avoid the agent get stuck into local optimum and attain global optimum, to some extend. The idea of learning such maximum entropy model has its origin in statistical modeling, in which the goal is to find the probability distribution that has the highest entropy while still satisfying the observed statistics [1]. The principle of maximum entropy states that the probability distribution with the highest entropy, is the one that best represents the current state of knowledge in the context of precisely stated prior data [2].

Apprenticeship Learning Using Linear Programming

[5] One contribution, if one uses the linear programming approach for finding an MDP's optimal policy as a subroutine, then one can modify [6] algorithm so that it outputs a stationary policy instead of a mixed policy. Second contribution is the formulation of the apprenticeship learning problem as a linear program.

We say a policy is -optimal if . A policy has occupancy measure if

$𝟙$

for all . The goal of apprenticeship learning is to find an apprentice policy such that .

A Game-Theoretic Approach to Apprenticeship Learning

[6] propose a method to produce a policy that may substantially better than the expert's based on a game-theoretic view of the problem. They pose the problem as learning to play a two-player zero-sum game in which the apprentice chooses a policy, and the environment chooses a reward function. The goal of the apprentice is to maximize performance relative to the expert, even though the reward function may be adversarially selected by the environment with respect to this goal.

References

[1] @article{haarnoja2017reinforcement, title={Reinforcement learning with deep energy-based policies}, author={Haarnoja, Tuomas and Tang, Haoran and Abbeel, Pieter and Levine, Sergey}, journal={arXiv preprint arXiv:1702.08165}, year={2017} }

[2] https://towardsdatascience.com/entropy-regularization-in-reinforcement-learning-a6fa6d7598df

[3] @inproceedings{abbeel2004apprenticeship, title={Apprenticeship learning via inverse reinforcement learning}, author={Abbeel, Pieter and Ng, Andrew Y}, booktitle={Proceedings of the twenty-first international conference on Machine learning}, pages={1}, year={2004} }

[4] @inproceedings{ratliff2006maximum, title={Maximum margin planning}, author={Ratliff, Nathan D and Bagnell, J Andrew and Zinkevich, Martin A}, booktitle={Proceedings of the 23rd international conference on Machine learning}, pages={729--736}, year={2006} }

[5] @inproceedings{syed2008apprenticeship, title={Apprenticeship learning using linear programming}, author={Syed, Umar and Bowling, Michael and Schapire, Robert E}, booktitle={Proceedings of the 25th international conference on Machine learning}, pages={1032--1039}, year={2008} }

[6] @inproceedings{syed2008game, title={A game-theoretic approach to apprenticeship learning}, author={Syed, Umar and Schapire, Robert E}, booktitle={Advances in neural information processing systems}, pages={1449--1456}, year={2008} }