0%

Policy Iteration and Value Iteration

Policy Iteration and Value Iteration

Policy iteration and value iteration are two iteration approaches to obtain the optimal policy when the environment dynamics are known for agent.

Policy Iteration [1]

  • Start with a random policy;
  • Then find the value function of that policy (policy evaluation step);
  • Then find a new (improved) policy based on the previous value function (policy improvement step).

In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.

Value Iteration [1]

  • Start with a random value function;
  • Then find a new (improved) value function in an iterative process, until reaching the optimal value function.

The optimal policy can be easily derived from the optimal value function. This process is based on the optimality Bellman operator*.

References

[1] https://stackoverflow.com/questions/37370015/what-is-the-difference-between-value-iteration-and-policy-iteration

[2]