AlphaGo
AlphaGo is a successful application of (deep) reinforcement learning in board games.
Difficulties and Current Methods
The search space of game Go is extremely huge, it's about
- The depth of the search may be reduced by position evaluation
- The breadth of the search may be reduced by sampling actions from a
policy
that is a probability distribution over possible moves in position .
Monte Carlo tree search (MCTS) uses Monto Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The strongest current (now old) Go progams are based on MCTS.
AlphaGo
AlphaGo use neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
- AlphaGo begin by training a supervised learning (SL) policy network
directly from expert human moves; - It also train a fast policy
that can rapidly sample actions during rollouts; - Next, we train a reinforcement learning (RL) policy network
that improves the SL policy network by optimizing the final outcome of games of self-play; - It train a value network
that predicts the winner of games played by the RL policy network against itself.
Supervised learning of policy networks
The SL policy network
AlphaGo also trained a faster but less accurate rollout policy
Reinforcement Learning of Policy Networks
This stage aims at improving the policy network by policy gradient
reinforcement learning. The RL policy network
Reinforcement learning of value networks
The final stage of the training pipeline focuses on position
evaluation, estimating a value function
We train the weights of the value network by regression on
state-outcome pairs
The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting, for the reason that successive positions are strongly correlated. To mitigate this problem, we generated a new self-play data set.
Searching with policy and value networks
AlphaGo combines the policy and value networks in an MCTS algorithm
that selects actions by lookahead search. At each time step
so as to maximize action value plus a bonus
where the outcome
At the end of simulation, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge
where
References
[1] @article{silver2016mastering, title={Mastering the game of Go with deep neural networks and tree search}, author={Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and others}, journal={nature}, volume={529}, number={7587}, pages={484--489}, year={2016}, publisher={Nature Publishing Group} }