0%

Notes on "AlphaGo Zero"

AlphaGo Zero

AlphaGo Zero introduced an algorithm based solely on reinforcement learning without human data, guidance or domain knowledge beyond game rules. This is unlike AlphaGo, which trained its neural networks by supervised learning from human expert moves, and by reinforcement learning from self-play.

AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration.

  • AlphaGo Zero trained solely by self-play reinforcement learning;
  • It uses only the black and white stones from the board as input features;
  • It uses a single neural network, rather than separate policy and value networks;
  • It uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts.

Reinforcement learning in AlphaGo Zero

The new method uses a deep neural network with parameters .

The vector of move probabilities represents the probability of selecting each move (including pass), ; the value is a scalar evaluation, estimating the probability of the current player winning from position .

The MCTS uses to guide its simulations. The MCTS can be viewed as a self-play algorithm.

References

[1] @article{silver2017mastering, title={Mastering the game of go without human knowledge}, author={Silver, David and Schrittwieser, Julian and Simonyan, Karen and Antonoglou, Ioannis and Huang, Aja and Guez, Arthur and Hubert, Thomas and Baker, Lucas and Lai, Matthew and Bolton, Adrian and others}, journal={nature}, volume={550}, number={7676}, pages={354--359}, year={2017}, publisher={Nature Publishing Group} }