0%

Notes on "AlphaGo"

AlphaGo

AlphaGo is a successful application of (deep) reinforcement learning in board games.

Difficulties and Current Methods

The search space of game Go is extremely huge, it's about possible sequences of moves, where is the game's breadth (number of legal moves per position) and is its depth (game length). Exhastive search is infeasible but the effective search space can be reduced by two general principles.

  • The depth of the search may be reduced by position evaluation
  • The breadth of the search may be reduced by sampling actions from a policy that is a probability distribution over possible moves in position .

Monte Carlo tree search (MCTS) uses Monto Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The strongest current (now old) Go progams are based on MCTS.

AlphaGo

AlphaGo training pipeline and Architecture

AlphaGo use neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

  • AlphaGo begin by training a supervised learning (SL) policy network directly from expert human moves;
  • It also train a fast policy that can rapidly sample actions during rollouts;
  • Next, we train a reinforcement learning (RL) policy network that improves the SL policy network by optimizing the final outcome of games of self-play;
  • It train a value network that predicts the winner of games played by the RL policy network against itself.
Monte Carlo Tree Search in AlphaGo

Supervised learning of policy networks

The SL policy network alternates between convolutional layers with weights , and rectifier nonlinearities. A final softmax layer outs a probability distribution over all legal moves . The policy network is trained on randomly sampled state-action pairs , using stocahstic gradient ascent to maximize the likelihood of the human move selected in state

AlphaGo also trained a faster but less accurate rollout policy , using a linear softmax of small pattern features (see Extended Data Table 4) with weights

Reinforcement Learning of Policy Networks

This stage aims at improving the policy network by policy gradient reinforcement learning. The RL policy network is identical in structure to the SL policy network, and its weights are initialized to the same values, . AlphaGo play games between the current policy network and a randomly selected previous iteration of the policy network. A reward is used here. It's for all non-terminal time steps . The outcome is the terminal reward at the game: for winning and for losing.

Reinforcement learning of value networks

The final stage of the training pipeline focuses on position evaluation, estimating a value function that predicts the outcome from position of games played by using policy for both players

We train the weights of the value network by regression on state-outcome pairs , using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value , and the corresponding outcome

The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting, for the reason that successive positions are strongly correlated. To mitigate this problem, we generated a new self-play data set.

Searching with policy and value networks

AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search. At each time step of each simulation, an action is selected from state

so as to maximize action value plus a bonus

. A leaf evaluation is

where the outcome of a random rollout played until terminal step using the fast rollout policy .

At the end of simulation, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge

where is the leaf node from the th simulation, and indicates whether an edge was traversed during the th simulation.

References

[1] @article{silver2016mastering, title={Mastering the game of Go with deep neural networks and tree search}, author={Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and others}, journal={nature}, volume={529}, number={7587}, pages={484--489}, year={2016}, publisher={Nature Publishing Group} }