Google DeepMind’s AlphaGo won 5-game challenge series 4-1 overall in Seoul against the world-class player Lee Se-dol, the top Go player in the world over the past decade. The video streams of the matches are on Youtube. The paper “Mastering the game of Go with deep neural networks and tree search” published in Nature on 28th January 2016, describes the technical details behind the AlphaGo.

The paper has 20 authors. Among them, 18 are from Google DeepMind located in UK, which was acquired by Google a few years back, and the other two are from Google main campus. The first two authors’ contributions to the paper are claimed to be equal.

**1. Problem to solve
**

The main problem that the paper deals with is the enormous search space Go game presents.

**2. Background**

In general, for this kind of problem the search space is huge if we want to look at all legal moves for all possible sequences. So the search space should be reduced. Existing algorithms can reduce the the depth of search by replacing the subtree of a node with an approximate value, instead of expanding and computing the subtree. In addition, the width of the tree can also be reduced by using sampling. The sampling is based on a probability distribution. The probability distribution, also known as the policy or action policy, can be learned. The probability distribution function gives the likelihood of the potential actions.

Problem: the probability distributions generated by existing approaches are too simple and coarse. The patterns on the play board are linearly combined. They couldn’t represent the nonlinear relationships between the moves.

**3. Key ideas/technologies**

In this paper, the Go game board is treated as a 19×19 image. Then different positions in Go games are treated as images. The paper has two convolutional neural networks. One outputs the probability of each legal move and the other one outputs the potential value of each move. The probability represent how likely a professional will take that move and the value gives the prediction of the final outcome of the move with the trained policy.

The first neural network outputting probabilities, is called policy network that is trained in two steps.

**First**, **supervised learning (SL)**, expert human moves (textbook moves, known previous games, and moves demonstrated by experts, etc) are used to train a **policy network**. The policy network is a neural network with the image of a position (image of the board with some whites and blacks) as a input and the probability of all potential moves as the output. The neural network has 13 layers and the output layer actuation functions are softmax (a typical actuation function to output confidence/probability of each legal move). The training data set is called KGS data set that has 30 million positions and 30 million expert moves responding to the positions. They are from 160,000 games played by KGS 6 to 9 dan human players. The authors applied many smart tricks to increase the training data set size, including the rotations and eight reflections of the positions in the data set. Stochastic gradient ascent is used to backpropagate the error and train the weights in the network.

The training took 3 weeks for 340 million training steps with mini-batch size of 16 and adaptive step-sizes on 50 GPUs. At the end of training, the neural network was capable to predict expert moves with an accuracy of 55.7% using a current board position and the move history as inputs, and a little bit higher if adding more features. The authors also tried a smaller network with linear softmax to increase speed, but it has less accuracy.

After this SL training, the neural net is good at identifying position patterns on the board and give a probability to each legal move. The move corresponding to the highest probability should be very similar to the move played by a professional. At this point, the AlphaGo should have a good chance to beat amateurs, but not top professionals. The position patterns the AlphaGo can identify after SL training is bounded by the patterns seen by professionals.

**Then**, based on the learned policy network, the training continues with **reinforcement learning (RL)**. The weights learned from the previous step are used as the initial values. The reward is defined to be +1 for winning and -1 for losing at the end of the game that is played in simulation between the current policy network and the policy network from a randomly selected previous iteration (similar to you play with yourself after a time travel back to the past). Basically, the policy networks from previous iterations form an opponent pool. The current policy network play some of them in parallel. According the the award, the weights are adjusted using the reinforce algorithm. After 500 iterations, the current policy network is added into the opponent pool. This playing-with-its-younger-versions went through 10,000 mini-batches of 128 games (in parallel) for one day using 50 GPUs. At the end of training, the outcome policy network after RL won more than 80% of games against the policy network after SL and before RL.

The RL allows the AlphaGo to evolve by itself and not to be bounded by the existing plays in any data set. It would be interesting to compare the positions in the 1.3 million simulated games with the positions in KGS data set.

**The second convolutional neural network** is called **value network** that has almost the same structure as the policy neural network other than the output layer. Instead of using softmax that is for probability output, the value network uses tanh as the actuation functions for the output layer. The outputs are the values for all legal moves. The value estimates how likely a move will eventually lead to a win after a sequence of perfect moves. Similar stochastic gradient descent is used to search for weights in the network.

To deal with overfitting problem, the authors generated a new self-play dataset of 30 million distinct positions, and each sampled from a separate game (to make sure the positions are independent). Each game was played between the trained RL policy network and itself until the game terminated to give win/lose training labels. The positions and the results are then used to train the prediction model. This step went through 50 million mini-batches of 32 positions using 50 GPUs and finished in one week.

Now, after training the two convolutional neural nets, both the probability and the value of each move for a position can be computed from them. It is really for Monte Carlo tree search (MCTS).

The paper has more details on the structures of the networks, the computation of the gradients and the training tricks.

**4. Results**

Well, we all know the results.

**5. Ideas**

There is no doubt that both networks can be improved over time with more training, more simulated plays for reinforcement learning, and better tuning of the training. The learning is not only on the existing moves of known positions, but also on simulated moves and positions. So AlphaGo would have seen many more position patterns than any human player.