Beyond awesome | Surpassing and defeating oneself

Problem

This paper [1] is a follow-up to the DeepMind team’s previous work, AlphaGo [2]. AlphaGo tackled the game of Go, which is considered the most challenging classic game for AI because of its enormous search space. With a 19x19 board and hundreds of legal moves in a typical position, running vanilla minimax is unrealistic. By combining Monte-Carlo Tree Search with two convolutional neural networks (a policy network and a value network), AlphaGo was able to play Go efficiently, and it was strong enough to beat the world champions Lee Sedol and Ke Jie. The problem with AlphaGo was that it required huge amounts of data from human experts. These resource requirements make AlphaGo hard to train, and to some extent they also tie its playing strength to human expert level.
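
To make the size of that search concrete, here is a quick back-of-the-envelope estimate in Python. The numbers are my own rough assumptions, not figures from the paper: with 361 intersections, a branching factor in the hundreds, and games that routinely last well over a hundred moves, even a crude bound on the full game tree is astronomically large.

    # Rough estimate (my own assumptions, not figures from the paper) of why
    # vanilla minimax is hopeless for Go: a full-width tree has about b^d nodes.
    AVG_BRANCHING = 250      # assumed average number of legal moves (of 361 points)
    AVG_GAME_LENGTH = 150    # assumed typical game length in moves

    tree_size = AVG_BRANCHING ** AVG_GAME_LENGTH
    print(f"roughly 10^{len(str(tree_size)) - 1} positions")   # ~10^359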

Approach

Because the search tree is so large, the AlphaGo Zero model (like its predecessor) uses the Monte-Carlo Tree Search (MCTS) algorithm to expand only the most promising nodes. AlphaGo Zero improves upon AlphaGo and introduces a new approach for training Go AIs without any human supervision. In contrast to AlphaGo, which first trains its agents to mimic the moves made by human expert players, AlphaGo Zero trains entirely by self-play, i.e. by playing against itself. While the old AlphaGo used a separate policy network and value network, the new model uses a single neural network, with a policy head and a value head, to guide the lookahead search. Through self-play, the network learns to steer the Monte-Carlo tree search toward stronger moves, and the improved search in turn produces better training targets for the network. With the help of [4], I was able to mostly figure out how the model works.
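
As a concrete illustration, below is a minimal sketch of one network-guided MCTS simulation. It is my own simplification, not the authors’ code: the Node bookkeeping follows the paper’s (P, N, W, Q) statistics, selection uses a PUCT-style rule, and `game` and `net` are hypothetical interfaces (legal_moves/play/is_over/outcome for the game; a callable returning move priors aligned with the legal moves plus a scalar value for the network).

    import math

    class Node:
        # Per-node statistics, following the paper's (P, N, W) bookkeeping;
        # the class layout itself is my own sketch.
        def __init__(self, prior):
            self.prior = prior        # P(s, a): prior from the policy head
            self.visits = 0           # N(s, a)
            self.value_sum = 0.0      # W(s, a)
            self.children = {}        # action -> Node

        def q(self):                  # Q(s, a) = W / N
            return self.value_sum / self.visits if self.visits else 0.0

    def select_child(node, c_puct=1.5):
        # PUCT selection: Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))
        total = sum(child.visits for child in node.children.values())
        def score(child):
            return child.q() + c_puct * child.prior * math.sqrt(total) / (1 + child.visits)
        return max(node.children.items(), key=lambda item: score(item[1]))

    def simulate(root, game, net):
        # One simulation: walk down the tree with PUCT, expand the leaf using
        # the single policy+value network, then back the value up the path.
        # (In practice `game` would be copied so each simulation is independent.)
        node, path = root, [root]
        while node.children and not game.is_over():
            action, node = select_child(node)
            game.play(action)
            path.append(node)
        if game.is_over():
            value = game.outcome()                    # terminal game result
        else:
            priors, value = net(game)                 # one network call at the leaf
            for action, p in zip(game.legal_moves(), priors):
                node.children[action] = Node(p)
        for n in reversed(path):                      # backup
            n.visits += 1
            n.value_sum += value
            value = -value                            # alternate perspectives each ply

After many such simulations from the current position, the visit counts at the root become the improved move probabilities that both select the actual move to play and serve as the training target for the policy head.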

Contributions

This work shows that, even in challenging domains, game-playing agents can be trained without any human data and can perform even better than humans. In the authors’ own words, it demonstrates that “superhuman performance can be achieved without human domain knowledge.”

This paper also lays the foundation for the next major milestone in game-playing AI, AlphaZero [3]. AlphaZero generalizes AlphaGo Zero to other games, such as Chess and Shogi. The three Alpha* models together also provide a guide for future research on applying neural networks to game-playing.

Novelty, significance, results

This (world-famous) work is groundbreaking in the field of game-playing AI. It is the first to show that a game-playing AI can learn entirely on its own, without human input, and it lays the foundation for future AI research. The new single-network architecture also simplifies the model. Empirically, after only a few days of self-play training, the new model was able to beat the old one, which to some extent envisions a future where AIs might “outsmart” humans in at least some specific fields.

Strengths

One of the strengths of this work is that it integrates a traditional game-playing technique (tree search) with recent deep-learning technology, namely residual convolutional neural networks. The residual network learns its features directly from the raw board state, which removes the need for hand-crafted feature engineering and makes the model both effective and simple to build. The overall training process is also straightforward and easy to follow. The mathematics in the paper is well explained and documented, which is not a given for a top-tier journal paper.
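
To illustrate the architecture described above, here is a scaled-down sketch of a dual-head residual network in PyTorch. It is my own illustration, not the authors’ configuration: the real network is much deeper and wider (dozens of residual blocks with 256 filters), and the layer sizes and names here are assumptions.

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        # conv -> BN -> ReLU -> conv -> BN, plus a skip connection.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return torch.relu(out + x)

    class PolicyValueNet(nn.Module):
        # A small dual-head residual network over raw board-feature planes.
        def __init__(self, in_planes=17, channels=64, blocks=4, board=19):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU())
            self.trunk = nn.Sequential(*[ResBlock(channels) for _ in range(blocks)])
            # Policy head: logits over all board points plus pass.
            self.policy = nn.Sequential(
                nn.Conv2d(channels, 2, 1), nn.Flatten(),
                nn.Linear(2 * board * board, board * board + 1))
            # Value head: scalar position evaluation in [-1, 1].
            self.value = nn.Sequential(
                nn.Conv2d(channels, 1, 1), nn.Flatten(),
                nn.Linear(board * board, 64), nn.ReLU(),
                nn.Linear(64, 1), nn.Tanh())

        def forward(self, x):
            h = self.trunk(self.stem(x))
            return self.policy(h), self.value(h)

A (batch, 17, 19, 19) tensor of board planes goes in, and move logits plus a position value come out, so a single forward pass supplies both the search priors and the leaf evaluation.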

Critiques, Weaknesses

In spite of the great innovation this work brings to game-playing AIs, like most neural-network-based models it is still very expensive to train. The authors mention that AlphaGo Zero was able to beat AlphaGo after three days of training, but that training used 64 GPU workers; AlphaGo Zero therefore remains far too expensive for ordinary users to train. It would also be nice if the authors provided more detail on the parallel training setup for readers who aren’t familiar with that area.
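
That said, the self-play data generation itself is embarrassingly parallel, because games are independent of one another. The sketch below is my own illustration of that basic idea, not DeepMind’s setup; run_self_play is a placeholder for the self-play loop sketched earlier.

    from multiprocessing import Pool
    import random

    def run_self_play(seed):
        # Placeholder: a real version would run the MCTS self-play loop
        # sketched earlier and return (state, search_probs, outcome) examples.
        random.seed(seed)
        return [("state", "search_probs", random.choice([-1, 1]))]

    def generate_games(num_games=32, workers=8):
        # Self-play games are independent, so they can simply be fanned out
        # across worker processes (or, at larger scale, across machines).
        with Pool(processes=workers) as pool:
            games = pool.map(run_self_play, range(num_games))
        return [example for game in games for example in game]

    if __name__ == "__main__":
        print(len(generate_games()))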

References

[1] Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

[2] Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

[3] Silver, D. et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. Preprint at arXiv:1712.01815 (2017).

[4] Nair, S. Simple Alpha Zero. (2017).