Beyond awesome | 越而胜己

High Level Summary

In this paper, Newcombe, et al. discuss Amazon’s experience adopting formal methods on complex real-world systems at Amazon Web Services. They adopt the TLA+ language to write formal specifications at various abstraction levels, and find that TLA+ offers a lot of benefits in designing complex systems. They help prevent bugs, enable aggressive optimizations, improve understanding and productivity, etc. Aside from the benefits of adopting the formal specification language, the authors also talk in depth about the steps that Amazon took to adopt this formal method in 7 essential AWS services and advertise it to more teams.

Strengths

I really appreciate how the authors focus half of the paper on how Amazon started to adopt TLA+ and how it became popular with more and more groups at Amazon. It is interesting to see the “deceiving” tricks such as avoiding mentioning words like “formal”, “verification”, “proof”, etc. when persuading engineers to use TLA+. It seems reasonable, because these daunting buzzwords have a reputation for being “impractical.”

The paper also has great significance, because it shows that formal methods, commonly believed to only apply to smaller and simpler systems, can work very well for complex, real-world systems. The success of TLA+ at Amazon proves the practical value of formal methods in computer science. Hopefully, this will change formal methods’ reputation of being “impractical” such that in the future, companies won’t have to go through the same painful process of promoting such formal tools.

Issues

One limitation, as the authors mention themselves, is that the authors stopped looking for other formal solutions once they found TLA+. Indeed, I feel that the lack of comparison to other formal spec languages might make the paper more effective. However, this seems to be common for papers from the industry, which focus on practicality rather than academic breadth and exhaustiveness.

Also, although the authors justify the lack of TLA+ snippets with trying not to put off potential users, I find the article slightly abstract as I have no idea how TLA+ really looks like and works. Although the authors link resources to TLA+ tutorials, etc., I think it would be more convincing and helpful if the authors could include at least a high level depiction of what TLA+ would look like in AWS’s use cases.