Tale of an etcd upgrade

This blog post is about how even a simple, straightforward task rarely goes according to plan in a distributed system. Always expect the unexpected. I work at Microsoft on the Azure Kubernetes team, and was recently asked to investigate upgrading legacy etcd v2 (2.2) to v3 (3.2). We run a large number of etcd clusters in production, spread across multiple regions, and many of them are still running v2. Because each cluster runs in an N-master configuration, you can imagine the complexity of upgrading that many etcd instances with zero downtime.