Tale of etcd upgrade

This blog post is about how a simple, straightforward task never goes according to plan in a distributed system. Always expect the unexpected. I work at Microsoft on the Azure Kubernetes team, and was recently asked to investigate upgrading legacy etcd v2 (2.2) to v3 (3.2). We run a large number of etcd clusters in production, spread across multiple regions, and many clusters are still running v2. With an N-master configuration, you can imagine the complexity of upgrading that many etcd instances with zero downtime.

Actions performed across the fleet include:

Pre-Monitoring

Having solid monitoring on our end is the critical piece of any operation performed against production. This includes not just monitoring etcd itself, but also the applications depending on etcd (in our case, Kubernetes clusters). It also gives us confidence in our pre- and post-upgrade processes.
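
To give a flavor of what a basic check looks like (this is not our actual monitoring stack), a spot check of both layers can be as small as two HTTP probes. The ports below are the defaults and are assumptions about the host:

    #!/usr/bin/env bash
    # Minimal health spot-check sketch: probe etcd and the Kubernetes apiserver.
    # Ports are assumptions (etcd client port 2379, apiserver local insecure port 8080).
    set -euo pipefail

    # etcd serves /health on its client URL; a healthy member returns {"health": "true"}.
    curl -fsS http://127.0.0.1:2379/health

    # The apiserver serves /healthz; a healthy control plane returns "ok".
    curl -fsS http://127.0.0.1:8080/healthz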

Detection

Any upgrade process starts with a detection phase: figuring out which etcd version each cluster is actually running.

Like many other teams out there, we hadn't thought it was important to feed this into our monitoring data. Let's be honest, it's something we only needed to do once (at least, that's what we thought initially). No sane engineer would log in to a large number of instances and check the running version by hand. Luckily, one of the engineers on our team had installed and configured Rundeck, which can run commands and scripts, perform custom orchestration, and a whole lot more (it's pretty awesome).
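
For reference, the detection command itself is tiny; something like the one-liner below, fanned out through Rundeck, does the trick. It assumes the default client port (2379), and the shape of the response tells you whether a member is on v2 or v3:

    # Print the running etcd version on this member (default client URL assumed).
    curl -fsS http://127.0.0.1:2379/version
    # A v3 member answers with something like {"etcdserver":"3.2.9","etcdcluster":"3.2.0"},
    # while a v2 member answers with something like {"releaseVersion":"2.2.5","internalVersion":"2"}.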

Testing

Being a decent engineer, before starting the complex upgrade I did some extensive research on the upgrade process. Armed with this half-baked knowledge, I set out to test it on our test etcd cluster, which was running 2.2. I successfully upgraded it to 2.3 using a simple script run on each machine, because that's the recommended process.
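
A minimal sketch of that per-machine script, assuming etcd runs under systemd and the 2.3 binary has already been staged on the box, looks like this:

    #!/usr/bin/env bash
    # Per-node rolling upgrade sketch; run one member at a time.
    # The unit name, binary paths, and staging location are assumptions about the host.
    set -euo pipefail

    NEW_BINARY=/usr/local/bin/etcd-2.3   # hypothetical path where the new build was staged

    sudo systemctl stop etcd
    sudo cp "${NEW_BINARY}" /usr/local/bin/etcd
    sudo systemctl start etcd

    # Wait until the member rejoins and the whole cluster reports healthy (v2 API).
    until etcdctl cluster-health; do
      sleep 2
    done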

The upgrade went smoothly, so with rushing overconfidence and battle-tested scripts I went ahead with the upgrade from 2.3 to 3.2. This upgrade ended in disaster (oh! the horror): etcd panicked and refused to start.

Rolling the binary back to v2.3, or trying v3.1 instead, didn't help either. I didn't try 3.0, since I remembered reading about a data corruption issue (which applied only to the v3 store). A frantic search led me to a known issue: etcd only supports upgrading one minor version at a time. After taking a small break, I went back to the drawing board, read the upgrade guide again, and moved etcd to v3.0, which brought it back up safely.

This brought us back to the detection phase, this time to check for v2 stores in the clusters already running etcd v3 (there were none). The full upgrade process involved:

  1. v2.2 to v2.3
  2. v2.3 to v3.0
  3. v3.0 to v3.1 (surprisingly doesn’t panic)
  4. Offline migration from v2 to v3 store, because online migration was just too exciting.
  5. v3.1 to v3.2

There was one problem with step 4: we had to shut down the upstream application using etcd, which in our case is the Kubernetes apiserver. Continuing my research, I came across the following great blog post. It was time to whip out a personal Kubernetes cluster, and the easiest approach I could find was my personal acs-engine cluster with an etcd v2 backend.

I whipped up the following script for the migration (a rough sketch of its core follows the list):

  1. Check that etcd is healthy
  2. Check that etcd has no v3 keys (if it does, error out)
  3. Back up the v2 store
  4. Shut down the Kubernetes apiserver (disable the static manifests used by kubelet) on all masters
  5. Wait for the etcd raft indexes to converge (easily checked with etcdctl endpoint status)
  6. Shut down etcd on all masters
  7. Run etcdctl migrate on all etcd cluster instances
  8. Start etcd on all masters
  9. Check that etcd is healthy
  10. Back up the v3 store
  11. Attach leases to keys in {ROOT}/events; I extracted the required code to https://github.com/awesomenix/etcdattachlease
  12. Change the --storage-backend flag from etcd2 to etcd3 and start the Kubernetes apiserver
  13. Clean up the v2 store
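
The real script has far more guard rails, but a rough sketch of steps 4 through 12 (backups omitted), assuming systemd-managed etcd, static pod manifests under /etc/kubernetes/manifests, and placeholder endpoints, looks like this:

    #!/usr/bin/env bash
    # Rough sketch of the offline v2 -> v3 migration (steps 4-12 above); backups omitted.
    # Unit names, paths, endpoints, and the etcdattachlease flag are assumptions.
    set -euo pipefail

    ETCD_ENDPOINTS="http://master-0:2379,http://master-1:2379,http://master-2:2379"  # placeholder
    ETCD_DATA_DIR=/var/lib/etcd                                                      # placeholder

    # 4. Stop the apiserver by moving its static manifest out of kubelet's manifest path.
    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

    # 5. Watch the raft indexes until all members report the same value.
    ETCDCTL_API=3 etcdctl --endpoints "${ETCD_ENDPOINTS}" endpoint status -w table

    # 6-8. With etcd stopped, copy the v2 keyspace into the v3 store (run on each member).
    sudo systemctl stop etcd
    ETCDCTL_API=3 etcdctl migrate --data-dir "${ETCD_DATA_DIR}"
    sudo systemctl start etcd

    # 9. Confirm the cluster is healthy before touching the apiserver again.
    ETCDCTL_API=3 etcdctl --endpoints "${ETCD_ENDPOINTS}" endpoint health

    # 11. Re-attach leases to the migrated events keys (flag name assumed).
    etcdattachlease --etcd-address "${ETCD_ENDPOINTS}"

    # 12. Set --storage-backend=etcd3 in the apiserver manifest, then restore it.
    sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/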

The script worked flawlessly on our test etcd clusters. I wish we could have used the official Kubernetes etcd migration tooling, since it also has rollback support, but it didn't fit our needs.

Upgrade

Armed with a solid script and successful upgrades of multiple private and test etcd clusters, I started the official production upgrade, following our safe deployment process: small, then medium, then large production regions. We quickly ran into a problem on one etcd cluster after four had been upgraded: all kubelets flapped between Ready and NodeNotReady. A quick search revealed that Kubernetes wasn't good at resolving conflicts in etcd, referenced in one issue and fixed by another in 1.9 and later. Since we were running 1.8, this wasn't helpful at all. We caught the problem with our monitoring stack (kudos to excellent monitoring). Time to panic, since rolling back to v2 wasn't easy, and we might also have newer keys that would need to be migrated again (which is handled cleanly in https://github.com/kubernetes/kubernetes/tree/master/cluster/images/etcd). In desperate times, you get some averagely good ideas.

I quickly logged on to one of the masters:

This recovered the Kubernetes cluster and the nodes stopped flapping; some instances required restarting the kubelet to recover. Disaster averted. We faced the same issue on three other clusters, and the procedure above got us back online within 10 minutes each time. That's a success rate of 99.94%, and I'd take that any day.

Post-Monitoring

During the upgrade process we had monitoring for:

  1. Nodes going NotReady, which impacts customers because Kubernetes keeps rescheduling their pods (a quick spot-check sketch follows this list)
  2. Customer impact during upgrades, since any changes customers make to their stack at that time might be affected
  3. Any other unexpected failures
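
As a trivial illustration of the first check (the real alerting ran through our monitoring pipeline), a kubectl one-liner is enough for a manual spot check between rollout stages, assuming an admin kubeconfig on the box running it:

    # Count nodes that are not Ready; anything non-zero mid-upgrade deserves a look.
    kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { count++ } END { print count+0 }'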

After the successful fleet-wide etcd upgrades, we opted to use InSpec to make sure our infrastructure stays up to date and compliant whenever new services or stacks are introduced.

Conclusion

I'll end this tale with a message: having solid monitoring will always help you tackle large production changes with high confidence.