Tale of etcd upgrade

Posted on Jul 10, 2019

This blog post is about how a simple straightforward task never goes according to plan in a distributed system. Always expect the unexpected. I work at Microsoft, Azure Kubernetes team, and was recently asked to investigate upgrading legacy etcd v2(2.2) to v3(3.2). We run a large number of etcd clusters in production, spread across multiple regions, and many clusters are still running v2. Running an N master configuration, you can imagine the complexity of upgrading a large number of etcd instances with zero downtime.

Actions performed across the fleet include:

Pre-Monitoring
Detection
Testing
Upgrade
Post-Monitoring

Pre-Monitoring

Having solid monitoring on our end is the critical piece in any operation performed against production. This includes not just monitoring etcd itself, but also the application depending on etcd (in our case Kubernetes clusters). This also brings confidence in our pre and post upgrade processes.

Detection

Any upgrade process involves a detection phase:

etcd versions
etcd backend store, we needed to upgrade from v2 to v3 (mvcc) store (realized at a later point)
etcd security, making sure it’s using TLS (realized at a later point)

Like any other team out there, we didn’t think it was important to log this into our monitoring data. Let’s be honest, it’s something we need to do one-off (at least that’s what we thought initially). No sane engineer would log in to large number of instances and check the running version. One of our engineers in the team had installed and configured rundeck, which can run commands, scripts, perform custom orchestration…a long list of features(it’s pretty awesome).

Testing

Being a decent engineer, before starting the complex upgrade process, l begin an extensive research process on upgrade process. Armed with this half-baked knowledge, I set out to test the process on our test etcd cluster which was running 2.2. I successfully upgraded to 2.3 using a simple script run on each machine, because that’s the recommended process.

Check etcd is healthy
Detect etcd version (if 2.2 upgrade to 2.3)
Backup v2 store
Download and replace with 2.3
Restart etcd
Check that etcd is healthy
Wait for two minutes

The upgrade process went smoothly, so with rushing over confidence and battle tested scripts I went ahead with upgrade from 2.3 to 3.2. This upgrade ended in disaster (oh! the horror),

etcd would not start
Always ended up with panic due to missing snap file.

Replacing it with v2.3 or v3.1 didn’t help either. I didn’t try 3.0 since I remembered reading about data corruption issue (it applied only to v3 store). A frantic search led me to known issue. After taking a small break, went back to drawing board and read the upgrade guide again, moved etcd to v3.0, which safely started etcd.

This brought us back to detection phase to check for etcd v2 store in current etcd v3 clusters (there were none). The upgrade process involved:

v2.2 to v2.3
v2.3 to v3.0
v3.0 to v3.1 (surprisingly doesn’t panic)
Offline migration from v2 to v3 store, because online migration was just too exciting.
v3.1 to v3.2

There was one problem with step 4. We had to shut down upstream application using etcd — in our case, Kubernetes (apiserver). Continuing my research came across the great following blog. It was time to whip out personal kubernetes cluster. The easiest approach I could find was having my personal acs engine cluster with etcd v2 backend.

I whipped out the following script for migration

Check that etcd is healthy.
Check to make sure etcd had no v3 keys (if yes error out)
Backup v2 store
Shutdown Kubernetes api server (disable manifests used by kubelet) on all masters
Wait for etcd to converge raft indexes (could be easily done with etcdctl endpoint status)
Shutdown etcd on all masters
Run etcdctl migrate on all etcd cluster instances
Start etcd on all masters
Check etcd is healthy
Backup v3 store
Attach leases to keys in {ROOT}/events, i extracted required code to https://github.com/awesomenix/etcdattachlease
Change — storage-backend from etcd2 to etcd3 option and start kubernetes api server
Cleanup v2 store

The script worked flawlessly in our test etcd clusters. I wish we could have used official kubernetes etcd migration, since it also has rollback support, but it didn’t fit our needs.

Upgrade

Armed with solid script and successful testing of upgrading multiple private and test etcd clusters, I started official production upgrade process. Following safe deployment process (in our case upgrade), in order of small, medium, and large production regions. We quickly ran into a problem on 1 etcd clusters after 4 were upgraded. All kubelets flapped between Ready and NodeNotReady. A quick search revealed that kubernetes wasnt good at resolving conflicts in etcd, referenced in issue and fixed by issue in 1.9 and later. Since we were using 1.8, this wasn’t helpful at all. We caught this issue with our monitoring stack (kudos to excellent monitoring). Time to panic, since rolling back to v2 wasn’t easy. Also, we might have newer keys which needs to be migrated as well (which is handled cleanly in https://github.com/kubernetes/kubernetes/tree/master/cluster/images/etcd). In desperate times, you get some average good ideas.

I quickly logged on to master:

Shutdown kubernetes api server
Backup v3 store on current etcd master
Shutdown etcd on all master instances
Copy over the backup to other slave instances
Perform disaster recovery using v3 backup
Start etcd on all master instances
Start kubernetes api server

This recovered Kubernetes cluster and the nodes stopped flapping states, in some instances required restarting kubelet to recover. Disaster averted. We faced similar issue on 3 other clusters as well, and above procedure got us back online within 10 mins. Success rate of 99.94%, I’d take that any day.

Post-Monitoring

During the upgrade process we had monitoring for:

Nodes getting into NotReady, which causes customers to be impacted due to kubernetes rescheduling pods regularly
Customers getting affected during upgrades, any modifications to their stack might be affected
Any other unexpected failures

After successful fleet wide etcd upgrades, we opted to inspec, making sure our infrastructure is up-to-date and compliant when new services/stack is introduced.

Conclusion

End this tale with a message, having solid monitoring will always help tackle any large production changes with high confidence.