Those of you who have done cloud deployments know that Murphy’s Law is always right around the corner. You’ve tested everything, and you know more testing will only make you non-competitive. So you push the release to the cloud. Then BAM—all hell breaks loose. Your boss is screaming at you. You scramble, roll back, and take another month to test your stuff.
After all, screaming bosses are (marginally) better than screaming customers. But now you’ve taken so long that your competitors have captured the key market you wanted to win.
Staying ahead of error creep
Apigee is a large scale cloud deployment, with over a dozen teams building and releasing capabilities to hundreds of enterprise customers and thousands of small business and trial customers. We’ll discuss our build/test/deploy model separately, but we had to get ahead of the problem of “errors creeping up in new deployments,” or we’d be dead.
To that end, a team built out the blue/green deployment model for Apigee software. Assume we have a current release, say 16.04.20 (the red and mustard green in the figure below), and that the customer has two message processors (MPs). We want to switch those two MPs over to a new release: 16.05.14.
We can stand up two new message processors (the magic of cloud and the seeming infiniteness of compute capacity)—shown in green and purple below—load them with the new software, and cut all traffic over at once. Bad idea, right? Why not instead send some of the traffic to the new MPs and watch whether error rates change for the worse? If they don’t, send more traffic their way, and eventually shift all of it to the new MPs.
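The incremental shift can be thought of as a simple control loop: raise the new fleet’s share of traffic one step at a time, and drain it entirely the moment its error rate looks worse than the baseline. Here’s a minimal sketch in Python—the function names, step size, and tolerance are illustrative assumptions, not Apigee’s actual mechanism:

```python
def next_weight(current: float, new_ok: bool, step: float = 0.10) -> float:
    """Advance the share of traffic sent to the new MPs by one step,
    or drop it to zero if the new release looks unhealthy."""
    if not new_ok:
        return 0.0  # abort: drain the new MPs entirely
    return min(1.0, current + step)

def shift_traffic(baseline_rate: float, new_rate: float,
                  tolerance: float = 0.005) -> list[float]:
    """Ramp the new fleet from 0% to 100% of traffic while its error
    rate stays within tolerance of the baseline. Returns the sequence
    of traffic weights applied, ending at 1.0 (done) or 0.0 (rollback)."""
    weight, history = 0.0, []
    while weight < 1.0:
        healthy = new_rate <= baseline_rate + tolerance
        weight = next_weight(weight, healthy)
        history.append(weight)
        if weight == 0.0:  # rolled back; stop ramping
            break
    return history
```

A real controller would re-measure error rates at every step and hold each weight long enough to gather meaningful traffic; this sketch just captures the ramp-or-rollback decision.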
Invisibility isn’t magic
That is what blue/green is about. In the graph below, see how we incrementally send more traffic to the new MPs (our load balancing is pretty good, so often the mustard green and red lines overlap, and so do purple and green). And right below it, see how the customer’s traffic sees no effect? Magic, right? Not really—just tons of hard work behind the scenes. Thanks team!
The graph below shows that while we were switching the traffic from one set of servers to another, adding new servers and removing old ones, it was completely invisible to the API clients.
The traditional approach took out part of the capacity, upgraded it, and then added it back to the live pool. This approach has the following limitations:
- It restricts the release window, as you have to find a low-traffic period.
- It requires you to pray that the cloud provider does not terminate any remaining servers during the release window.
- When an issue is encountered, rollbacks often have to be carried out during peak hours, as that’s when customers tend to notice regressions.
Reliability, speed, and empowerment
The blue/green approach overcomes these limitations: we constantly add capacity, run the new release through both low- and peak-traffic periods to make sure it looks good, and only then slowly take down the older release.
Now, why would a customer care about this?
- We deliver better, more reliable software at higher velocity.
- Over time, we will expose this capability to customers, so they can employ blue/green deployment for their API changes on our infrastructure.
Welcome to the world of cloud!
You might wonder how we can monitor all the dashboards at once and decide that a new release isn’t causing issues. We’ll cover the automated feedback loop in a future post.
Image: Flickr Creative Commons/LHOON