Multicloud: Taming the Rookery
It was late. I was with my old platform team in a previous job, working with a client development group on a design for a fairly straightforward migration from a local application to a cloud service, and we couldn’t come up with an elegant design the client would agree to.
There were just too many systems and too much spaghetti code. It seemed like a wholesale cutover, with the significant downtime that comes with it, was the best way to do it. The planned downtime would be bad for business and bad for the customer experience, but there just didn’t seem to be another way. The client architects and developers felt that it was not feasible to “lift & shift” the backend systems without taking time for the “lift”: while the data was in transit, so to speak, there was no way to take traffic. Even though we spelled out the inevitable consequences, this was the path that the “architecture review board” approved.
So, that was the path we took. It was painful: an expected downtime of eight hours turned into almost 24 hours because of configuration issues with downstream systems (why didn’t anyone document IP whitelisting in the modulated frogenator?*). By the time the problem was solved, customers were frustrated: downstream consumers weren’t getting the alerts they had signed up for, stores weren’t getting the foot traffic, and those stores were doubtless looking for alternatives.
If I had to write a Root Cause Analysis in a postmortem for this event, it would be a one-liner: “The hairy backend systems were just too hairy.”
When we think of all of our backend systems, some of which may span hundreds or even thousands of services, the image of multiple albatrosses comes to mind. When I looked up the term for a group of albatrosses, my Googling landed me on the word “rookery.”
Revised Root Cause Analysis: “Too many albatrosses in the rookery.”
Sometimes it seems like the multitude of interdependencies built into monolithic backend systems (frequently depicted as an onion of layers, from the network core to various networking layers at the edge) is too great to assess and fully document. A documentation effort of that kind can take person-years, is prone to human error, and may still result in service interruption. The ever-pressing need to move to the cloud and deliver cloud-based platforms that support platform business models cannot wait person-years for it.
There has to be a better way…