How to Become Google, Amazon, or Facebook (In Theory)
At Apigee, we are doing some phenomenal things. More importantly, because of our work with customers, we have a bird's-eye view of the digital revolution. As CTO, I get to participate in these incredible things, and that participation gives me a perspective on technology and business trends. Those of you who know me understand that I’m not shy about expressing my opinions. Opinions are free to give, but writing them down helps me think clearly (the act of writing is very clarifying) and, hopefully, is also useful to the reader.
So here’s our new blog channel, "CTO Musings." I’m retiring my personal blog (some would argue I had already retired it years back). One of my tests for blogging is that one post in five should make me think: "Shoot, should I have said that?" In other words, I should be a bit controversial, and occasionally over the edge. Keep me honest here, folks: if you find the posts too bland, the corporate lawyers have gotten to me, so call me out (this sentence itself qualifies for that one in five).
Most importantly, we should learn together. I hope that with the dialog we generate, knowledge advances.
We admire Google, Amazon, Azure, Netflix, Facebook, LinkedIn, Twitter, Swift. How many more of these are there? Hundreds? Perhaps. Thousands? No way. Why? Because it’s bloody hard to build and operate systems like these.
As we have built Apigee Edge, we would like to be counted among those hundreds; after all, we handle more than 300 billion calls a year with five 9s of availability for thousands of customers, in a Chaos Monkey environment. We have also realized that it has taken us more than four years to get here, and that the lessons we have learned are impossible to learn in less time or at smaller scale. That’s why so few get to this point. I’m not talking about computer science principles here (there are plenty of those too, but that’s a topic for a separate blog post).
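To put "five 9s" in perspective, 99.999% availability leaves a strikingly small downtime budget. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope: downtime budget implied by 99.999% ("five 9s").
availability = 0.99999
minutes_per_year = 365.25 * 24 * 60            # ~525,960 minutes in a year
downtime_minutes = minutes_per_year * (1 - availability)
print(round(downtime_minutes, 1))              # about 5.3 minutes per year
```

That is all the unavailability such a system can afford in an entire year, across every deployment, upgrade, and failure.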
There are five things that are very difficult to get right in any large-scale distributed system built by a dev team of more than 50.
Dev is fun. Creating something out of nothing is, in my mind, the purest form of art. Ops is bloody. DevOps is great in theory, but very difficult in practice. Your most creative minds create; they don’t operate. Pager duty? Yuck.
The classic iterative model fails miserably at scale. “Let us first build the system using an RDBMS, and later we can swap the RDBMS for Cassandra.” It doesn’t work that way. First, the abstractions aren't clean enough to be switchable. More importantly, the application semantics that consume the persistence layer come to rely on the underlying system. To scale, one must anticipate scale. That is very hard.
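A minimal sketch of that leakage, with illustrative names (none of this is Apigee's actual code): application code written against a SQL-style store quietly comes to depend on arbitrary ad-hoc queries, which a Cassandra-style, partition-keyed store simply does not offer.

```python
class SqlStore:
    """Stand-in for an RDBMS: filtering on any column is cheap and easy."""
    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def query(self, predicate):
        # SQL lets callers filter on anything, so they do.
        return [r for r in self.rows if predicate(r)]


class PartitionedStore:
    """Stand-in for a Cassandra-style store: lookups only by partition key."""
    def __init__(self, key):
        self.key = key
        self.partitions = {}

    def insert(self, row):
        self.partitions.setdefault(row[self.key], []).append(row)

    def query_by_key(self, value):
        return self.partitions.get(value, [])
    # Note: there is no query(predicate). Full scans are off the table at scale,
    # so data must be modeled around its access paths up front.


# Application code written against SqlStore relies on ad-hoc filtering...
store = SqlStore()
store.insert({"org": "acme", "api": "orders", "calls": 10})
store.insert({"org": "acme", "api": "billing", "calls": 99})
busy = store.query(lambda r: r["calls"] > 50)   # fine on the RDBMS
```

"Switching" the store later is therefore a redesign of the data model and the calling code, not a one-line swap.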
Change at scale is not easy. One can lock down the system and satisfy the safety property, or so people think. In fact, there have been some excellent writeups about how locking down is actually antithetical to safety, but this is a difficult lesson to learn. Trial by fire isn’t easy. But the progress property is even more difficult. How does one innovate, and let one’s customers innovate, and yet keep things running? At Apigee, we make hundreds of changes every month, our customers deploy thousands of new APIs every month, and the infrastructure keeps changing underneath us. Yet guaranteeing availability, scale, and latency for our customers’ APIs is a discipline we have had to learn carefully.
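One common way (a generic industry pattern, not necessarily Apigee's process) to reconcile progress with safety is a staged rollout gated on observed health: expose the change to a sliver of traffic, watch an error budget, and roll back automatically if it is blown. A toy sketch, with made-up stages and thresholds:

```python
STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic on the new version
ERROR_BUDGET = 0.02                 # roll back if the canary exceeds this rate

def roll_out(observe_error_rate):
    """Advance through stages; return 1.0 on full rollout, None on rollback.

    observe_error_rate(fraction) stands in for monitoring: it reports the
    error rate seen while `fraction` of traffic runs the new version.
    """
    for fraction in STAGES:
        if observe_error_rate(fraction) > ERROR_BUDGET:
            return None             # automatic rollback: progress yields to safety
    return 1.0

# Healthy release: error rate stays under budget at every stage.
assert roll_out(lambda f: 0.001) == 1.0
# Bad release: caught while only 1% of traffic was ever exposed.
assert roll_out(lambda f: 0.10) is None
```

The point of the pattern is that change never has to stop; it only has to be observable and reversible.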
A common style among development teams is “let’s focus on features, and the cost will work itself out.” It won’t. We have found the opposite discipline helpful: ask each team what its baseline cost is (for example, if a team is building persistence on AWS, start with the AWS GB/month cost). Then do a layering, and state what each added piece of functionality costs. Without this, it is all jumbled, and that is no way to build distributed systems at scale.
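The layering exercise might look like the following sketch; every number here is made up for illustration (these are not real AWS prices or Apigee figures), but the shape is the point: each feature's cost increment becomes visible instead of being buried in one total.

```python
BASELINE_PER_GB_MONTH = 0.10          # hypothetical raw $/GB/month

LAYERS = [                             # each added feature as a cost multiplier
    ("3x replication",    3.0),
    ("daily backups",     1.3),
    ("secondary indexes", 1.5),
]

def cost_breakdown(gb):
    """Return (total monthly cost, per-layer increments) for `gb` of data."""
    cost = BASELINE_PER_GB_MONTH * gb
    increments = [("baseline storage", cost)]
    for name, multiplier in LAYERS:
        new_cost = cost * multiplier
        increments.append((name, new_cost - cost))
        cost = new_cost
    return cost, increments

total, increments = cost_breakdown(1000)
# baseline $100 -> replication +$200 -> backups +$90 -> indexes +$195
```

With the breakdown in hand, "do we really need secondary indexes?" becomes a priced question rather than a matter of taste.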
Managing dependencies between teams, of course, is not a technology problem, though technology helps. Clean APIs between teams' microservices help. But dependencies naturally arise, at the very least when some teams provide platform services, such as the core persistence services at Apigee. Furthermore, there are always tensions: commonality of skills (can one team use MySQL and another, Postgres?) and common look and feel (do you want your surface area to look like Frankenstein?).
“Architectural committees” are considered anathema. There is no magic wand here; it all depends on your culture and your skills. I am not going to recommend a preferred way. Suffice it to say that if you do not get this right, you cannot sustain your play.
Eventually, we will get into the top 10, but that is still a few years away, with many more lessons to be learned.