How Apigee Handles High-Volume Customer Events
Every experience we have with our customers is a learning experience; we benefit and learn, and, in turn, fine-tune our offerings. In the first post in this series, we examined a recent customer security breach, how Apigee handled it, and what we learned. Here, we’ll take a look at how we prepare for and handle high-volume customer events, the kind that are particularly common among retailers during the holiday shopping season.
In our cloud offering, we run tens of billions of API calls per month that access our customers’ back-end systems. Our on-premises software probably handles an equally large number. These volumes have shaped the choices we've made regarding architectural principles, building blocks, multi-tenancy, data architectures, operational tools, and a multitude of other big and small capabilities.
One experience we’d like to share is how we handled Black Friday for our customers. We have many large retailers who often describe Black Friday as their “Super Bowl.” Retailers did well from a mobile perspective: at the peak, we were handling roughly 50 million calls per hour.
We didn’t suffer any performance degradation in the API layer, let alone any outage. In theory, the software is always either “stateless” (traffic from a failed node can be re-routed to another), or whatever is “stateful” (in our case, keys, tokens, and policies) is replicated in such a way as to handle a sufficiently large number of failures. As you all know, the gap between theory and practice is “reality.”
There were three principles around which we organized our monitoring. We were able to implement the first two, but couldn’t get to the third one.
Converting noise to signal to avoid information overload
Our systems are flowing with data—aren’t yours? What we needed for Black Friday was to convert the noise and chatter to signals, and enable a one-stop way to focus on the right signals when problems happened (Murphy’s Law was never far from our minds). While every part of our system generates data, we have three systems that collect data: Riemann for monitoring data, Sumo Logic for logging alerts, and Edge Analytics for historic and medium-term dashboards.
We had standing queries on these systems to generate alerts. On Riemann, it might be a signal generator: “memory running out;” on Sumo it might be “too many 5XX errors in the last few minutes;” and in Edge Analytics it might be “back-end latency exceeds last week’s average by more than 60%.” From a purely operational monitoring perspective, we were mostly interested in the results of these standing queries and alerts—we deemed the raw information to be clutter.
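To make the idea of a standing query concrete, here is a minimal sketch in Python of the kind of sliding-window alert described above (“too many 5XX errors in the last few minutes”). The class name and thresholds are illustrative, not from our actual tooling:

```python
from collections import deque
import time


class StandingQuery:
    """Hypothetical sketch of a standing query: fire an alert when too
    many 5XX responses are seen within a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()  # timestamps of observed 5XX responses

    def observe(self, status_code, now=None):
        """Feed one response status; return True if the alert fires."""
        now = time.time() if now is None else now
        if 500 <= status_code < 600:
            self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] < now - self.window_seconds:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

In a real deployment the equivalent query runs continuously inside the monitoring system rather than in application code, but the windowing logic is the same.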
All relevant information in one place
Jumping from system to system is time-consuming and frustrating, so we needed to make sure that, for DevOps, all the relevant information was available in one place. To accomplish this, we built several new tools: Pinpoint, G-Dash (global dashboard), and T-Dash (don’t ask us what the “T” stands for—let’s just say it’s someone’s vanity at work).
Pinpoint is a searchable database that gives us fingertip access to deep detail about infrastructure inventory combined with real-time data on component and back-end API health.
G-Dash gives us auto-refreshing, wide-ranging system health metrics, from request/response counts to CPU, memory, and connections across all components that deliver API runtime traffic based on stats we’ve collected.
T-Dash gives us auto-refreshing system health metrics on request rate, latency, and errors specifically from ELBs, based on metrics collected from AWS CloudWatch.
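As a rough sketch of what a dashboard like T-Dash computes, the snippet below derives an ELB 5XX error rate from CloudWatch-style datapoints. It assumes the datapoints have already been fetched (e.g., for the `RequestCount` and back-end 5XX metrics) and have the `Sum` statistic; the function name and exact metric names are our illustration, not the tool’s actual internals:

```python
def elb_error_rate(request_datapoints, error_datapoints):
    """Compute a 5XX error rate from CloudWatch-style datapoints.

    Each datapoint is a dict carrying a 'Sum' key, in the shape
    CloudWatch returns when the 'Sum' statistic is requested for an
    ELB's request-count and 5XX-count metrics over a time range.
    """
    requests = sum(dp["Sum"] for dp in request_datapoints)
    errors = sum(dp["Sum"] for dp in error_datapoints)
    # Guard against an empty or idle period.
    return errors / requests if requests else 0.0
```

A dashboard would evaluate this per ELB per refresh interval and color the cell when the rate crosses a threshold.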
We also realized that HipChat is very useful for conducting a dialog between multiple people, so we took pains to create multiple HipChat rooms that avoided information clutter for a set of folks discussing a specific customer. We also created a “catch-all” HipChat room for all other issues, or even for people to poke fun at others who might be lurking. From a tooling perspective, we integrated some signals, including Sumo Logic alerts, directly into HipChat.
The importance of correlating signals
Debugging problems requires “correlations” between different signal streams, and auto-correlations within one signal stream. An out-of-memory error that coincides with an increase in traffic is different from one that happens independently. And a sequence of signals within one stream (or across multiple streams) is more indicative of problems than viewing each signal as an independent event. We could not implement this in time.
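To illustrate the idea we didn’t get to build, here is a minimal sketch of cross-stream correlation: given two streams of event timestamps (say, out-of-memory alerts and traffic-spike alerts), find the pairs that occur within a window of each other. The function and window size are hypothetical:

```python
def correlate(events_a, events_b, window):
    """Return (ta, tb) pairs where an event in stream A falls within
    `window` seconds of an event in stream B.

    Both streams are lists of timestamps, assumed sorted ascending.
    """
    pairs = []
    j = 0
    for ta in events_a:
        # Skip B events too old to ever match this or later A events.
        while j < len(events_b) and events_b[j] < ta - window:
            j += 1
        # Collect B events inside [ta - window, ta + window].
        k = j
        while k < len(events_b) and events_b[k] <= ta + window:
            pairs.append((ta, events_b[k]))
            k += 1
    return pairs
```

An out-of-memory alert that correlates with a traffic spike points at load, while an uncorrelated one points at a leak or a bad node; that distinction is exactly what this third principle was meant to surface automatically.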
With careful, round-the-clock monitoring, our customers were successful on Black Friday, and that’s all that matters. We’ll put some of these tools into GitHub, and of course we’ll continue to provide more details on other aspects of our engineering.
Apigee is hiring. If you’re an engineer excited about tackling challenging problems, visit our careers page.