Rearchitecting Analytics @ Apigee
When we first started to build out the analytics layer for Apigee, we made a set of important decisions, based on our team’s expertise, that turned out to be pretty good:
- Our team was very familiar with Postgres (and so was I—let’s say I am biased :) ).
- We were able to do a good multi-tenant design. Customers got their own tables—all on a shared database infrastructure.
- We knew how to balance writes and reads, and built what turned out to be a wonderful, index-less model for the key API traffic “fact table.” This also led to very predictable query performance, which was only dependent on the time range requested in the query.
- We leveraged Postgres’s parent-child capabilities to easily delete older records without incurring any performance penalties.
- We set up both aggregation jobs and Amazon Redshift to enable fast query responses.
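To make the per-customer tables and the parent-child deletion trick concrete, here is a minimal sketch in Python that only generates the SQL text. It is not our actual schema: the table naming scheme, the column names (`ts`, `api_proxy`, `response_ms`), the daily child-table granularity, and the helper functions are all assumptions made for illustration. It does rely on Postgres's standard `INHERITS` clause, where dropping a child table removes a whole time slice without row-by-row deletes.

```python
from datetime import date, timedelta


def fact_table(tenant: str) -> str:
    # Hypothetical naming scheme: one parent fact table per customer.
    if not tenant.isidentifier():
        raise ValueError(f"unsafe tenant name: {tenant!r}")
    return f"{tenant}_api_fact"


def child_table(tenant: str, day: date) -> str:
    # Hypothetical naming scheme: one child table per customer per day.
    return f"{fact_table(tenant)}_{day:%Y%m%d}"


def create_parent_sql(tenant: str) -> str:
    # No indexes on purpose: queries are bounded by time range instead.
    return (
        f"CREATE TABLE {fact_table(tenant)} ("
        "ts timestamptz NOT NULL, api_proxy text, response_ms integer);"
    )


def create_child_sql(tenant: str, day: date) -> str:
    # The CHECK constraint lets the planner skip children outside the
    # requested time range when constraint exclusion is enabled.
    next_day = day + timedelta(days=1)
    return (
        f"CREATE TABLE {child_table(tenant, day)} ("
        f"CHECK (ts >= '{day}' AND ts < '{next_day}')"
        f") INHERITS ({fact_table(tenant)});"
    )


def drop_expired_sql(tenant, existing_days, today, retention_days):
    # Dropping a whole child table is cheap compared with DELETE.
    cutoff = today - timedelta(days=retention_days)
    return [
        f"DROP TABLE {child_table(tenant, d)};"
        for d in sorted(existing_days)
        if d < cutoff
    ]
```

With a 14-day retention window, calling `drop_expired_sql("acme", days, date(2024, 1, 31), 14)` would emit one `DROP TABLE` statement for each child table dated before January 17.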
We never got to sharding. We had a design, but we threw more hardware at the problem instead and avoided building a parallel, multi-tenant, sharded Postgres system. This was the core design for our analytics capability over the past few years.
And then came the obvious issues:
- As we grew, we needed to scale. Scale requires sharding.
- Redshift prevents our on-premises customers from getting the same benefit, and it ties us to Amazon.
- DB aggregations are great, but SQL cannot do everything (did I really just say that? :) ).
- We started to have a ton of downstream uses, so tying the infrastructure only to analytics would not do. We needed to separate the pipeline (collect data, clean it, lightly aggregate it) from its downstream consumers: custom analytics, security analytics, and mobile analytics.
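The last point, separating the shared pipeline from its downstream uses, can be sketched in a few lines of plain Python. This is only an illustration of the idea, not our implementation: the event fields (`tenant`, `api`, `status`), the function names, and the two example consumers are all hypothetical.

```python
from collections import Counter


def clean(events):
    # Drop records that are missing the fields every consumer relies on.
    return [
        e for e in events
        if e.get("tenant") and e.get("api") and isinstance(e.get("status"), int)
    ]


def lightly_aggregate(events):
    # Roll raw calls up to (tenant, api, status class) counts.
    counts = Counter(
        (e["tenant"], e["api"], e["status"] // 100) for e in events
    )
    return [
        {"tenant": t, "api": a, "status_class": f"{c}xx", "count": n}
        for (t, a, c), n in sorted(counts.items())
    ]


def run_pipeline(raw_events, consumers):
    # The pipeline runs once; each downstream use reads the same rows.
    rows = lightly_aggregate(clean(raw_events))
    return {name: consume(rows) for name, consume in consumers.items()}


def total_calls(rows):
    # Example consumer: a basic traffic dashboard.
    return sum(r["count"] for r in rows)


def error_rows(rows):
    # Example consumer: anything that only cares about failures.
    return [r for r in rows if r["status_class"] in ("4xx", "5xx")]
```

Adding a new downstream use means adding one more entry to `consumers`; the collect, clean, and aggregate steps stay untouched.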
We needed a different architecture in order to grow with our customers’ needs. As we explored, we noticed that Apache Spark had come a long way. We decided to shift to the following architecture:
All our cloud customers were recently onboarded to this architecture. The new platform provides important capabilities for all of our users:
- Data scalability: The platform scales fairly naturally to handle more data from API traffic, from any deployment pattern. We can scale to handle petabytes of data.
- Analytics scalability: Likewise, the platform will scale naturally as more processing, queries, or analytics are added. New processing capacity can be added independently without affecting the existing workload.
- Dynamic scalability: The platform adjusts to load and data variability more gracefully. One-time or regular patterns of variability in data volumes can each be accommodated.
- Data and query availability: Robustness is built into the different components of the platform to ensure very high availability.
Most importantly, this new architecture provides the foundation for extending the variety of analytics we can offer to our customers. The diagram above illustrates that we can now provide basic dashboards, custom reports, and a specialized security dashboard using bot detection analytics.
Going forward, we intend to provide: business analytics by combining API traffic data with other relevant information like products and plans; operational analytics by combining event and periodic metrics data to track the availability of resources and endpoints; developer analytics that narrows down objects of interest to developers; and extensions to support customers’ analytics.
These are exciting new capabilities that should provide a rich palette of analytics for your different user roles and help you obtain detailed insights from the data.