How Usergrid Scaled to Over 10,000 Requests per Second
We recently did some testing on Apache Usergrid, the open-source Backend as a Service, and found that it can reach 10,000 transactions per second, and can scale horizontally. This means the upcoming Usergrid 2 release is potentially the most scalable open-source BaaS available. Here's the story of how we got there.
What is Usergrid?
Apache Usergrid is a software stack that enables you to run a BaaS that can store, index, and query JSON objects. It also enables you to manage assets and provide authentication, push notifications, and a host of other features useful to developers—especially those working on mobile apps.
The project recently graduated from the Apache Incubator and is now a top-level project of the Apache Software Foundation (ASF). Usergrid also comprises the foundation for Apigee’s API BaaS product, which has been in production for three years now.
What’s new in Usergrid 2?
Usergrid 1 used Cassandra for all persistence, indexing, query, and graph relationships. The index and query engine wasn’t performing well, however, and was quite complex and difficult to maintain. Usergrid 2 provides a complete new persistence, index, and query engine and the index and query features are provide by ElasticSearch. This enables us to delete lots of code and trade up to a great search engine. Additionally, separating key-value persistence from index/query allows us to scale each concern separately.
As the architecture of Usergrid changed drastically, we needed to have a new baseline performance benchmark to make sure the system scaled as well as, if not better than, it did before.
Our testing framework and approach
The Usergrid team has invested a lot of time building repeatable test cases using the Gatling load-testing framework. Performance is a high priority for us and we need a way to validate performance metrics for every release candidate.
As Usergrid is open source, so are our Usergrid-specific Gatling scenarios, which you can find here: https://github.com/apache/usergrid/tree/two-dot-o-dev/stack/loadtests
Usergrid application benchmark
One of our goals was to prove that we had the ability to scale more requests per second with more hardware, so we started small and worked our way up.
As the first in our series of new benchmarking for Usergrid, we wanted to start with a trivial use case to establish a solid baseline for the application. All testing scenarios use the HTTP API and test the concurrency and performance of the requests. We inserted a few million entities that we could later read from the system. The test case itself was simple. Each entity has a UUID (universally unique identifier) property. For all the entities we had inserted, we randomly read them out by their UUID:
First, we tried scaling the Usergrid application by its configuration. We configured a higher number of connections to use for Cassandra and a higher number of threads for Tomcat to use. This actually yielded higher latencies and system resource usage for marginally the same throughput. We saw better throughput when there was less concurrency allowed. This made sense, but we needed more, and immediately added more Usergrid servers to verify horizontal scalability. What will it take to get to 10,000 RPS?
We started increasing the number of concurrent clients and adding more Usergrid servers. Once we got 10 Usergrid servers against our cluster of 6 Cassandra nodes, we noticed that our throughput increase was flattening and latencies were increasing. The Usergrid servers were fine on memory usage and CPU usage was starting to increase slowly.
It was time to see if Cassandra was keeping up. As we scaled up the load we found Cassandra read operation latencies were also increasing. Shouldn't Cassandra handle more, though? We observed a single Usergrid read by UUID was translating to about 10 read operations to cassandra. Optimization #1: reduce the number of read operations from Cassandra on our most trivial use case. Given what we know, we still decided to test up to a peak 10,000 RPS in the current state.
The cluster was scaled horizontally (more nodes) until we needed to vertically scale (bigger nodes) Cassandra due to high CPU usage. We stopped at 10,268 Requests Per Second with the following:
- thirty-five c3.xlarge Usergrid servers running Tomcat
- nine c3.4xlarge Cassandra nodes doing roughly100k operations/second
By this point numerous opportunities for improvement were identified in the codebase, and we had already executed on some. We fully expect to reach the same throughput with much less infrastructure in the coming weeks. In fact, we've already reached ~7,800 RPS with only 15 Usergrid servers since our benchmarking.
As part of this testing, not only did we identify code optimizations that we can quickly fix for huge performance gains, we also learned more about tuning our infrastructure to handle high concurrency. Having this baseline gives us the motivation to continually improve performance of the Usergrid application, reducing the cost for operating a BaaS platform at huge scale.
This post is just the start of our performance series. Stay tuned, as we’ll be publishing more results in the future for the following Usergrid scenarios:
- Query performance - this includes complex graph and geo-location queries
- Write performance - performance of directly writing entities as well as completing indexing
- Push notification performance - this is a combination of query and write performance
Image: Lara/The Noun Project
The following components are used by the Usergrid application, with the associated versions used for our benchmarking:
- Tomcat 7.0.62 where the Usergrid WAR file is deployed
- Cassandra 2.0.15 with Astyanax client
- Elasticsearch 1.4.4 (not utilized in these tests)
As part of benchmarking, we wanted to ensure that all configurations and deployment scenarios exactly matched how we would run a production cluster. Here are the main configurations recommended for production use of Usergrid: