Observations on API and Mashup Management

API and Mashup Blog

Performance tuning

Since we deployed a major database migration last week, we've been continuing to tune the database for better performance. This has resulted in:

  • An analytics backlog from the two days following the migration (5/9 - 5/10). If you see a dip in your analytics data, it may be because of this. We're crunching the reports as fast as we can, but it may take another day or so to completely catch up.
  • On 5/10 there was an outage that lasted 1-2 minutes. This happened during the deployment of a bug fix. Going forward, we're investigating how we can better automate this process in an effort to reduce human error.
  • From 5/10 to 5/13, the database struggled under load, at times making it impossible to display analytics reports. Since Apigee is a highly available system, rest assured that we haven't lost any of this data, but there may be other hiccups like this as we continue performance tuning.

We strive to be as transparent as possible—ultimately we plan to have a comprehensive status page that provides up-to-date information about the service. But in the meantime, we'll continue to post these updates to the blog.

Planned Maintenance: Friday, May 7th

Tonight, starting at 9pm Pacific Time, we will be deploying some major architecture changes in order to improve Apigee's service and to keep pace with the rapid growth we've had in the last month since rolling out major new features. From our benchmarking, we expect this to take 4-5 hours, though possibly longer.

We've designed Apigee to be always available. PROXY TRAFFIC WILL NOT BE AFFECTED. All historic analytics data will be migrated, and we will continue to collect analytics data throughout the migration.

Two things will be unavailable during this migration:

  • Sign up will be disabled
  • Snapshots from the API Test Console will be disabled

So what are we doing? First, we're moving from Postgres to MySQL. Second, we're moving to bigger boxes to increase performance.

Questions? Concerns? Let us know in the comments or at support.apigee.com.

All Services Up; Previous Service Outage Postmortem

As we've written before, we've built Apigee to be highly available. However, our architecture is still somewhat vulnerable at the data center level. Last night Amazon Web Services had a power outage in an Availability Zone in the US East Region that, by sheer chance, took down Apigee's entire ServiceNet cluster (Sonoa's enterprise-grade API management solution). This affected the API traffic for all Apigee users.

  • At 2:23 AM PDT a power distribution unit malfunctioned, which brought down our instances at 2:32.
  • By 2:36, our engineering team was already scrambling to discover the extent of the problem, which took about 20 minutes due to the scale of the failure.
  • At 3:04 a work-around had been deployed and Apigee was back online.
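To make the durations in this timeline explicit, here is a small sketch that computes them from the clock times listed above. The event labels are paraphrased, the helper names are made up for illustration, and measuring downtime from 2:32 (instances down) to 3:04 (back online) is our reading of the timeline.

```python
from datetime import datetime

# Times (PDT) copied from the timeline above; labels are paraphrased.
EVENTS = [
    ("power distribution unit malfunctions", "2:23"),
    ("Apigee instances go down", "2:32"),
    ("engineering team investigating", "2:36"),
    ("work-around deployed, service restored", "3:04"),
]


def minutes_between(start, end):
    """Whole minutes between two same-day H:MM clock times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_first(events):
    """Pair each event label with the minutes elapsed since the first event."""
    first_time = events[0][1]
    return [(label, minutes_between(first_time, t)) for label, t in events]


for label, minutes in elapsed_since_first(EVENTS):
    print(f"+{minutes:>2} min  {label}")

# Downtime as seen by users: from instances going down to service restored.
downtime = minutes_between("2:32", "3:04")
print(f"Service downtime: {downtime} minutes")
```

By this reading, the service was unavailable for 32 minutes, and 41 minutes passed between the initial malfunction and recovery.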

In the end, thanks to quick work by the engineering team, Apigee was down for less time than the AWS outage lasted. Until now, Apigee had only been down for 5 minutes (also in this past week) during a planned server migration, a problem our scripts are now designed to avoid in the future.

This kind of failure is something we've known about and already designed a solution for; however, it's going to require some significant engineering work to implement. Our plan is to stripe our service across data centers. We expect to have this solution in place in the next 4-6 months. When that happens, any outage will fail over to the redundant data center, which will result in downtime of less than a minute. Keep in mind that Amazon Web Services only has a guaranteed uptime of 99.95% by region, making it impossible for us to exceed that figure while running in a single region.
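For a sense of what 99.95% means in practice, the sketch below converts an uptime percentage into a yearly downtime budget, and estimates the combined figure for two redundant locations. The function names are ours, and the second calculation assumes the locations fail independently and ignores failover time, which is an optimistic simplification rather than a description of our actual setup.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an uptime guarantee over a period."""
    return period_hours * (100.0 - uptime_percent) / 100.0


def combined_uptime_percent(uptime_percent, locations):
    """Uptime if service is lost only when every location is down at once.

    Assumes failures are independent (optimistic) and ignores failover time.
    """
    p_down = (100.0 - uptime_percent) / 100.0
    return 100.0 * (1.0 - p_down ** locations)


single = allowed_downtime_hours(99.95)
print(f"99.95% uptime allows about {single:.2f} hours of downtime per year")

two = combined_uptime_percent(99.95, locations=2)
print(f"Two independent locations: about {two:.6f}% uptime")
```

A 99.95% guarantee leaves room for a bit over four hours of downtime per year in a single region, which is why redundancy across locations is the only way to do better.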

Providing bullet-proof service is of critical importance to our users. If you have any questions, concerns, or suggestions about availability, please let us know in the comments or at http://support.apigee.com. We apologize for any trouble this may have caused.

Addendum: We also discovered that we weren't properly monitoring the service that handles our SSL traffic, resulting in a longer downtime of roughly four hours for that service. Thanks to Jonathan, who reported the outage in the support forum. This was an oversight, and the service has since been added to our Nagios monitoring system.
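As an illustration of the kind of check that catches this, here is a minimal sketch of a Nagios-style probe that attempts a TLS handshake and maps the result to the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). This is not the check we actually deployed; the function name, the `connect` parameter, and any hostname you pass in are assumptions for the example.

```python
import socket
import ssl

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tls(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Attempt a TLS handshake and return (exit_code, message).

    `connect` is injectable so the failure path can be exercised
    without touching the network.
    """
    context = ssl.create_default_context()
    try:
        with connect((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return OK, f"TLS OK - {host}:{port} negotiated {tls.version()}"
    except OSError as exc:
        # Covers refused connections, timeouts, DNS failures, and
        # ssl.SSLError (a subclass of OSError), including bad certificates.
        return CRITICAL, f"TLS CRITICAL - {host}:{port}: {exc}"
```

A plugin wrapper would print the message on one line and exit with the returned code, which is how Nagios decides whether to raise an alert.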

Addendum 2: We also have plans to provide a status page for Apigee soon. We'll post details as we get closer.

Addendum 3: At 5:39 PDT today AWS reported a second outage affecting the very same boxes (our systems registered the outage at 5:32). We scrambled to recover, bringing full service back online by 5:52. All systems are currently working properly.

Recent Apigee upgrades and fixes

Earlier this week we released significant UI upgrades for Apigee. We also fixed some bugs identified in our feedback forum.

(We should mention that you won't lose API messages during our planned upgrades - we use an HA architecture so we don't have to schedule downtime.)

The big one is that the cool screens in the demo video - the animated API setup and proxy rate limiting dialogues (as in this video) - are now live for everyone. They include some usability tweaks identified by our preview members - thanks.

We also fixed some bugs you found in the API Requests Table and with editing API nicknames.

Here's an overview of the UI highlights, and below are details on more features coming soon.

What's coming next

We're prioritizing our backlog directly based on your feedback. Coming up soon:

  • Support for HTTPS & SSL. Thanks to everyone who commented on the solution proposal.
  • Proxies will come in two flavors, consumer and provider, and we're including Apigee status info in the response header.

Finally, we're thrilled when one of our community members suggests a better solution than we had planned. We love this idea to make IP addresses human friendly.

Thanks again and please keep the ideas coming!