
Managing Big Data

Feb 24, 2012

In my recent Data Analytics & APIs post, I talked about analytics as a peer to your API, and about using analytics to gain insights that drive the success of your API strategy.

An API strategy quickly gets an enterprise to a place where they have big data sets. As defined by Wikipedia, "Big Data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools."

There is no question that Big Data brings big opportunities for valuable insights. It also brings challenges: how to store the data, how to manage the growth of data sets, and how to analyze the data and glean the most valuable insights from it.

Many people have commented, "big data is great, but before I get an ounce of value out of it, I am stuck with the cost of storing and managing it."* This is an understandable pain. However, there are some simple ways to manage it. Once the pain is managed, the gains from Big Data are much easier to achieve.

Let’s take a look at an enterprise with APIs that are getting requests at about 1000 TPS. Even if, conservatively, each request generates 1K of data, we are looking at ~85GB/day, or ~30TB/year, for just the raw information, before any replication, backups, or derived data sets. While people tend to focus on the “storage” costs of managing this kind of information, it is almost always the people costs that dominate. Then add the complexity of running analytical tasks, and the simple “systems management” problem (not the byte storage costs) will tend to be overwhelming.
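If you want to redo this estimate with your own traffic numbers, here is a minimal Python sketch. The 1000 TPS and 1K-per-request inputs come from the example above; the function name, the assumption of a perfectly steady request rate, and the use of decimal units (1K = 1000 bytes) are mine, for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def raw_volume_bytes(tps, bytes_per_request, days=1):
    """Raw bytes generated over `days`, assuming a steady request rate."""
    return tps * bytes_per_request * SECONDS_PER_DAY * days


# Inputs from the example: 1000 requests/second, 1K (1000 bytes) each.
per_day_gb = raw_volume_bytes(1000, 1000) / 1e9
per_year_tb = raw_volume_bytes(1000, 1000, days=365) / 1e12

print(f"{per_day_gb:.1f} GB/day, {per_year_tb:.1f} TB/year")
# prints "86.4 GB/day, 31.5 TB/year"
```

This counts only the raw request data. Replicas, backups, and multi-year retention multiply the total.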

I have argued in the past that APIs benefit from cloud deployment.  Today I will argue that the analytics on APIs also benefits tremendously from cloud deployment.

People Costs

Let’s first tackle the people costs. Managing tens or hundreds of TB is not for the faint of heart. Clearly, when you have such a large volume, you would like to use the cheapest storage, which is often significantly less reliable than SAN storage. The storage needs to be backed up. The hundreds of servers need to be monitored. When bad things happen (and when is it that bad things never happen? ;-)), corrective action needs to take place. Many of the Big Data technologies are immature and have a fast release/patch cycle, so someone has to worry about patch management too. The list goes on and on.

One soon realizes that it is often better to have some experts “manage” this for you, typically through a cloud-delivered service. The service provider has in-house data management experts, and might even have in-house data scientists. This allows you to focus on the business you have to run, and on the value that Big Data will deliver, rather than on all the blocking and tackling that needs to be done to manage that data.

API Burstiness

APIs by nature are “bursty” in use. How does one provision the Big Data infrastructure needed to support this: over-provision for the peak, or under-provision for the average?

Instantaneous provisioning of new data and storage capacity is not an easy model for any enterprise to execute on, let alone a small to medium-sized enterprise. Failure to provision leads to enterprise amnesia#. The data is either lost forever, or can only be analyzed days or weeks after the fact, when the new systems are provisioned.
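To make the peak-versus-average dilemma concrete, here is a rough Python sketch. The hourly traffic trace is entirely hypothetical (a quiet baseline with a four-hour burst), and the model makes a simplifying assumption: anything above the provisioned capacity is simply lost, with no buffering or queuing.

```python
def provisioning_tradeoff(hourly_tps):
    """Compare provisioning for peak vs. average on an hourly TPS trace.

    Returns (peak, average, utilization_if_peak, lost_fraction_if_average).
    Assumes traffic above capacity is dropped (no buffering).
    """
    peak = max(hourly_tps)
    average = sum(hourly_tps) / len(hourly_tps)
    # Provision for peak: how much of the paid-for capacity is actually used.
    utilization = average / peak
    # Provision for average: share of all traffic that exceeds capacity.
    overflow = sum(max(0, tps - average) for tps in hourly_tps)
    lost_fraction = overflow / sum(hourly_tps)
    return peak, average, utilization, lost_fraction


# Hypothetical day: 20 quiet hours at 400 TPS, 4 burst hours at 4000 TPS.
trace = [400] * 20 + [4000] * 4
peak, average, utilization, lost = provisioning_tradeoff(trace)

print(f"peak={peak} TPS, average={average:.0f} TPS")
print(f"provision for peak: {utilization:.0%} utilized")
print(f"provision for average: {lost:.0%} of data lost")
```

On this made-up trace, provisioning for the peak leaves three quarters of the capacity idle, while provisioning for the average loses half of the data. Real traffic and real buffering will change the numbers, but the shape of the trade-off is the same.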

It therefore makes perfect sense to look at a cloud delivery model for your API analytics.

I am not suggesting that there aren’t issues to deal with as you make these decisions. For example, data security is always top of mind. However, as cloud-delivered services mature, these issues become less and less bothersome. The benefits are that moving your API analytics to the cloud involves very little pain, and that you no longer have to worry about the headaches of managing the infrastructure, so you can focus on the gains.


* Hadoop Has Promise but Also Problems, Jessica E. Vascellaro, Wall Street Journal
# Enterprise Amnesia: Organizations Have Lost Their Minds, Jeff Jonas

