Enterprise Analytics: The Art of Knowing the Known Unknowns
There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know. ~ Donald Rumsfeld
The topic for today is the art of knowing the known unknowns - applied to enterprise analytics, not war! After all, we are here to create value, not destroy it. Quite simply, known unknowns are enterprise metrics that we know are important, but we just do not know what their value is at any given moment in time. In other words, KPIs! (You might say, "Anant, couldn't you have said KPIs without getting all cutesy on me?")
It is not that the list of KPIs never changes; it just changes infrequently. Typically, KPIs fix "metrics" (things that are measured) over a set of independent dimensions. Almost always, in the app economy, one of those dimensions is time.
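To make that concrete, here is a minimal Python sketch of a KPI as a metric (hourly traffic count) computed over two dimensions, API name and time. The record layout and field names like `api` and `ts` are purely illustrative, not our actual schema:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw API-call records; field names are illustrative only.
calls = [
    {"api": "payments", "ts": datetime(2013, 5, 1, 10, 15), "error": False},
    {"api": "payments", "ts": datetime(2013, 5, 1, 10, 40), "error": True},
    {"api": "search",   "ts": datetime(2013, 5, 1, 11, 5),  "error": False},
]

# A KPI fixes a metric (here, call count) over dimensions (API name, hour).
kpi = defaultdict(int)
for call in calls:
    hour = call["ts"].replace(minute=0, second=0, microsecond=0)
    kpi[(call["api"], hour)] += 1

for (api, hour), count in sorted(kpi.items()):
    print(api, hour.isoformat(), count)
```

The point is simply that the shape of the question ("calls per API per hour") is fixed up front; only the values change as traffic flows in.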
At Apigee, we deliver these KPIs to our developers by building a custom schema and a custom app built on a sharded Postgres system. Why deliver KPIs through a relational database system as opposed to, say, Hadoop? There were three things that led us to this decision.
First, there is a fixed part to any data relating to an API call - timestamp, IP address, app name, API name, resource, response time, error, and so on. Databases are great for storing and accessing these known attributes.
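As a sketch of what that fixed part might look like as a schema (column names are illustrative, and I am using in-memory SQLite here purely so the example is self-contained - our actual system is Postgres):

```python
import sqlite3

# Illustrative fixed schema for the known attributes of an API call,
# demonstrated with SQLite so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE api_calls (
        ts           TEXT NOT NULL,   -- timestamp of the call
        ip_address   TEXT,
        app_name     TEXT,
        api_name     TEXT,
        resource     TEXT,
        response_ms  INTEGER,         -- response time
        error        INTEGER          -- 0/1 flag
    )
""")
conn.execute(
    "INSERT INTO api_calls VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("2013-05-01T10:15:00", "10.0.0.1", "mobile-app",
     "payments", "/charge", 42, 0),
)
row = conn.execute("SELECT api_name, response_ms FROM api_calls").fetchone()
print(row)
```

Every column is known in advance, which is exactly the case relational databases are built for.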
Second, there is a variable part specific to each customer/developer, and a variable part to the API and its payload. However, if we create a multi-tenancy model wherein each tenant gets its own table, the same schema capabilities can be extended to the variable part - we just cannot mix and match different tenants in the same table. We deemed that the performance improvement we would get from such fine-grained multi-tenancy did not justify the very high code complexity it creates. Maybe Salesforce.com can afford it, but we definitely could not.
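For the curious, here is a toy sketch of what table-per-tenant multi-tenancy looks like. The table and column names are hypothetical, and again I use in-memory SQLite only to keep the example runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sketch: each tenant gets its own table, so the variable,
# tenant-specific attributes still live in a real schema. Names are
# illustrative, not our actual design.
def create_tenant_table(conn, tenant, extra_columns):
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in extra_columns)
    # One table per tenant; rows from different tenants never mix.
    conn.execute(
        f"CREATE TABLE calls_{tenant} (ts TEXT, api_name TEXT, {cols})")

create_tenant_table(conn, "acme",
                    [("order_id", "TEXT"), ("cart_value", "REAL")])
create_tenant_table(conn, "globex", [("session_id", "TEXT")])

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Notice the cost: every schema change, migration, and query now has to be tenant-aware, which is the code complexity I mentioned above.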
Third, because we do not allow any ad-hoc SQL to be submitted to the query layer (all SQL is hidden behind our APIs), sharding is a much easier problem, and therefore the automatic sharding of Hadoop does not compensate for the lack of schema support that we need in the first two points above.
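Since the tenant is always known at the API layer, routing a query to its shard can be as simple as a stable hash. A hypothetical sketch (the shard names and the choice of SHA-1 are illustrative, not our actual scheme):

```python
import hashlib

SHARDS = ["pg-shard-0", "pg-shard-1", "pg-shard-2", "pg-shard-3"]  # hypothetical

def shard_for(tenant_id: str) -> str:
    # A stable hash of the tenant id picks the shard. Because every query
    # arrives through our API layer (never as ad-hoc SQL), the tenant id
    # is always available, and routing stays this simple.
    digest = hashlib.sha1(tenant_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

With ad-hoc SQL, by contrast, a query could touch arbitrary tenants and tables, and the system would need the kind of automatic, query-aware sharding that Hadoop provides.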
After having decided on Postgres, there were a number of technical choices to make, such as schema design, indexing (or the lack thereof), and so on. More on that in a separate blog, but for the curious, I have blogged about it in the past.