
Predictive Modeling with Bayesian Statistics and Behavior Graphs

Feb 24, 2015

In a previous post on Apigee Insights, our predictive analytics platform, we discussed how behavior graphs characterize the sequences of events drawn from various channels of customer engagement. We introduced the concept of the GRASP (graph and sequence processing) data structure as a means of representing time-stamped event streams and a query language for examining event sequences, at scale. Now we’ll elaborate on the application of this technology for predictive modeling.

What kind of predictive models?

Predictive models of customer behavior are needed for recommendations (for example, to determine specific offers for an individual) and for targeting (to define a group of customers for a specific offer). These two types of models have been used in many marketing and customer experience management scenarios. Previously, we focused on describing customer behavior using GRASP; now we'll discuss how GRASP enables us to create precise predictive models.

Why Naïve Bayes?

Traditional predictive modeling approaches typically analyze customer activity using variables or features that represent the profile of the customer (demographic data, for example) and an aggregated representation of their prior behavior (say, the number and type of products purchased by the customer in a prior seven-day period).

Customer intent is modeled using a supervised learning algorithm, which uses historical data to learn patterns of user behavior based on the variables mentioned above. The Naïve Bayesian algorithm, a statistical approach based on Bayes' theorem for the probability of event occurrence, has gained popularity because it is suitable for classifying (or predicting) multiple outcomes, each associated with a propensity score. The Bayesian approach is robust and less likely to find false patterns in noisy data ("over-fit" the data) than some other approaches, including logistic regression.

Naïve Bayes with GRASP

When the feature set or variables used for the predictive model are based on aggregations of customer behavior, many different aggregations may need to be examined. The choice among them depends on the intuition of the analyst; for example, the analyst typically would decide whether the counts of purchases over the past one-, three- or seven-week period are most indicative of the customer’s propensity to buy a product. Apigee Insights uses the Naïve Bayesian classifier, and considers not only the customer profile variables, but also sequences of customer activity.

Consider this example: a customer views a series of products, adds some to the shopping cart, and then buys some or all of them. GRASP enables us to store snippets of customer activity, and find sequences that are common to, say, a threshold minimum number of customers. These sequences are called “microclusters” and are indicative of sequences of activity that other customers might undertake in the future. These event-based models have proven to be precise in several scenarios.
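To make the idea of a threshold on shared sequences concrete, here is a minimal Python sketch. It is not GRASP's actual algorithm, which this post does not spell out; it assumes a sequence is a contiguous window of events, and the function name `frequent_sequences`, the event labels, and the customer names are all hypothetical.

```python
from collections import Counter

def frequent_sequences(histories, length=2, min_customers=2):
    """Return contiguous event sequences shared by at least min_customers customers."""
    support = Counter()
    for events in histories.values():
        # Use a set so each sequence is counted once per customer,
        # which makes the support a customer count, not an occurrence count.
        windows = {tuple(events[i:i + length])
                   for i in range(len(events) - length + 1)}
        support.update(windows)
    return {seq: n for seq, n in support.items() if n >= min_customers}

# Hypothetical activity streams for three customers.
histories = {
    "alice": ["view", "view", "add_to_cart", "buy"],
    "bob": ["view", "add_to_cart", "buy"],
    "carol": ["view", "view", "leave"],
}
microclusters = frequent_sequences(histories, length=2, min_customers=2)
```

With this toy data, three two-event sequences are each shared by two customers, while ("view", "leave") appears for only one customer and is dropped. Raising `min_customers` makes the resulting feature set smaller and less noisy.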

Implementing Bayesian event-based predictive models

At the core of the Bayesian approach are probabilities of events and conditional probabilities, which are probabilities of events that depend upon other events. In the equation below, P(E) denotes the probability of an event E occurring, and P(E|D) denotes the conditional probability of event E occurring given that we have observed event D.

The Bayes rule is expressed as follows:

$$P(E \mid D) = \frac{P(D \mid E)\,P(E)}{P(D)}$$

The space of events D can be computed using the combination of customer profile data and event sequences that occur somewhat frequently (it's tunable using a threshold). The event E represents a desired outcome—for example, a product purchase.
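As a quick numeric check of the Bayes rule, the following Python sketch estimates P(E|D) from counts. The counts are invented for illustration, and `bayes_posterior` is a hypothetical helper, not part of Apigee Insights.

```python
def bayes_posterior(p_d_given_e, p_e, p_d):
    """Bayes rule: P(E|D) = P(D|E) * P(E) / P(D)."""
    return p_d_given_e * p_e / p_d

# Hypothetical counts: E = "purchased", D = "viewed the product page".
total_customers = 1000
purchased = 100            # customers for whom E occurred
viewed = 400               # customers for whom D occurred
viewed_and_purchased = 80  # customers for whom both occurred

posterior = bayes_posterior(
    p_d_given_e=viewed_and_purchased / purchased,  # P(D|E) = 0.8
    p_e=purchased / total_customers,               # P(E)   = 0.1
    p_d=viewed / total_customers,                  # P(D)   = 0.4
)
```

The result is about 0.2, which matches the direct estimate `viewed_and_purchased / viewed`. With a single observed event D, the rule only restates the counts; it becomes useful once D combines many features.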

Imagine a customer of a given demographic with a history of having viewed some products; for this customer, you can now build models that estimate the propensity that her next action will be to:

  • add a product to a cart (in other words, model the customer’s interest in a product)
  • buy some product (distinguish browsers from buyers)
  • buy a specific product (compare the customer’s propensity to buy diapers versus her propensity to buy a pair of shoes)

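The Bayes rule for “add to cart” might look like this:

$$P(\text{add to cart} \mid \text{site visit}, \text{age}) = \frac{P(\text{site visit}, \text{age} \mid \text{add to cart})\,P(\text{add to cart})}{P(\text{site visit}, \text{age})}$$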

Because many factors can be involved, computing the probability that events occur together can quickly become unruly. To simplify, we make a very strong assumption—the “Naïve Bayes” assumption.

The “naïve” part in this approach stems from an assumption that the act of visiting a site is conditionally independent of the shopper’s age, given that the shopper added an item to the cart. In other words, conditional independence implies that the event representing the site visit and the variable representing the shopper's age are independent of one another, given the variable corresponding to adding to the cart.

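Using rules of probabilities, this can be expressed mathematically as follows:

$$P(\text{site visit}, \text{age} \mid \text{add to cart}) = P(\text{site visit} \mid \text{add to cart})\,P(\text{age} \mid \text{add to cart})$$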

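Our overall formula becomes:

$$P(\text{add to cart} \mid \text{site visit}, \text{age}) = \frac{P(\text{site visit} \mid \text{add to cart})\,P(\text{age} \mid \text{add to cart})\,P(\text{add to cart})}{P(\text{site visit}, \text{age})}$$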
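A rough Python sketch of this calculation follows. It multiplies P(site visit | outcome), P(age | outcome), and P(outcome) for each outcome, and then normalizes across outcomes in place of dividing by P(site visit, age), which is a common practical shortcut and an assumption on our part here. The record layout, the age bands, and the helper names are hypothetical.

```python
from collections import Counter

def naive_bayes_posterior(records, visited, age_band):
    """Estimate P(outcome | visited, age_band) for each outcome (hypothetical helper)."""
    class_counts = Counter(r["added_to_cart"] for r in records)
    scores = {}
    for outcome, n_outcome in class_counts.items():
        rows = [r for r in records if r["added_to_cart"] == outcome]
        p_outcome = n_outcome / len(records)                              # P(outcome)
        p_visit = sum(r["visited"] == visited for r in rows) / n_outcome  # P(visit | outcome)
        p_age = sum(r["age_band"] == age_band for r in rows) / n_outcome  # P(age | outcome)
        scores[outcome] = p_visit * p_age * p_outcome                     # numerator
    total = sum(scores.values())  # normalizer standing in for P(visit, age)
    if total == 0:
        raise ValueError("feature combination never observed; smoothing is needed")
    return {outcome: score / total for outcome, score in scores.items()}

def make(visited, age_band, added):
    """Build one toy training record."""
    return {"visited": visited, "age_band": age_band, "added_to_cart": added}

# Invented history: 4 shoppers who added to cart, 6 who did not.
records = (
    [make(True, "25-34", True)] * 2
    + [make(True, "35-44", True), make(False, "25-34", True)]
    + [make(True, "35-44", False)] * 2
    + [make(False, "35-44", False)] * 3
    + [make(False, "25-34", False)]
)
posterior = naive_bayes_posterior(records, visited=True, age_band="25-34")
```

For this toy data, `posterior[True]` works out to 27/31, or roughly 0.87: a shopper aged 25-34 who visited the site is scored as likely to add to the cart. Only per-feature counts are needed, never counts of feature combinations, which is what keeps the calculation cheap as features are added.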

Because this formula uses the individual conditional probabilities of its constituents rather than their joint dependence on one another, it is computationally tractable. Upon closer examination of our formula, we might discover a flaw that is remedied by what is called a “Laplace estimator.” More specifically, if one of the multipliers in the numerator happens to be zero, the whole probability will be zero, which might not be accurate.

This is called the “zero frequency” problem: how do you estimate the probability of an event that you have not observed? Setting the probability of such events to zero is not correct, and would give rise to serious problems in the Bayesian approach. Instead, an estimator like the Laplace estimator is used, which keeps every probability estimate non-zero so that the Bayesian formula does not produce a zero probability.
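One common form of the Laplace estimator adds a small constant to every count before dividing. The Python sketch below assumes add-one smoothing (`alpha=1.0`); the post does not say which variant or constant Apigee Insights uses, and the function name is hypothetical.

```python
def laplace_estimate(feature_count, class_count, n_values, alpha=1.0):
    """Smoothed estimate of P(feature value | class).

    feature_count: times this feature value was seen within the class
    class_count:   number of training examples in the class
    n_values:      number of distinct values the feature can take
    alpha:         pseudo-count added to every value (1.0 is add-one smoothing)
    """
    return (feature_count + alpha) / (class_count + alpha * n_values)

# A value never seen among 6 examples of a class, for a two-valued feature:
unsmoothed = 0 / 6                    # 0.0, which would zero out the whole product
smoothed = laplace_estimate(0, 6, 2)  # (0 + 1) / (6 + 2) = 0.125
```

Because the same pseudo-count is added for every possible value, the smoothed estimates for a feature still sum to 1 within each class.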

Programming the Naïve Bayesian classifier

We have chosen Hadoop to store and manipulate the large amount of data available for these types of calculations. We use the MapReduce approach to calculate the individual probabilities in the Naïve Bayes formula. The mapper function identifies the features that satisfy a particular constraint and the reducer calculates the probabilities. When the component probabilities are available, they’re plugged into the formula to calculate the final probability.
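The division of labor between mapper and reducer can be sketched in plain Python. This is an in-process simulation of the MapReduce pattern, not Hadoop API code and not the actual Apigee job; in this sketch the reducer only sums counts and the probabilities are derived afterward. The record fields and function names are hypothetical.

```python
from collections import defaultdict

def mapper(record):
    """Emit (key, 1) pairs: one per class label, one per (feature, value, label)."""
    label = record["added_to_cart"]
    yield ("class", label), 1
    for feature in ("visited", "age_band"):
        yield (feature, record[feature], label), 1

def reducer(key, values):
    """Sum the counts emitted for a single key."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

# Toy input records.
records = [
    {"visited": True, "age_band": "25-34", "added_to_cart": True},
    {"visited": True, "age_band": "35-44", "added_to_cart": True},
    {"visited": False, "age_band": "25-34", "added_to_cart": False},
]
counts = run_job(records)

# Component probability P(visited | added to cart), derived from the reduced counts.
p_visit_given_cart = counts[("visited", True, True)] / counts[("class", True)]
```

Each key can be reduced independently of the others, so the counting step parallelizes across a cluster; only the small table of counts has to be gathered to evaluate the final formula.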

We hope that this has been a useful introduction to the machine learning technique called the Naïve Bayesian classifier. Coupled with the GRASP representation of customer activity, this approach combines information about customer profiles and activity to predict customer behavior. It is especially effective when events from multiple sources, say, a web site, a physical store, and a mobile app, generate data about customer activity.

Coming up in this series, we'll present the viewpoint of a data scientist who spent many years working with a variety of predictive modeling tools, and discuss how this new predictive modeling approach is put to use.

Photo: Nina Matthews Photography/Flickr
