
Building an Event Sequence-Based Predictive Model

Feb 27, 2015

I just successfully built my first event sequence-based predictive model! Here’s why I consider this a big deal.

I started my professional journey in predictive analytics in 2000 (right after helping Bank of America prepare their systems for Y2K compliance) with a small predictive analytics startup called Stone Analytics. Little did I realize that this would evolve into the rapidly growing field now known as data science.

So over the last 15 years I have been exposed to a lot of applications and algorithms for doing predictive analytics and supervised and unsupervised machine learning. I’ve built quite a few models in my career, but I’m particularly excited about my latest one because it marks the first time I was able to use temporal event sequence patterns as inputs.

The status quo: behavior summaries

Most popular machine learning algorithms available today are designed to seek out patterns in profile attributes or, at best, in features (input variables) that summarize the behavior of subjects (customers and users, for example) as a frequency of visits or as a set of dummy variables. For example, in a model to predict 30-day hospital readmissions, along with patient profile and diagnostic data, you would summarize how many times a patient was admitted to a hospital during the past six months as a single feature.

Likewise, you would summarize other care events, such as emergency room visits, urgent care, and diagnostic services received. This is done to approximate and simplify the patient care experience, also known as the care “journey.” This is how I built a 30-day readmission risk model for a large East Coast healthcare provider before joining Apigee.
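To make the summary approach concrete, here is a minimal Python sketch of collapsing an event history into count features. The event names, the sample history, and the `summarize` helper are all hypothetical and only illustrate the idea; this is not the code behind the readmission model described above.

```python
from collections import Counter

# Hypothetical care events for one patient: (days_ago, event_type).
events = [
    (170, "er_visit"),
    (120, "admission"),
    (45, "urgent_care"),
    (30, "admission"),
    (200, "admission"),  # falls outside the six-month window
]

def summarize(events, window_days=180):
    """Collapse an event history into one count per event type."""
    counts = Counter(kind for days_ago, kind in events if days_ago <= window_days)
    return {f"n_{kind}_last_{window_days}d": n for kind, n in counts.items()}

features = summarize(events)
print(features)
```

Each event type becomes a single number, so the order in which the events happened is gone by the time the model sees the data.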

The importance of understanding sequence

Now let’s consider what wasn’t in the model: the actual sequence of care events, such as the sequence of care providers that a patient interacted with, sequence of procedures, medical equipment used, medication administered, hospital rooms employed, and other services received during a care journey. Using this type of event sequence data along with patients’ profile and diagnostic information, patterns of care emerge that result in higher readmission risk, or that reveal sources of risk for hospital-acquired infections throughout the entire care journey.

Such sequences would indicate future patient care outcomes, but would be lost if the care events were simply aggregated and represented as counts. This approach of leveraging event-sequence patterns applies to numerous industries outside health care, including retail, travel, and hospitality, where there are potentially millions of customers and terabytes of behavior data that is captured but rarely used.
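A small Python sketch shows what gets lost. The two journeys below are made up, and using adjacent event pairs (bigrams) as order-aware features is just one simple assumption for illustration, not a description of any particular product's method.

```python
from collections import Counter

# Two hypothetical care journeys with the same events in a different order.
journey_a = ["er_visit", "admission", "surgery", "discharge"]
journey_b = ["admission", "surgery", "er_visit", "discharge"]

def count_features(journey):
    """Order-blind summary: how often each event type occurred."""
    return Counter(journey)

def bigram_features(journey):
    """Order-aware summary: how often each adjacent pair of events occurred."""
    return Counter(zip(journey, journey[1:]))

same_counts = count_features(journey_a) == count_features(journey_b)
same_bigrams = bigram_features(journey_a) == bigram_features(journey_b)
print(same_counts, same_bigrams)
```

The count features cannot tell the two patients apart, while the pair features can: only the second journey contains an emergency room visit right after surgery.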

A graph structure of temporal event sequences

In fact, I think we can safely argue that when you want to predict what a patient or customer might do next (whether they’re likely to be readmitted within 30 days, or call into a call center), understanding the sequence of prior interactions is far more relevant than their profile data (things like age, income, and gender).

Since joining Apigee, I’ve talked with many data scientists who are trying to leverage these types of patterns in event sequences; this kind of data makes up much of their data lakes. They’ll often say: “It’s really time-consuming and hard to query or structure my data to do this type of analysis.”

In fact, the ideal data structure for capturing this type of information is not the relational or tabular data structures that most machine learning algorithms rely on. It requires a database normalized for storing event sequences—a graph structure of temporal event sequences.
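As a rough illustration of what a graph of temporal event sequences could look like, the Python sketch below stores each observed transition between consecutive events as an edge, along with how often it occurred and the total time that elapsed. The data, field names, and `build_transition_graph` function are assumptions made for this example; the post does not describe how Apigee Insights structures its graph internally.

```python
from collections import defaultdict

# Hypothetical timestamped events per patient: (hours_since_start, event_type).
journeys = {
    "patient_1": [(0, "er_visit"), (3, "admission"), (30, "surgery"), (100, "discharge")],
    "patient_2": [(0, "admission"), (20, "surgery"), (90, "discharge")],
}

def build_transition_graph(journeys):
    """Build edges keyed by (from_event, to_event) with a count and total time gap."""
    graph = defaultdict(lambda: {"count": 0, "total_gap": 0})
    for events in journeys.values():
        ordered = sorted(events)  # order each journey by timestamp
        for (t1, e1), (t2, e2) in zip(ordered, ordered[1:]):
            edge = graph[(e1, e2)]
            edge["count"] += 1
            edge["total_gap"] += t2 - t1
    return dict(graph)

graph = build_transition_graph(journeys)
for (src, dst), edge in graph.items():
    print(src, "->", dst, edge)
```

Unlike a flat table of counts, this structure keeps both the order of events and the time between them, which is the information a tabular summary discards.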

There certainly are many open source graph databases out there today, but most, if not all, are designed for social graphs, or graphs of other types of relationships, like web page links. This is very different from the sequence of interactions and events that occurs over time.

Collecting behavior data using APIs

Efficiently leveraging event sequence patterns requires not only a process that creates a graph data structure of millions of events, with even more traversal patterns across those events, but also an efficient algorithm that can query and find patterns across this high-dimensional, sparsely populated space. This is exactly what Apigee Insights, our predictive analytics platform, enables our customers and data scientists like me to do, many of us for the very first time.
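To give a feel for the kind of pattern query involved, here is a deliberately naive Python sketch that counts how often each short contiguous event pattern appears across journeys and how often those journeys ended in a readmission. The labeled journeys and the `pattern_outcome_rates` function are hypothetical, and this brute-force approach is not the Insights algorithm; it would not scale to millions of events.

```python
from collections import defaultdict

# Hypothetical labeled journeys: (event sequence, readmitted within 30 days?).
labeled = [
    (["er_visit", "admission", "discharge"], True),
    (["admission", "surgery", "discharge"], False),
    (["er_visit", "admission", "surgery", "discharge"], True),
    (["admission", "discharge"], False),
]

def pattern_outcome_rates(labeled, n=2, min_support=2):
    """For each contiguous n-event pattern, return (support, readmission rate)."""
    support = defaultdict(int)
    positives = defaultdict(int)
    for sequence, outcome in labeled:
        # Use a set so each journey counts a given pattern at most once.
        patterns = {tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}
        for pattern in patterns:
            support[pattern] += 1
            positives[pattern] += int(outcome)
    return {
        pattern: (support[pattern], positives[pattern] / support[pattern])
        for pattern in support
        if support[pattern] >= min_support
    }

rates = pattern_outcome_rates(labeled)
for pattern, (seen, rate) in rates.items():
    print(pattern, seen, rate)
```

The number of candidate patterns grows quickly with pattern length and with the number of distinct event types, and most patterns appear in very few journeys. That is the high-dimensional, sparse space that a purpose-built engine has to search efficiently.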

For Apigee customers who deploy Edge, the Insights technology is particularly exciting in that they have a mechanism for collecting additional behavior data using APIs deployed across multiple devices and points of interaction. For many, this will be a new source of behavior information and near-real-time contextual information that could be leveraged to provide better service and more personalized experiences to their customers, leading to much more impactful interactions with them.
