Big Broad Data: Beyond the “Bigness” and the Technology
The amount of data in our world has been exploding, and the concept of “Big Data”—collecting and analyzing large data sets—needs no introduction. It is the new IT catchphrase.
A quick Google search, a track of #bigdata in your Twitter feed, or a five-minute conversation with IT folks about "Big Data" will reveal how the weight of the discussion focuses primarily on two things: first on technology, and then on "bigness." While both are important, the focus on technology and size is misplaced and causes us to miss the point: the depth of analysis of the data and the insights drawn from it are the most important and valuable things.
This article will discuss the new computing environment businesses must work in and how businesses can derive value from Big Broad Data generated from this new environment. The goal is to initiate discussion within an enterprise as to how it can leverage analytics and deep insights to achieve foresight and nimbleness in any market.
Hardly a conversation happens in the big data space that does not start with the pros and cons of Hadoop, NoSQL, Cassandra, HBase . . . the list goes on. Big Data technology is of course extremely important because without it the signal cannot be separated from the noise, and large datasets cannot be managed. But technology is just a commodity; what matters is how you use it to generate valuable end results.
Right behind the technology discussion comes “Bigness”: the petabytes vs. terabytes argument. There are certainly technical complexities to dealing with petabytes of data. But terabytes and even kilobytes can be big and hold valuable information. Remember that much of the size comes from noise, whether you are dealing with kilobytes, terabytes, or petabytes. Big, noisy data is not valuable; the value comes from the signal that you can extract.
To successfully create value from big data, we should focus on the breadth of the data, signal extraction, and deep insights. This should make us think about the implications and results of the technology but not the technology itself. Bottom line: the data itself and the insights derived from it are the real gold, the new currency.
The data from social, mobile, and cloud computing comprise the breadth of data. Data about an enterprise's customers is available and interpretable in all kinds of new contexts. A customer that checked in at the gym on Foursquare before visiting a retailer is likely to be interested in sporting goods. You can imagine hundreds of similar examples.
What's a good example of value from extracting signal over noise? A Klout Score uses data from social networks to measure reach and social influence. This is an example of a “signal” that is extracted from a superabundance of data such as tweets and other social interactions.
Deep Insight is about how people can take the output of various machines and systems and convert it into business value. We have come to know that the number of online shoppers who abandon their "shopping carts" part-way through their transaction is higher on Android devices than on iOS devices. This provides insight into the persuasiveness of Android apps and the behaviors of Android users.
Making the shift from Big to Broad Data
In this section, we will delve into what we mean by "Broad Data" as well as some of the fundamental changes for businesses in today’s marketplace that compel the need to focus on breadth of data and on data stitching from disparate sources.
The shift of control to the edge of the enterprise
Social, mobile and cloud influences have caused enterprises to undergo a dramatic transformation in how they do business with customers. The real value for an enterprise—the interaction with end users (customers)—has shifted one or two degrees away from the enterprise. The control is shifting to social networks where people are talking about companies and products, to business networks where interactions are happening through partner channels, and to apps and the APIs they leverage.
The landscape for customer interaction with enterprises looked significantly different just a few years ago. Data was controlled within the enterprise—all of the data was generated when partners and customers interacted with systems produced and provided by the enterprise.
But today’s landscape reveals an expansion of customer interaction branching out from the core of the enterprise. The evolution of apps and the API economy has resulted in people using apps created by the enterprise, its partners, and third-party developers. Apps then become conduits of information which provide important metadata about how customers and partners are interacting with them. Factor in the influence of social, partner, and business networks and the effect is amplified. Simply put, the enterprise is no longer in control of the data it needs to make accurate business decisions. That is the fundamental shift of interaction to the edge (and even beyond the boundaries) of the enterprise.
This shift to the edge results in a fundamental change for the Big Data conversation. The number and variety of data sources become much more important than the sheer volume that comes from any one source.
Big Data becomes Broad Data
Data is not by itself "Big." Aggregated fragments of small and contextually related data make for "Big" or, more accurately, "Broad" data. Taking advantage of the breadth of the data, its variety, its dynamism, and its disparate sources is the key to deriving benefit from big data.
Just a few years ago, an enterprise collected its data from brick-and-mortar establishments, websites, and partners, drawing on 5 to 7 primary data sources. Data from point-of-sale systems, supply records, customer records, warehouse records, and so on reflected an enterprise’s interaction with customers and partners.
Today the sources and types of data are expanding continuously: there are hundreds of new data sources, each generating differently sized packets of data, and the result is a lower signal-to-noise ratio.
The shift can be seen in the numbers—from 100% of data captured from the primary data sources to a scenario in which maybe less than 50% comes from these sources. In time, the old enterprise sources may not even be the most important.
The many new sources of data individually contribute a smaller amount of data than the previous enterprise-owned sources. New data sources are diverse and include Twitter, Facebook, partners, and tens or hundreds of apps, most of which are built around APIs. This “spreading” of data defines the need for the shift from the deep and big focus of the old world to the broad and pervasive focus of the new world. This new definition will allow businesses to focus holistically on all of the new data streams where they can find signals relevant to their business.
Big Broad Data: Increasing the Signal-to-Noise Ratio
Being responsive to up-to-date data and aware of its varied origins will allow any company to make insightful business decisions. In this section, we will discuss how to be responsive by extracting meaningful signals from amassed data.
Signal extraction and stitching data from diverse sources
Whenever you collect a lot of data, you collect a lot of both signal and noise. In fact, the most prevalent approaches to Big Data focus on extracting the signal from the noise in the data collected from the traditional enterprise.
Now with the footprints of customer and partner behavior spread across hundreds and maybe thousands of data sources (see Making the shift from Big to Broad Data), there is undoubtedly a smaller signal/noise ratio than before. Data from diverse and disparate sources need to be stitched together to ensure that the signal rises above the noise.
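As a minimal sketch of what stitching can look like, the Python below merges events from three hypothetical sources (a point-of-sale system, a check-in feed, and an app usage log) on a shared customer identifier. The source names, the field names, and the assumption that a common `customer_id` already exists are all illustrative; in practice, matching identities across sources is often the hard part.

```python
from collections import defaultdict

# Hypothetical events from three sources; every field is illustrative.
pos_events = [
    {"customer_id": "c1", "source": "pos", "item": "running shoes"},
    {"customer_id": "c2", "source": "pos", "item": "coffee maker"},
]
checkin_events = [
    {"customer_id": "c1", "source": "checkin", "venue": "gym"},
]
app_events = [
    {"customer_id": "c1", "source": "app", "action": "viewed sporting goods"},
    {"customer_id": "c3", "source": "app", "action": "abandoned cart"},
]


def stitch(*sources):
    """Group events from every source under the customer they belong to."""
    profiles = defaultdict(list)
    for events in sources:
        for event in events:
            profiles[event["customer_id"]].append(event)
    return dict(profiles)


def sources_seen(profile):
    """Return the distinct sources that contributed to one profile."""
    return sorted({event["source"] for event in profile})


profiles = stitch(pos_events, checkin_events, app_events)
for customer_id in sorted(profiles):
    print(customer_id, sources_seen(profiles[customer_id]))
```

A customer who shows up in several sources (here `c1`) offers more context than any single event does, which is the sense in which stitched data lets the signal rise above the noise.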
Successful enterprises will be those that stitch and derive insight from the data that they own as well as that which they can acquire. Today, the challenges are striking deals and forming partnerships to get access to hundreds, not handfuls, of data sources. Purchasing data or scavenging internal departments for information has now become antiquated.
Enterprises need to think about access and control. In the past, data was processed in a central repository within the enterprise, whether in a warehouse or through other techniques, and how it could be analyzed was a lesser concern. Now two questions arise:
- Which mechanisms are best suited to stitch data from its parent enterprise together with data from syndicated and external sources?
- How will people interact with external data?
We’ve got to understand the form in which this data will be exposed and how it will be consumed.
In Web 1.0, techniques centered around Web crawling; in Web 2.0, Web pages, AJAX and other "rich" interface technologies were the focal points. As we move towards Web 3.0, broad and diverse data will be disseminated, consolidated, and correlated through the power of APIs.
Today, transactional APIs (which perform tasks like sending messages, making trades, getting credit information, and so on) tend to dominate the API conversation. But we are now experiencing a revolution in the world of data APIs, because the future lies in easy consumption of data, and that depends on how data flows and interacts. APIs are central to the handling and processing of big, broad, diverse data, and the structure of the data APIs becomes increasingly important.
Big Broad Data: the Role of Data APIs
Now that we have discussed the need to stitch together data from disparate sources to draw deep insight, we have to determine who should expose their data through APIs and start a discussion of pragmatic ways to expose it.
Building an Information Halo around APIs
There are two types of enterprises from a data perspective: those whose core business and asset is data, and those that have valuable data which can be distributed to increase transactions in their core business. In both scenarios, data is a fundamental and critical part of an API strategy.
In the latter case, the data (information) itself is not necessarily monetizable, but it can attract people to the core business. This available information surrounding your business is referred to as the “Information Halo.”
For a great discussion about the notion of information halos around your core business, see: Amundsen’s Dogs, Information Halos and APIs: The epic story of your API Strategy.
Enterprises that are monetizing data are now creating specific domains of influence. Weather, finance, real estate, Internet traffic and hundreds of other domains are forming.
Companies that have successfully created a data domain have done so by collecting data from disparate sources. To construct any particular domain, a company most likely stitched together data from these many sources, cleansed and standardized it, and then offered it to developers through an API.
What are companies using to collect and stitch the data?
A natural and familiar stitching technique is the linked data model (linkeddata.org). While linked data techniques are excellent at accessing individual data elements, this is not the model that these data providers need. Data providers are companies who plan to provide data through APIs after stitching data from disparate data sources. Thus by their nature, they need to crawl, bulk load, and access data in large quantities, before cleansing, standardizing, and delivering it.
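The cleanse-and-standardize step can be sketched as follows, using the weather domain as an example. This is only an illustration under stated assumptions: the two source formats, their field names, and the target schema (`city`, `temp_c`) are hypothetical and are not taken from any real data provider.

```python
def from_source_a(record):
    """Source A (hypothetical) reports Fahrenheit and a padded city name."""
    return {
        "city": record["City"].strip().title(),
        "temp_c": round((record["TempF"] - 32) * 5 / 9, 1),
    }


def from_source_b(record):
    """Source B (hypothetical) reports Celsius as a string."""
    return {
        "city": record["location"].strip().title(),
        "temp_c": round(float(record["celsius"]), 1),
    }


def standardize(raw_a, raw_b):
    """Map both feeds onto one schema and drop records missing a city."""
    unified = [from_source_a(r) for r in raw_a] + [from_source_b(r) for r in raw_b]
    return [r for r in unified if r["city"]]


raw_a = [{"City": "  austin ", "TempF": 95.0}]
raw_b = [
    {"location": "BOSTON", "celsius": "21.5"},
    {"location": " ", "celsius": "3"},  # blank city: removed during cleansing
]

for row in standardize(raw_a, raw_b):
    print(row)
```

Each source gets its own small adapter, so adding another feed means writing one more function rather than changing the unified schema that the API would expose.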
If linked data is not the most effective method to stitch data to create domains, then should linked data become the de facto standard to express data from the information halo (at the top of the stack and as the data API for domains)?
The answer is probably yes, eventually, although it is unrealistic at the current time. Today, the challenge for business is how to cleanse, standardize, and unify data for use in a particular domain. Linked data techniques have the right characteristics to bring together data that have already been cleansed, standardized, and stitched, but they are not a great model for the initial stitching. Linked data will most likely become useful and common in the future when the inter-linking of domains becomes more important than it is today.
If linked data is not the approach to expose data as APIs, what is?
Schema-based access to data APIs patterned after relational models is a preferred method. Here are a few examples of data APIs, which highlight three common kinds of data access patterns.
- Primary key lookup: uses a keyword to find a specific data element
- Imposed hierarchy-based lookup: the search query traverses a hierarchy of classes to yield the desired data elements
- Rectangular lookups: defined by typical relational lookups of rows and columns
All of these techniques are being built around single data sources as opposed to massively linked data sources.
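The three access patterns can be illustrated with a short Python sketch over a single in-memory data source. The real-estate records, the field names, and the function names are hypothetical; an actual data API would expose these lookups over HTTP rather than as local functions.

```python
# Hypothetical in-memory data set standing in for a single data source.
LISTINGS = [
    {"id": "h1", "state": "CA", "city": "Oakland", "price": 650000, "beds": 3},
    {"id": "h2", "state": "CA", "city": "Fresno", "price": 320000, "beds": 4},
    {"id": "h3", "state": "TX", "city": "Austin", "price": 410000, "beds": 3},
]


def primary_key_lookup(key):
    """Return the single record whose id matches, or None."""
    return next((row for row in LISTINGS if row["id"] == key), None)


def hierarchy_lookup(state, city=None):
    """Walk an imposed state -> city hierarchy and return matching records."""
    rows = [row for row in LISTINGS if row["state"] == state]
    if city is not None:
        rows = [row for row in rows if row["city"] == city]
    return rows


def rectangular_lookup(columns, predicate):
    """Select rows with a predicate and project the requested columns."""
    return [{c: row[c] for c in columns} for row in LISTINGS if predicate(row)]


print(primary_key_lookup("h3")["city"])
print([row["id"] for row in hierarchy_lookup("CA")])
print(rectangular_lookup(["id", "price"], lambda row: row["beds"] == 3))
```

The rectangular lookup is the most general of the three: the other two can be expressed as special cases of filtering rows and choosing columns.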
An Opinion on the Future of Data Protocols
There are several approaches to data APIs. In addition to Pragmatic REST for SQL Developers, there is also Microsoft’s OData approach. OData is a step in the right direction, but much work remains before it can become the standard for Data APIs. The “RESTification” of the Data APIs is a fundamental imperative, and both the Pragmatic REST for SQL and the OData approaches are good starting points.
Whatever the solution is, it cannot be vendor specific. Data is too important, and the data revolution too fundamental, for it to be associated with any one vendor. OData technologies need to be available in all ecosystems, not just in Windows Communication Foundation (WCF) and the .NET Framework. Similarly, pragmatic REST and other techniques cannot be available only in Apigee or any other single vendor's offering. Let the call to action be to come together as a community, bring together the best of the linked data and OData ideas and techniques, and transform the world with data APIs.
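To make the idea of a RESTified data API a little more concrete, here is a sketch of how a rectangular lookup might be written as an OData-style URL. `$filter`, `$select`, and `$top` are standard OData query options; the service root `https://example.com/odata`, the `Listings` entity set, and its fields are hypothetical.

```python
from urllib.parse import quote, urlencode

# Hypothetical service root and entity set; not a real endpoint.
SERVICE_ROOT = "https://example.com/odata"
ENTITY_SET = "Listings"


def build_query_url(filter_expr, columns, top):
    """Build an OData-style URL that filters rows and selects columns."""
    options = {
        "$filter": filter_expr,         # which rows to return
        "$select": ",".join(columns),   # which columns to return
        "$top": top,                    # maximum number of rows
    }
    # Keep "$" and "," readable; encode spaces as %20 instead of "+".
    query = urlencode(options, safe="$,", quote_via=quote)
    return f"{SERVICE_ROOT}/{ENTITY_SET}?{query}"


url = build_query_url("price lt 500000", ["id", "price"], 2)
print(url)
```

Because the whole query lives in the URL, any HTTP client in any ecosystem can issue it, which is the property the vendor-neutrality argument depends on.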