The Paleo Diet: Unstructured Data for the Enterprise CEO

Shomit Ghose works with ONSET Ventures.

The original Big Data dates from mankind’s Paleolithic age: speech, pictures and writing. The data gained from sight (text, images) and sound (language, music) remain the essential media of communication for humans today. Unfortunately, the data conveyed in speech, text and pictures is “unstructured”: unlike numeric data, for example, it has no defined structure that can be easily mapped into a database and interpreted. For enterprises, the Paleo Diet – i.e., accessing and unlocking the insights contained in unstructured data – presents the key strategic opportunity in the field of Big Data.

As we swim in ever greater oceans of Big Data, IDC has found that 90 percent of this digital information is unstructured content. For the enterprise, this unstructured data takes the form of social media posts, call center notes, email, images, video, Web content, sensor and mobile data, warranties, contracts, sounds, shapes, ads, click-streams, Office documents, X-rays, MRIs, doctors’ notes, real estate listings and annual reports. Needless to say, the rows and columns of a traditional structured database are completely unsuited to organizing and making sense of unstructured data.

If properly harnessed and combined with structured data, unstructured data promises to deliver to enterprises deep, 360-degree views of customers. Unstructured data is a powerful resource for applications like audience clustering, predictive marketing and sentiment analysis. Essentially, while the stream of structured (transactional) data readily explains what is happening at the moment, the stream of unstructured data can yield insights into what’s going to happen, or why something happened.

To date, structured data has been the basis of enterprise analytics because it’s relatively easy to interpret: structured data is primarily numeric, repeatable in type, and predictable in timing and treatment. Unstructured data is far more challenging. Not only is the data volume vastly greater, but unstructured data has (by definition!) no inherent format or repeatability, and brings with it an extremely unfavorable signal-to-noise ratio.

Further, securing unstructured data is an added challenge (think regulatory risk) given that its content cannot be known a priori. In addition, different industries rely on unstructured data to different degrees, and different departments within the same company may rely on entirely different sources of it: marketing on social media; engineering on design documents; customer support on call notes; finance on emails; sales on contracts; and HR on employee reviews. For an enterprise, making sense of unstructured data can seem a daunting undertaking.

So, how does a C-level executive manage an unstructured data initiative within their company? Despite the apparent complexities of bringing unstructured data into the enterprise, the recipe for embarking on a corporate Paleo Diet is rather straightforward:

Clearly understand the business. Without a well-defined business use case, it’s impossible to know what unstructured data is required, how it should be interpreted, or whether in the end the unstructured data initiative has been successful in improving the bottom line. First and foremost, begin by understanding the types of insights the business needs.

Identify the sources of unstructured data. Sources may be internal or external, but they must exist. If multiple sources are available, a good starting point might be to work with the source that’s growing most rapidly.

Clean the data. Unstructured data is voluminous, “noisy”, and brings lots of redundancy. For unstructured data to be usable it must first be cleaned, de-duplicated and consolidated. Confidential data that requires special handling should be secured or completely masked.

Structure the data. Without a business-driven categorization overlaid upon it, unstructured data is useless. Structure must be imposed according to classifications that are critical to the business. As more data is ingested (with homogeneity never a given), the categorization must be automated through techniques such as text analytics, auto-tagging, and auto-taxonomy generation.

Integrate the structured data with the unstructured data. Big Data will only reach its full potential when structured data is seamlessly combined with unstructured data.

Build a feedback loop. Is the classification of the unstructured data optimal? How should it best be integrated with the structured data? Is the unstructured data helping drive meaningful business decisions?

Support unstructured data as true production data. If unstructured data is used to drive an enterprise’s business, then it needs to be appropriately treated as a production asset. It must be secured; it must be testable; it must be resilient to failure; it must support archiving and recovery; it must be accessible.
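
For readers who want to see the middle of the recipe in concrete terms, the cleaning, structuring and integration steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production design: the two-category taxonomy, the keyword lists, the sample call notes and the function names clean, tag and integrate are all hypothetical, and simple keyword matching stands in for real text analytics.

```python
import hashlib
import re
from collections import Counter

# Hypothetical business taxonomy (an assumption): category -> trigger keywords.
TAXONOMY = {
    "billing": {"invoice", "refund", "charge"},
    "reliability": {"crash", "outage", "error"},
}

# Deliberately simple email pattern, used only to mask confidential data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean(notes):
    """Normalize case and whitespace, mask email addresses, drop duplicates."""
    seen, cleaned = set(), []
    for customer_id, text in notes:
        normalized = " ".join(text.lower().split())
        masked = EMAIL_RE.sub("[EMAIL]", normalized)
        digest = hashlib.sha256(f"{customer_id}|{masked}".encode()).hexdigest()
        if digest not in seen:  # keep only the first copy of each note
            seen.add(digest)
            cleaned.append((customer_id, masked))
    return cleaned


def tag(text):
    """Return the sorted taxonomy categories whose keywords appear in text."""
    words = set(re.findall(r"[a-z]+", text))
    return sorted(cat for cat, keywords in TAXONOMY.items() if words & keywords)


def integrate(accounts, notes):
    """Attach per-category note counts to each structured account record."""
    counts = {cid: Counter() for cid in accounts}
    for customer_id, text in clean(notes):
        if customer_id in counts:
            counts[customer_id].update(tag(text))
    return {cid: {**accounts[cid], "topics": dict(counts[cid])} for cid in accounts}


# Structured data: account records keyed by a made-up customer ID.
accounts = {"c1": {"plan": "gold"}, "c2": {"plan": "basic"}}

# Unstructured data: free-text call notes, including one near-duplicate.
notes = [
    ("c1", "App crash after update. Contact jane@example.com"),
    ("c1", "app crash after update.  contact JANE@example.com"),
    ("c1", "Customer wants a refund for the duplicate charge"),
    ("c2", "Asked about opening hours"),
]

profiles = integrate(accounts, notes)
print(profiles)
```

In this sketch the two near-identical notes collapse into one, the email address is masked before anything else touches the text, and each account record ends up carrying a count of the business topics raised in its notes. In practice the keyword lookup would give way to the automated classification techniques described above, and the results would feed the feedback loop.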

We continue to flood the Internet with many trillions of gigabytes of data annually, overwhelmingly through the unstructured data of text, images and video. The opportunity for enterprises to gain insights from this, the largest and oldest class of business data, is immense. With a thoughtful strategy, and thanks to enabling technologies such as Hadoop, enterprises are now able to drive predictive analytics from the patterns and connections hidden within the 90 percent of Big Data that is unstructured. A comprehensive strategy that marries unstructured and structured data promises to fully and finally deliver the benefits of Big Data to the enterprise.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
