Pavel Ignatov - Fotolia

Challenges of combining structured and unstructured data

Business intelligence tools can turn company data into valuable insights, but unstructured data is often a missing component. It doesn't have to be.

Shawn Shell, Hitachi Consulting

Published: 17 Aug 2015

Today's business intelligence landscape has exploded. The availability of data-generating systems creates a treasure trove of useful insight. Tool vendors such as Oracle, Microsoft, SAS Institute Inc., Tableau Software, Pentaho Corp. and IBM have created increasingly sophisticated and accessible analysis suites that can guide decision making. In doing so, these tools create real value that can easily justify significant business intelligence (BI) investments.

Unfortunately, a great deal of the data is locked in unstructured content. To make matters worse, much of the existing structured data uses inconsistent languages and business definitions. A truly comprehensive picture of the most valuable insights comes only when rationalized structured data is combined with unstructured content.

Creating the right master data

Master data provides common terminology for objects across an enterprise, such as customers, materials, suppliers, geographies and products. Without master data, it's impossible to accurately construct any substantive insights. This is difficult enough in the exclusively structured data world, where data can be organized by a series of columns and column headings. But organizations are also inundated with unstructured data, which is more free form and can't be easily categorized in an Excel spreadsheet. A lack of defined and consistent master data creates an additional hurdle when trying to combine structured and unstructured data.

To overcome this challenge, start with a few key systems -- ideally, revenue-generating systems. For example, customer relationship management (CRM), ERP and other sales or production data should take priority over HR data. Prioritizing these applications allows you to focus efforts on data that's key to revenue. The master data for these systems not only helps orient key structured data sets, but it lays the foundation for joining to unstructured content through metadata.
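As a minimal sketch of what common terminology can mean in practice, the snippet below maps the spelling variants that different systems use for a customer onto one master ID. The customer names, the IDs and the `MASTER_CUSTOMERS` structure are all hypothetical, and real master data management tools do far more (matching rules, stewardship workflows, history).

```python
# Hypothetical master data: one canonical ID and name per customer,
# plus the spelling variants seen across source systems.
MASTER_CUSTOMERS = {
    "CUST-001": {"name": "Acme Corporation", "aliases": ["ACME Corp.", "Acme Corp", "acme"]},
    "CUST-002": {"name": "Globex Inc.", "aliases": ["Globex", "GLOBEX INC"]},
}


def _normalize(text):
    """Lowercase and drop punctuation so spelling variants compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def build_lookup(master):
    """Map every normalized name or alias to its master ID."""
    lookup = {}
    for master_id, record in master.items():
        for label in [record["name"], *record["aliases"]]:
            lookup[_normalize(label)] = master_id
    return lookup


def to_master_id(raw_name, lookup):
    """Return the master ID for a raw system value, or None if unmapped."""
    return lookup.get(_normalize(raw_name))


lookup = build_lookup(MASTER_CUSTOMERS)
crm_value = "ACME Corp."        # as a CRM system might store it
erp_value = "Acme Corporation"  # as an ERP system might store it
print(to_master_id(crm_value, lookup), to_master_id(erp_value, lookup))
```

Values that return `None` are the ones that still need a decision from whoever owns the master data.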

Creating and extracting appropriate metadata

One of the major challenges in getting value from unstructured content is that, by definition, there's no definitive data structure. Unstructured data usually comes in the form of documents, photos, videos or other blob file types. Unlike a relational database, these storage structures lack the organization to easily integrate with other data.

But metadata often "wraps" some of these file types and enables partial integration. For example, digital photographs are usually coded with basic metadata, such as exposure, focal length, shutter speed and -- in many cases -- a geotag (a GPS record). Microsoft's Office suite automatically attaches document properties. These document properties usually include basic information like title, author and subject.
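To illustrate reading that native metadata, here is a rough sketch that pulls title, author and subject from a modern Office file using only the Python standard library. It assumes the file is in the Office Open XML format (.docx, .xlsx, .pptx), which is a zip package that keeps these properties in a part named `docProps/core.xml`; older binary formats such as .doc need a different approach. The function name `read_core_properties` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for title, creator and subject
# in the core-properties part of an Office Open XML package.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def read_core_properties(source):
    """Return title, author and subject from a .docx/.xlsx/.pptx file.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "title": root.findtext("dc:title", default="", namespaces=NS),
        "author": root.findtext("dc:creator", default="", namespaces=NS),
        "subject": root.findtext("dc:subject", default="", namespaces=NS),
    }
```

Calling `read_core_properties("proposal.docx")` (a hypothetical file name) returns a small dictionary that can be stored alongside the document as metadata.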

By using native metadata, combined with metadata applied through a content management tool, you can effectively link this unstructured data to structured data sources. Further, master data within your organization can be used to create specific metadata fields and values. Since the fields and values will be consistent with structured data, you can precisely link these disparate data sources.
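The linking step can be pictured as a simple join on a shared master data key. The sketch below assumes, purely for illustration, that both the CRM rows and the document metadata carry a `customer_id` field taken from master data; the records, paths and field names are invented.

```python
from collections import defaultdict

# Hypothetical structured rows, e.g. exported from a CRM system.
crm_accounts = [
    {"customer_id": "CUST-001", "name": "Acme Corporation", "annual_revenue": 1_200_000},
    {"customer_id": "CUST-002", "name": "Globex Inc.", "annual_revenue": 450_000},
]

# Hypothetical document metadata, as a content management tool might expose it.
documents = [
    {"path": "contracts/acme-2015.docx", "customer_id": "CUST-001", "doc_type": "contract"},
    {"path": "support/acme-ticket-88.pdf", "customer_id": "CUST-001", "doc_type": "support"},
    {"path": "photos/site-visit.jpg", "customer_id": "CUST-002", "doc_type": "photo"},
]


def link_documents(accounts, docs, key="customer_id"):
    """Attach each document to the structured row that shares its master key."""
    docs_by_key = defaultdict(list)
    for doc in docs:
        docs_by_key[doc[key]].append(doc["path"])
    return [{**row, "documents": docs_by_key.get(row[key], [])} for row in accounts]


for row in link_documents(crm_accounts, documents):
    print(row["name"], row["annual_revenue"], row["documents"])
```

Because the key values come from the same master data on both sides, the match is exact rather than a fuzzy comparison of names.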

Classification and transformation

Another key challenge when combining structured and unstructured data involves getting access to deeper insight locked in those blob files. Basic metadata can provide some insight, but it stops short of providing the depth of analysis most organizations need. As such, organizations must more precisely classify and, in some cases, transform unstructured data.

In reality, much of this work results in more extensive metadata. For document-based data -- such as Word, PDF or Excel -- entity extraction tools can construct both metadata fields and the corresponding values. These additional descriptors and their values add structure around the unstructured content. This structure can then be analyzed, much like structured content. The underlying document becomes an attachment to that additional structure.
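A very naive stand-in for entity extraction is dictionary matching: scan the document text for known master data terms and record each hit as a metadata value. The sketch below does only that. Commercial entity extraction tools rely on much richer linguistic and statistical techniques, and the terms and field names here are hypothetical.

```python
import re

# Hypothetical master data terms, grouped by the metadata field they populate.
MASTER_TERMS = {
    "customer": ["Acme Corporation", "Globex Inc."],
    "product": ["Widget Pro", "Widget Lite"],
    "geography": ["Germany", "Texas"],
}


def extract_entities(text, master_terms):
    """Return {field: [matched values]} for master terms found in the text."""
    found = {}
    for field, terms in master_terms.items():
        hits = [
            term for term in terms
            # Match the term as a whole phrase, ignoring case.
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text, re.IGNORECASE)
        ]
        if hits:
            found[field] = hits
    return found


document_text = (
    "Acme Corporation has asked for 500 units of Widget Pro "
    "to be delivered to its plant in Germany."
)
print(extract_entities(document_text, MASTER_TERMS))
```

The resulting fields and values are the added structure; the document itself stays attached to them as the source.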

If the content is locked in something less accessible, such as images or videos, more specialized processing is required. For videos, especially those with audio, transcription is often the correct first step. Firms like RAMP can ingest, process and return a full transcription of video with an audio track. This transcription is further time-coded to enable the video to be decomposed into segments that match a specific passage in the transcript. If your content is contained in images, optical character recognition will help extract text-based content that can be further processed through entity extraction. For non-text content, such as photographs of objects or people, you must turn to other techniques like object recognition. This approach allows adding descriptive metadata about the image contents, beyond the recorded mechanics contained in EXIF data. Look no further than your smartphone, camera or favorite photo tool for an example of this technique -- many of these devices already contain facial recognition and automated people tagging.
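For the video case, the sketch below shows how a time-coded transcript lets a text match point back to a segment of the video. The `Segment` shape and the sample transcript are assumptions for illustration, not the output format of any particular transcription service.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # what was said during this segment


# Hypothetical time-coded transcript.
transcript = [
    Segment(0.0, 12.5, "Welcome to the quarterly product review."),
    Segment(12.5, 41.0, "Widget Pro sales grew in Germany this quarter."),
    Segment(41.0, 63.2, "Next we will cover support ticket trends."),
]


def find_segments(segments, phrase):
    """Return (start, end) for every segment whose text mentions the phrase."""
    needle = phrase.lower()
    return [(s.start, s.end) for s in segments if needle in s.text.lower()]


print(find_segments(transcript, "widget pro"))
```

The returned time ranges can be stored as metadata, so a search hit opens the video at the matching passage instead of at the beginning.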

Reporting and analysis

Traditional BI tools are geared toward structured data. That has been the nature of these tools since their inception. As a result, they tend to provide little or no support for unstructured content. However, assuming you've applied a consistent set of master data and a robust set of metadata values based on inherent data and classifications, it is possible to use typical business intelligence suites.

But combining traditional BI tools with other technologies, such as enterprise search, can lead to better insight and greater use among workers. Just as unstructured data can be transformed for use in BI tools, structured data can be exposed through search engines. Search technology, in particular, is well-suited to provide a composite view of structured and unstructured data. Search tools naturally present facets -- dimensions in BI parlance -- and add a thesaurus function to help find individual records or groups. They also allow a user to drill into a particular result and its related content more easily than BI tools do. In fact, many modern search engines have made a point of adding support for indexing both structured and unstructured data sources.
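To make the facet idea concrete, here is a toy sketch of a search over a mixed index, with facet counts and a drill-down filter. It is an in-memory illustration only: the index entries and field names are hypothetical, and a real enterprise search engine handles ranking, a thesaurus and scale in ways this does not attempt.

```python
from collections import Counter

# Hypothetical search index mixing structured rows and documents;
# every entry carries the same master data fields.
index = [
    {"source": "crm", "customer": "Acme Corporation", "region": "EMEA", "text": "renewal opportunity"},
    {"source": "erp", "customer": "Acme Corporation", "region": "EMEA", "text": "invoice 1042"},
    {"source": "document", "customer": "Acme Corporation", "region": "EMEA", "text": "signed renewal contract"},
    {"source": "document", "customer": "Globex Inc.", "region": "Americas", "text": "renewal proposal draft"},
]


def search(entries, query, **filters):
    """Return entries whose text contains the query and that match all filters."""
    q = query.lower()
    return [
        e for e in entries
        if q in e["text"].lower() and all(e.get(k) == v for k, v in filters.items())
    ]


def facets(results, fields):
    """Count the values of each facet field across a result set."""
    return {field: Counter(r[field] for r in results) for field in fields}


hits = search(index, "renewal")
print(facets(hits, ["source", "customer"]))

# Drill down on one facet value, as a user would by clicking it.
drill_down = search(index, "renewal", customer="Acme Corporation")
print([r["source"] for r in drill_down])
```

The facet fields work only because every entry, structured or not, was tagged with the same master data values.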

Bringing it together

There is value in bringing together the worlds of structured and unstructured data. Firms have a treasure trove of valuable content, but much of it is artificially segregated through inconsistent master data or merely the format in which it's stored. To keep innovating, firms in every industry need to find the right way to create these composite views. Start by creating the right master data, apply that master data to all the content, extract hidden metadata from the unstructured data and create an analysis tool set that allows business users across the organization to access those insights.

