Content management software: Who will leverage semi-structured and unstructured data?

By acquiring the two major content management software companies, Documentum and FileNet, EMC and IBM are laying claim to a large body of existing users who are managing semi-structured (text, spreadsheets) and unstructured (Web graphics, video, audio, presentations) data.

Over the last couple of years, EMC has acquired Documentum and IBM has acquired FileNet. Documentum and FileNet are the two major content management software companies. More than that, they are relatively old content management software companies: both built on an ability to manage document and spreadsheet data at the desktop and LAN level to enter content management in the Internet boom years, and both survived the Internet bust. So by acquiring Documentum and FileNet, EMC and IBM are laying claim to a large body of existing users who are managing semi-structured (text, spreadsheets) and unstructured (Web graphics, video, audio, presentations) data.

IT is accustomed to viewing data management through the lens of the enterprise relational database managing structured relational data. Yet by some estimates, the proportion of data in the typical enterprise that is semi-structured or unstructured is approaching 90% and rising. Much of that data is related in some way to key enterprise data. In other words, even if most vital enterprise data sits in business-critical enterprise applications, the related data -- customer pictures, X-rays, security tapes, email, news footage -- sits outside the enterprise-application data stores. That related data would enhance the enterprise's ability to interact with customers and suppliers, brand the company more effectively and respond effectively to Sarbanes-Oxley legal discovery. Thus, the enterprise badly needs better ways of managing its semi-structured and unstructured data -- and relating it to mission-critical relational data.

By themselves, Documentum and FileNet would probably never have been able to provide the integration with relational data that users need. On the other hand, without credible content managers, neither IBM nor EMC could have provided the broad support for semi-structured and unstructured content that users also want.

But having carried out the acquisitions, what customer benefits are likely to emerge from either of the two in the short term? Or, to put it another way, what is the "end game," the database architecture that will allow effective content management and integration with relational data, and how close is each of the two to achieving it?


For EMC, Documentum seems to represent the keystone of an architecture that will introduce database-level and information-level metadata into a basically storage-oriented product set. The aim of introducing this metadata is to enable effective ILM (information lifecycle management) and intelligent information management (the next phase, requiring data classification in order to optimize storage of data not only by its age, but by other characteristics of each type of semi-structured and unstructured data).
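To make the idea of classification-driven storage concrete, here is a minimal sketch of a placement rule that looks at content type as well as age. The tier names, the age thresholds and the content-type labels are all hypothetical, chosen only for illustration; they do not describe any EMC or Documentum product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContentItem:
    name: str
    content_type: str      # hypothetical labels, e.g. "email", "video"
    last_accessed: date


# Hypothetical rule: these types must stay reasonably quick to retrieve,
# for example to answer a legal discovery request.
REGULATED_TYPES = {"email", "legal-document"}


def choose_tier(item: ContentItem, today: date) -> str:
    """Pick a storage tier from age AND content type (illustrative thresholds)."""
    age_days = (today - item.last_accessed).days
    if age_days <= 30:
        return "primary"       # recently used data stays on fast storage
    if item.content_type in REGULATED_TYPES:
        return "nearline"      # never pushed to deep archive, whatever its age
    if age_days <= 365:
        return "nearline"
    return "archive"
```

An age-only policy would send a two-year-old email to the archive tier; the type-aware rule above keeps it nearline, which is the kind of distinction that data classification is meant to enable.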

This metadata, spanning relational/semi-structured/unstructured data, is of use not only in storage management but also to databases and business-level strategists seeking to achieve the real-time enterprise or to leverage corporate information for competitive advantage. In order to provide such a global metadata repository, EMC will need to ingest metadata not only from Documentum but also from other content managers, databases and applications (see Figure 1; note that Documentum offers some adapters that allow input of non-Documentum content into a Documentum content store). In other words, to meet the oncoming need, EMC should build its repository upwards in the enterprise-architecture software stack.

Figure 1: A Possible Global Metadata Repository Architecture

Source: Infostructure Associates, August 2006

IBM and information on demand

With the acquisition of FileNet, IBM now can add the mass of content that FileNet controls to its arsenal. In order to integrate this content with other content, IBM has the "enterprise content integration" capabilities of WebSphere Information Integrator that IBM gained with its acquisition of Venetica. In order to amass metadata from applications, databases and content managers and place the metadata in a global metadata repository, IBM has Information Integrator itself.

At the same time, IBM can now store semi-structured and unstructured data in DB2 itself. DB2 9 ("Viper") enables coexistence of relational and XML data (IBM calls its approach "pureXML"):

  • XQuery transactions can be performed on XML and relational data.
  • SQL transactions can be performed on XML and relational data.

What differentiates IBM from, say, Oracle and Microsoft in this type of "hybrid" database is that IBM makes a great effort to associate data-type-specific metadata and indexes with each set of XML data, rather than treating it as an undifferentiated mass of objects. This, in turn, allows IBM to optimize transactional performance for each type of XML data.

Since XML allows users to encapsulate semi-structured and unstructured data as XML messages, support for XML and XQuery means generally accepted common formats for storing and performing transactions on semi-structured and unstructured data. Thus, a "hybrid" database can act like a content manager across all types of data, or an enterprise database across all types of data.
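The sketch below illustrates the general idea of querying relational and XML data together. It is a stand-in, not DB2 or pureXML syntax: SQLite holds the XML as plain text and Python's standard-library ElementTree does the XML navigation, whereas a hybrid database would do both inside the engine with XML-aware indexes. The table names, column names and sample documents are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# In-memory database: one relational table plus one table with an XML column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE claim (customer_id INTEGER REFERENCES customer(id), doc TEXT);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Acme Corp"), (2, "Globex")])
conn.executemany("INSERT INTO claim VALUES (?, ?)", [
    (1, "<claim><type>x-ray</type><amount>120</amount></claim>"),
    (1, "<claim><type>email</type><amount>0</amount></claim>"),
    (2, "<claim><type>x-ray</type><amount>300</amount></claim>"),
])


def claim_amounts(conn, customer_name, claim_type):
    """Relational filter (customer name) combined with an XML filter (claim type)."""
    rows = conn.execute(
        "SELECT c.doc FROM claim c "
        "JOIN customer cu ON cu.id = c.customer_id "
        "WHERE cu.name = ?",
        (customer_name,),
    )
    amounts = []
    for (doc,) in rows:
        root = ET.fromstring(doc)             # parse the stored XML text
        if root.findtext("type") == claim_type:
            amounts.append(int(root.findtext("amount")))
    return amounts
```

Calling `claim_amounts(conn, "Acme Corp", "x-ray")` joins on the relational customer name and then filters on an element inside each XML document, which is the combination that a hybrid database aims to handle in a single query.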

Pair DB2 with WebSphere Information Integrator, and we have what Infostructure Associates calls a "virtual operational store" (VOS): an entity that looks like a single database, stores or caches key operational data with updates replicated to other data stores, and has a global view of data not included in the VOS's store. In other words, such a VOS can mimic to some extent an enterprise-wide database containing all of the enterprise's data, unstructured, semi-structured and structured. Such a VOS can make "information on demand" for a real-time enterprise more of a reality by ensuring that "information" also means semi/unstructured data.
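As a rough illustration of that definition, the following sketch models the three VOS behaviors: a local store for key operational data, replication of updates to other data stores, and a pass-through view of data held elsewhere. The class and method names are hypothetical and greatly simplified; they are not an IBM API, and a real implementation would also have to deal with transactions, conflicts and query federation.

```python
class DataStore:
    """Hypothetical stand-in for any back-end store (database or content manager)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value


class VirtualOperationalStore:
    """Looks like one database: local data, replicated writes, global reads."""

    def __init__(self, replicas, remote_sources):
        self._local = {}                 # key operational data held by the VOS
        self._replicas = replicas        # stores that receive every update
        self._remote = remote_sources    # stores the VOS can see but does not own

    def put(self, key, value):
        self._local[key] = value
        for store in self._replicas:     # replicate the update outward
            store.put(key, value)

    def get(self, key):
        if key in self._local:           # served from the VOS's own store
            return self._local[key]
        for store in self._remote:       # otherwise fall back to the global view
            value = store.get(key)
            if value is not None:
                return value
        return None
```

A caller sees a single `get`/`put` interface whether the requested item is a customer record held locally or an X-ray image that lives in a separate content manager.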

Where IBM has yet to complete such a vision is in two areas: first, FileNet and DB2 need to be integrated so that semi/unstructured data can be stored in whichever data store makes sense; and second, the metadata repository for which Information Integrator provides the key must be extended to storage management and collection of storage-level metadata.


In the short run, the acquisition of FileNet gives IBM an answer to customers concerned that IBM, compared to EMC, may not support their increasing semi/unstructured data storage needs. In the long run, however, customers will need a broader solution that recognizes that (a) content is much more prevalent than relational data, and (b) much of the value of content is in its relationships with business-critical relational data.

To handle these long-run customer needs, computer companies ranging from IBM and EMC to Oracle, Microsoft, HP and Sun will need to develop database architectures that support and integrate content and relational data in the same data store, as well as integration between separate content managers and enterprise databases. Because of FileNet, DB2 and WebSphere Information Integrator, IBM now has a head start in delivering this kind of functionality and integration. At the same time, IT buyers should keep in mind that every computer company has a long way to go to achieve full content integration. For one thing, "intelligent information management" is still on the drawing board.

In the meantime, the best start that users can make towards this goal is to begin to develop a global metadata repository that can include content-type metadata. For example, master data management efforts should include efforts to classify semi/unstructured data related to customer data (e.g., pictures, legal documents or email) as part of master data and as content-type data. As storage metadata is also created, that can be folded into the repository. This will provide a solid base for managing not only content but also content/structured-data relationships across the enterprise, as vendor offerings arrive. In other words, there is no need to wait; some of the long-run benefits of semi/unstructured data can be realized now.
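One way to picture such a repository is as a catalog that ties each piece of content to a master data key and a content type, with room for storage metadata to be added later. The sketch below is a minimal in-memory version under those assumptions; the field names, content-type labels and storage attributes are hypothetical and do not correspond to any vendor's schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ContentMetadata:
    content_id: str
    content_type: str       # e.g. "picture", "legal-document", "email"
    source_system: str      # where the content itself lives
    master_key: str         # master data record it relates to, e.g. a customer ID
    storage: dict = field(default_factory=dict)   # filled in as storage metadata arrives


class MetadataRepository:
    def __init__(self):
        self._by_id = {}
        self._by_master_key = defaultdict(list)

    def register(self, entry):
        """Catalog one piece of content against its master data key."""
        if entry.content_id in self._by_id:
            raise ValueError(f"duplicate content_id: {entry.content_id}")
        self._by_id[entry.content_id] = entry
        self._by_master_key[entry.master_key].append(entry.content_id)

    def add_storage_metadata(self, content_id, **attrs):
        """Fold storage-level metadata into an existing entry."""
        self._by_id[content_id].storage.update(attrs)

    def related_content(self, master_key, content_type=None):
        """List content tied to a master record, optionally filtered by type."""
        ids = self._by_master_key.get(master_key, [])
        entries = [self._by_id[cid] for cid in ids]
        if content_type is not None:
            entries = [e for e in entries if e.content_type == content_type]
        return entries
```

With entries registered against a customer key, a call such as `related_content("cust-42", "email")` returns the email metadata tied to that customer, and `add_storage_metadata` lets storage details be attached once they become available.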

About the author

Wayne Kernochan is president of Infostructure Associates, an affiliate of Valley View Ventures that aims to provide thought leadership and sound advice to both vendors and users of information technology. This document is the result of Infostructure Associates sponsored research. Infostructure Associates believes that its findings are objective and represent the best analysis available at the time of publication.

This was first published in September 2006