Records management is not for the faint of heart. The volume and rapid growth of data can be overwhelming for information managers, and a range of business and regulatory requirements make it imperative to do the job right. Mistakes can be costly.
In the case of a law firm's client or a hospital's patient, records management is a "big deal," noted Peter Perera, president of The Perera Group, a consultancy in Los Angeles. For example, he said, the scope and complexity of records management become clear when a large number of contracts and agreements are involved or when multiple versions of documents exist.
Companies also hold on to information they don't need. "Storage is sufficiently available, cheap and reliable that organizational retention policies no longer have to be driven by storage [cost] considerations," said Seth Grimes, principal consultant at Alta Plana Corp., a research firm. Furthermore, regulatory and policy considerations remain a big driver for retention and can result in too much retention.
In recent years, new big data technologies have been tapped to address this challenge, adding insight to the retention and management capabilities an organization already has in place. While not a perfect solution, better awareness and analysis of a company's available information can help prevent the problems incurred by over-retention.
The need for better data visibility
Carol Stainbrook, executive director at Cohasset Associates in Minneapolis, who works with clients to craft records management policies, said that from a records management or information governance standpoint, she encourages clients to focus on retaining information that supports business operations. "The organization generally isn't focused on records management per se; it exists for other goals, such as making a profit," she noted.
When it comes to data generated outside an organization's systems, for example, Grimes noted that business use drives collection and retention policies -- not compliance. "I'm referring to text, images, video and associated metadata posted to online and social platforms," he explained.
Of course, to varying degrees, organizations are part of the compliance structure, so for other information, they have to be aware of regulations and retain accordingly. In addition, some information has legal value to an organization -- and some information can represent legal risk.
Stainbrook said that the ultimate goal is to have appropriate information to support internal business processes while avoiding over-retention of information that may increase risk. "We work with business intelligence and analytics to help companies make sure they only retain a subset of information to satisfy business information needs. Sometimes, it is a matter of simply cleansing the data, for example by stripping out names but keeping the records themselves. … Big data, faster search and better business intelligence capabilities all play a growing role in that process," Stainbrook said.
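The cleansing step Stainbrook mentions, stripping out names while keeping the records themselves, can be illustrated with a minimal sketch. The field names and the sample record below are hypothetical, and a real cleansing process would depend on an organization's own schema and legal guidance.

```python
# Hypothetical set of fields treated as personally identifying.
PERSONAL_FIELDS = {"name", "email"}

def strip_personal_fields(record, personal_fields=PERSONAL_FIELDS):
    """Return a copy of the record without the personally identifying fields."""
    return {key: value for key, value in record.items()
            if key not in personal_fields}

# Hypothetical insurance claim record: the business data survives, the name does not.
claim = {"name": "Jane Doe", "email": "jane@example.com",
         "policy_type": "auto", "claim_amount": 1200}
cleansed = strip_personal_fields(claim)
```

The original record is left unchanged, so the full version can still be deleted or archived under a separate retention rule.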
And more discriminating retention policies will become ever more important, she said, with the explosive growth of unstructured data. "Based on our client metrics, we are finding 40% to 60% growth, compounded annually, in unstructured records. Some of that is the result of more complex information leading to larger file size, and some of it is simply more files and more content, such as emails," she said.
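To put those growth rates in perspective, compounding at 40% to 60% a year multiplies a records store by roughly 5.4 to 10.5 times over five years. The short sketch below works through that arithmetic; the 100 TB starting volume is a hypothetical figure chosen only for illustration.

```python
def projected_volume(initial_tb, annual_growth, years):
    """Volume after compounding annual_growth (e.g. 0.40 for 40%) for the given years."""
    return initial_tb * (1 + annual_growth) ** years

# Hypothetical starting point: 100 TB of unstructured records.
START_TB = 100.0

for rate in (0.40, 0.60):
    yearly = [round(projected_volume(START_TB, rate, year), 1) for year in range(6)]
    print(f"{rate:.0%} annual growth, years 0-5 (TB): {yearly}")
```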
"On the consumer-facing side, we see examples in financial services and insurance where companies need to analyze their information more effectively" to better understand risk and demographics, she explained. But it isn't just client and transaction records that present challenges, she noted. The use of "unstructured analytics" is burgeoning in the area of product development. "Scientists … may need to know everything about a particular molecule and how it has been analyzed and studied in the past," she said, so they can rediscover important information for development and avoid duplicating effort.
Using data analytics tools to sort information
To accomplish these tasks, Stainbrook said organizations are turning to analytic tools that examine metadata and then to those that ingest information and "crawl" through the content in an effort to understand its value. But "being able to identify and analyze metadata is usually a good start," she said.
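As a rough illustration of that metadata-first step, the sketch below gathers basic file metadata (path, size, last-modified date) and lists files older than a cutoff, largest first, as candidates for retention review. It is not any vendor's tool; the `FileMeta` type, the function names and the seven-year default are assumptions made for this example, and real retention periods come from policy and regulation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path

@dataclass
class FileMeta:
    path: str
    size_bytes: int
    modified: datetime

def collect_metadata(root):
    """Walk a directory tree and record metadata only; file contents are never read."""
    metas = []
    for entry in Path(root).rglob("*"):
        if entry.is_file():
            info = entry.stat()
            metas.append(FileMeta(str(entry), info.st_size,
                                  datetime.fromtimestamp(info.st_mtime)))
    return metas

def review_candidates(metas, now, max_age_days=7 * 365):
    """Files last modified before the cutoff, largest first.

    max_age_days is a hypothetical default, not a recommended retention period.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = [meta for meta in metas if meta.modified < cutoff]
    return sorted(stale, key=lambda meta: meta.size_bytes, reverse=True)
```

Calling `review_candidates(collect_metadata(some_directory), datetime.now())` would produce a list for a records manager to review; nothing here deletes anything, and tools that "crawl" content would be a later step.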
Grimes said that Hadoop, a framework that enables big data processing, has been enlisted for records analysis, too, whether by a firm's own staff or as part of a software or service solution that an organization licenses. But while Hadoop is great for large-scale information extraction from text, he noted, "It's only one of several implementations of parallelized processing, and it's not the best answer in most text-processing situations."
According to Grimes, for text and other data forms, there are essentially four prevailing big data approaches: use Hadoop, extend it, coexist with it or use an alternative. "IBM BigSheets fits in my category two, providing a Hadoop front end, although I'd characterize BigSheets as not fully productized; Digital Reasoning's Synthesys is another 'extend it' example," Grimes said.
"Other commercial alternatives straddle my categories three and four and include Teradata Aster, EMC Greenplum and a solution from Hewlett-Packard Vertica that is not on the market yet," Grimes said. In category four is the high-performance computing cluster (HPCC), an open source "massive parallel-processing computing platform that solves big data problems," Grimes noted. HPCC is a LexisNexis spin-off.
"The challenge for a commercial provider is that Hadoop software is free and they are not. So they offer Hadoop hooks, ally with Hadoop distribution providers such as Cloudera and Hortonworks, and focus on other elements that make them special and justify their price tags," he added.
However, Perera cautioned that big data technology, though it can play a role, may not have the "exactitude" to meet compliance requirements for unstructured data.
Organizations may not have to collect and archive online and social media content for regulatory and policy reasons, Grimes said. To save anything of potential value, they might choose instead to use a feed provider such as DataSift, Factiva, Gnip, Moreover, Spinn3r or Xignite, whether through streaming or on an occasional, on-demand basis.
"The crux of what is happening is that we have more sources of information and have become more dependent on knowledge workers," Stainbrook said. As a result, there is more need to have visibility into information connected to the business, while compliance tasks require you to "keep what you need and delete what you don't need." The idea that it is "cheap" to store data does not take into account the whole picture, because there are costs and risks involved with maintaining and managing that information, she explained.
"As you factor those additional costs into your IT budget, you will find records management can get very expensive; that's why companies need to recognize that they need to cull data on a regular basis," she added.