
E-discovery software brings serious ROI to information governance

Predictive coding tools have serious ROI beyond litigation, because employees can find what they're looking for -- quickly.

As the volume of information proliferates exponentially, companies are losing productivity. Sifting through terabytes of data by hand is no longer possible. So they are turning to technology to solve the problem.

Once the province of e-discovery, technologies such as autoclassification and predictive coding have made their way into information governance strategies, because companies need a better way to sift through the information they house than reviewing volumes of data manually. Companies now recognize that these tools have far-reaching applications beyond the world of e-discovery. They can help companies organize and manage information so that it's accessible and searchable, which can also save serious time and money.

Using technology to make information accessible and searchable has real ROI. According to McKinsey data, the average interaction worker spends an estimated 28% of the workweek managing e-mail and nearly 20% looking for internal information or tracking down colleagues who can help with specific tasks.

"If you're relying on individual employees to classify their records, it's just not going to work," said Susan Wortzman, the founder and president of Wortzmans, a consultancy specializing in legal and strategic issues related to digital information, in Toronto, Canada, and a speaker at the InfoGovCon conference in Hartford, Conn. "What we're finding more successful is using the technology to autoclassify the records so that employee input is minimized."

Wortzman sat down with SearchContentManagement to discuss the importance of tools like autoclassification and predictive coding.

Lawyers used to review the documents themselves -- and manually. Do you find resistance to enlisting technology-assisted review in the process of e-discovery?


A lot of law firms aren't ready for it because they don't understand it. In Canada, particularly, privileged information is sacrosanct. Lawyers feel like, "If I let a machine do it, how can I ensure that privileged communications aren't going out the door?" There's that discomfort that law firms struggle with still.

But now it's come to a point where the volume is too great, and you have to make business decisions. Lawyers are now saying, "We're going to have to let the technology help us classify the records and identify the records that are privileged. We can't have eyes on everything."

What are continuous active learning tools?

With some of the earlier machine learning tools, or predictive coding tools, you would train the computer as to what was relevant and what wasn't. Then it would spit out two buckets [of documents that pertain and those that don't] and say, "These are likely to be relevant records and these are likely not to be relevant." You would train it and train it and then you would stop and say, "The computer has enough information to make decisions now."

Continuous active learning functionality is different because reviewers train computers in the same way, but they keep going. You keep feeding the computer more information: every time a decision is made about a record's attributes -- responsiveness, privilege, whatever the attributes are -- that information continuously updates the system. It's learning more about the records as it goes.
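The loop Wortzman describes can be sketched in a few lines of Python. This is a deliberately minimal illustration, not any vendor's implementation: a simple word-weight scorer stands in for the real model, and every reviewer decision immediately updates it and re-ranks the unreviewed queue. All class and document names here are invented for the example.

```python
from collections import defaultdict

def tokens(text):
    return text.lower().split()

class ContinuousLearner:
    """Toy stand-in for a continuous active learning review tool."""

    def __init__(self):
        self.weights = defaultdict(float)  # word -> relevance weight

    def score(self, doc):
        return sum(self.weights[w] for w in tokens(doc))

    def record_decision(self, doc, relevant):
        # Every coding decision is fed back into the model right away.
        step = 1.0 if relevant else -1.0
        for w in tokens(doc):
            self.weights[w] += step

    def next_batch(self, unreviewed, k=2):
        # Serve up the documents the model currently thinks are most relevant.
        return sorted(unreviewed, key=self.score, reverse=True)[:k]

docs = [
    "merger agreement draft contract",
    "lunch menu for friday",
    "contract amendment signature page",
    "holiday party invitation",
]
learner = ContinuousLearner()
learner.record_decision("signed contract for the merger", relevant=True)
learner.record_decision("office party lunch invitation", relevant=False)
print(learner.next_batch(docs))
# ['merger agreement draft contract', 'contract amendment signature page']
```

The point of the sketch is the feedback loop: unlike the older train-then-stop tools, there is no moment where training ends; each `record_decision` call keeps refining what the next reviewer sees.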

How could these tools aid the process of information governance and categorization more generally?

That is key. We're using it now in e-discovery, but think about how amazing it would be in information governance. Every time someone looks at a record and says, "This is a contract; we're going to classify it as a contract and [it] relates to this deal," you're giving the system even more information about your data, and it gets better and better as it goes.

You set up rules to classify the kinds of information you're bringing into the system: These are contract[s], leases, [and] pleadings. And you are adding more and more information as records are classified.
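A rule-based starting point like the one described might look like the following sketch. The rules, class labels and keywords are all hypothetical assumptions for illustration; a real deployment would combine rules like these with the learning loop discussed above.

```python
# Hypothetical classification rules: each record class maps to keywords
# that suggest it. These labels and terms are invented for the example.
RULES = {
    "contract": ("agreement", "executed", "counterparty"),
    "lease": ("lessor", "lessee", "premises"),
    "pleading": ("plaintiff", "defendant", "court"),
}

def autoclassify(text):
    """Assign the class whose keywords match the text most often."""
    text = text.lower()
    scores = {
        label: sum(kw in text for kw in keywords)
        for label, keywords in RULES.items()
    }
    label, hits = max(scores.items(), key=lambda item: item[1])
    return label if hits else "unclassified"

print(autoclassify("The Lessor grants the Lessee use of the premises"))
# lease
```

Because classification happens as records enter the system, employee input is minimized to correcting the occasional miss, which is exactly the input a learning system can feed on.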

What about the idea of combining predictive coding with crowdsourced, lower-priced labor, as Matt Lease has suggested, to search documents more efficiently and cheaply? Then high-priced lawyers don't have to review all the documents that might be part of a lawsuit.

Law firms are doing this all the time now.

Is there resistance to this approach?

Clients are embracing it because it's a cost savings. The resistance from law firms has been, "We want to make sure everything is reviewed for privileged [documents] and we're signing off on this in Canada; we have to swear an affidavit that we've looked at everything." If you have a team in India doing document review, how can you as a lawyer in Toronto, Canada, say, "I can certify that all of these documents are relevant and not privileged"? It's pretty tough to do.

But the business reality is that clients are saying, "We have to do it this way. We can't pay high-priced lawyers to sit and review hundreds of thousands of records anymore." Now firms have accepted that it's a workable model. If they do some QA on their own and some second-level review on their own, it's becoming a pretty standard model. When you start with a million records, and you ultimately have to review only 10,000 manually, it's an improvement.

How do law firms certify that the documents are a statistically valid sample?

Before we had conceptual clustering and analytics tools and predictive coding, we would use keyword searching. So, for example, we'd start off with 1 million records, remove duplicates, filter by date, then have 500,000 records. Then we'd do some keyword searching and search the 500,000 records and end up with 100,000 for review. For the 400,000 that were not set aside to be reviewed because they weren't responsive to a keyword, we would take a statistical sample and review them to ensure there wasn't anything relevant. And if there were relevant documents in the sample, we would adjust our keyword search to pick up those kinds of records.
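The culling workflow just described -- dedupe, date filter, keyword search, then a statistical sample of the set-aside documents -- can be sketched as a small pipeline. Everything here is a toy assumption (document structure, keywords, sample size); real matters involve far larger volumes and defensible sampling methodology.

```python
import random
from datetime import date

def cull(docs, keywords, start, end, sample_size=2, seed=0):
    # 1. Remove duplicates by text.
    seen, unique = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            unique.append(d)
    # 2. Filter by date range.
    in_range = [d for d in unique if start <= d["date"] <= end]
    # 3. Keyword search splits the review set from the set-aside set.
    hits = [d for d in in_range if any(k in d["text"].lower() for k in keywords)]
    rest = [d for d in in_range if d not in hits]
    # 4. Take a random statistical sample of the set-aside docs for QA,
    #    to check that nothing relevant was missed.
    sample = random.Random(seed).sample(rest, min(sample_size, len(rest)))
    return hits, sample

docs = [
    {"text": "Merger agreement v2", "date": date(2014, 3, 1)},
    {"text": "Merger agreement v2", "date": date(2014, 3, 1)},  # duplicate
    {"text": "Lunch plans", "date": date(2014, 4, 2)},
    {"text": "Old memo", "date": date(2009, 1, 1)},  # outside the date range
]
hits, sample = cull(docs, ["merger"], date(2014, 1, 1), date(2014, 12, 31))
print(len(hits), len(sample))
```

If the QA sample in step 4 surfaces relevant documents, the keyword list is widened and the pipeline rerun, which mirrors the iterative adjustment Wortzman describes.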

Today, there are other tools. I wouldn't say we're not using keyword searching -- sometimes it's altogether appropriate. But it's a toolbox. Sometimes you want to use a hammer, sometimes a screwdriver and sometimes a wrench. Sometimes keyword searching is perfectly appropriate, sometimes predictive coding is, and sometimes another tool is. It's a matter of understanding your records and determining what the best tool is in a particular case.

With other tools, you're not looking based on a certain word; you're looking for a concept or a certain class of records. If you're involved in tobacco litigation, you can't just search the word cigarette, because it will pull in everything, including a lot of false positives.
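The difference can be illustrated with a toy contrast between a single keyword and a "concept" scored as a bag of related terms. This is a deliberately simplified stand-in for real conceptual-analytics tools; the term list and documents are invented for the example.

```python
# A "concept" here is just a set of related terms scored together,
# rather than a single keyword matched in isolation.
def concept_score(doc, concept_terms):
    words = doc.lower().split()
    return sum(words.count(t) for t in concept_terms)

TOBACCO_CONCEPT = ["cigarette", "nicotine", "smoking", "tar", "filter"]

docs = [
    "study of nicotine and tar levels in smoking products",
    "cigarette break room schedule for staff",
]
for d in docs:
    print(d, "->", concept_score(d, TOBACCO_CONCEPT))
```

The substantive document scores on several concept terms at once, while the incidental "cigarette" mention scores on only one, which is how concept-style ranking pushes false positives down the list even when the single keyword matches both.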

How can these approaches have ROI beyond e-discovery for companies?

A lot of these tools were developed in the e-discovery context, but they can be used more broadly so that organizations can know their information. There is huge business value in having that information at your fingertips -- in being able to search through it and find what you want quickly.

If you don't have a good handle on your information and your competition does, they're going to pull ahead of you in the marketplace. If you understand your information, you can use it to innovate: "Now I have learned this about my market, and I can go out there and be first to market in solving this problem."

