Data pros often talk at length about the importance of data cleaning and data governance to their initiatives. But with unstructured data becoming more central to the analysis of social media data, there is some debate about when to cleanse it. Many data scientists, for example, want to see the data unvarnished so they can identify outliers and trends for themselves.
At the AIIM New England chapter meeting, we discussed best practices for dealing with dirty data. Steve Weissman of the Holly Group, who is also president of the chapter, was on hand to offer some thoughts.
"It's not so much whether to clean dirty data, but when," he said. "There is value to getting all the raw data in, so that whoever is doing the analysis can make their own decisions about what biases are introduced by the dirty data. The alternative is clean it up first."
According to Weissman, unstructured data in particular may contain important information that is difficult to clean up but worth retaining. Audience members also discussed the possibility of segmenting outliers out of the data set, then reintegrating that segment once it has been analyzed.
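For readers who want to picture the segment-then-reintegrate idea, here is a minimal Python sketch. It is only an illustration under stated assumptions: nobody at the meeting specified how outliers should be identified, so the interquartile range (IQR) rule, the function names, and the sample numbers below are all hypothetical choices, not a method the speakers endorsed.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the IQR rule.

    Each item is kept as an (original_index, value) pair so the
    segments can be merged back in their original order later.
    The IQR rule and k=1.5 are assumptions made for this sketch.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    inliers, outliers = [], []
    for index, value in enumerate(values):
        target = inliers if low <= value <= high else outliers
        target.append((index, value))
    return inliers, outliers


def reintegrate(inliers, reviewed_outliers):
    """Merge the reviewed outlier segment back in original order."""
    merged = sorted(inliers + reviewed_outliers)
    return [value for _, value in merged]


# Hypothetical data: daily mention counts with one suspicious spike.
counts = [12, 15, 11, 14, 13, 250, 16, 12]
inliers, outliers = split_outliers(counts)

# The outlier segment would be analyzed on its own at this point.
# In this sketch we decide to keep every outlier unchanged.
restored = reintegrate(inliers, outliers)
```

Keeping the original index alongside each value is what makes the reintegration step straightforward: nothing is discarded, and the analyst decides what to do with the outlier segment before it goes back in.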
For more, check out this video.