This article is part of an Essential Guide, our editor-selected collection of our best articles, videos and other content on this topic.
The challenges of dirty data
Date: May 14, 2014
Data pros often talk extensively about the importance of data cleansing and data governance to their initiatives. But with unstructured data, such as content from social media platforms, becoming more central to analysis, there is some debate about when to cleanse the data. Many data scientists, for example, want to see the data unvarnished so they can identify outliers and trends for themselves.
At the AIIM New England chapter meeting, we discussed best practices for dealing with dirty data. Steve Weissman of the Holly Group, who is also president of the chapter, was on hand to offer some thoughts.
"It's not so much whether to clean dirty data, but when," he said. "There is value to getting all the raw data in, so that whoever is doing the analysis can make their own decisions about what biases are introduced by the dirty data. The alternative is clean it up first."
According to Weissman, unstructured data in particular may contain information that is difficult to clean up but important to retain. Audience members also discussed the possibility of segmenting outliers from the rest of the data, analyzing that segment separately, and then reintegrating it.
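As a rough illustration of the segment-then-reintegrate idea raised by the audience, here is a minimal Python sketch. It is not a method described at the meeting: the record layout, the `mentions` field, the function names and the choice of Tukey's interquartile-range rule for spotting outliers are all assumptions made for the example. The point it shows is that outliers can be set aside and tagged rather than deleted, so the raw values survive for whoever does the analysis.

```python
from statistics import quantiles


def segment_outliers(records, field, k=1.5):
    """Split records into (inliers, outliers) using Tukey's IQR fences.

    `records` is a list of dicts and `field` names a numeric key.
    Both are hypothetical; adapt them to the real data layout.
    """
    values = [r[field] for r in records]
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread

    inliers, outliers = [], []
    for r in records:
        target = inliers if low <= r[field] <= high else outliers
        target.append(r)
    return inliers, outliers


def reintegrate(inliers, outliers):
    """Recombine both segments, tagging each record instead of dropping any."""
    merged = [dict(r, outlier=False) for r in inliers]
    merged += [dict(r, outlier=True) for r in outliers]
    return merged


if __name__ == "__main__":
    # Made-up daily mention counts, purely for illustration.
    counts = [10, 12, 11, 13, 12, 11, 300]
    records = [{"id": i, "mentions": c} for i, c in enumerate(counts)]

    inliers, outliers = segment_outliers(records, "mentions")
    print("set aside for separate review:", outliers)

    merged = reintegrate(inliers, outliers)
    print("records after reintegration:", len(merged))
```

Because the outlier flag travels with each record, an analyst can still choose to include or exclude the flagged rows later, which keeps the "when to clean" decision open.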
For more, check out this video.