I was recently at Gartner’s Data and Analytics event, joining thousands of other data nerds and technologists to dig into the latest and greatest in the world of data, AI, and analytics. Ted Friedman, one of Gartner’s leading analysts, had this to say:
“Data quality…that age old problem…is still on the radar.”
We’ve heard the same thing from most of our clients.
Will we ever NOT have data quality issues? As long as data is being created, the harsh reality is NO. Data quality issues will always exist. But why?
Data quality issues will always exist because new data is created with a different intent or context than the old data. When we compare the two and the data differs, which is correct? Is that the right name? Is that the right address? Is this the correct amount, value, or date?
Let’s use the legal name of a company as an example. One system may have a client named JP Morgan. Another may have a client named J.P. Morgan Chase & Co. Which is “right”?
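To make that ambiguity concrete, here is a minimal sketch of the matching problem using only Python’s standard library. The names are the illustrative ones above; a real entity-resolution pipeline would add normalization rules, legal-suffix handling, and a curated reference list.

```python
# A minimal sketch of the name-matching problem (illustrative names only).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

name_in_system_a = "JP Morgan"
name_in_system_b = "J.P. Morgan Chase & Co."

score = similarity(name_in_system_a, name_in_system_b)
# A middling score: similar enough to suspect the same entity, but
# string similarity alone cannot tell us which spelling is correct.
print(f"Similarity: {score:.2f}")
```

String comparison can flag the two records as a possible match, but it has no way to say which version is the real legal name. For that, you need outside evidence.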
As Segal’s Law goes: “A man with a watch knows what time it is. A man with two watches is never sure.” It turns out that this well-known adage about time is also true for data.
For data, “right” is in the eye of the beholder, the user. Historically, to answer which value is right, the essence of data quality, you often must consult a third source. That third source? A document. Did you know knowledge workers spend 50% of their time hunting for data, finding and correcting errors, and searching for confirmatory sources for data they don’t trust?
This is why text analytics is so important, and a big reason we’ve chosen to invest so heavily in it. Documents are the tie-breaker. From legal agreements, regulatory filings, policies, and official notices to news articles, these unstructured sources are almost always used to inform data quality. The problem, though, is that they are far more difficult and time-consuming to find, read, and use than simply checking another watch. Having a data steward search through hundreds of documents and read thousands of pages isn’t sustainable. With text analytics capabilities, however, software can scour millions of documents and read hundreds of thousands of words in minutes to help triangulate the sources.
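As a rough illustration of that triangulation idea, here is a sketch that counts how much documentary support each candidate name has across a folder of text files. The “filings/” folder and plain-text format are hypothetical, and real text-analytics platforms use NLP models rather than raw substring counts, but the principle is the same: let a trusted corpus vote.

```python
# A rough sketch of document-based triangulation. The "filings/" folder
# and plain-text format are hypothetical; production systems would parse
# PDFs, normalize text, and use entity recognition instead of text.count().
from pathlib import Path

candidates = ["JP Morgan", "J.P. Morgan Chase & Co."]

def count_mentions(corpus_dir: str, names: list[str]) -> dict[str, int]:
    """Count how often each candidate name appears across a document corpus."""
    counts = {name: 0 for name in names}
    for doc in Path(corpus_dir).glob("*.txt"):
        text = doc.read_text(encoding="utf-8", errors="ignore")
        for name in names:
            counts[name] += text.count(name)
    return counts

evidence = count_mentions("filings/", candidates)
best = max(evidence, key=evidence.get)
print(f"{best!r} has the most documentary support: {evidence}")
```

The steward’s judgment still matters, but instead of reading thousands of pages, they review a ranked summary of what the documents actually say.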
With data quality being a constant for the foreseeable future, text analytics is becoming a must-have capability for scaling enterprise data efforts. Training natural language models on the unique data attributes and rules of each new application provides a scalable way to identify data quality issues faster and deploy sooner.
Text analytics is becoming one of the most important capabilities for any company building data-driven processes and applications. Is text analytics in your data quality arsenal?