Three Data Problems Lurking in your AI Strategy

Kingland Product Management
Aug 26, 2024 1:46:29 PM

Why your AI programs will fail: Data

Everyone is investing in AI. The ability to automate and improve so many business processes makes AI a game changer for most businesses. Advancements in generative AI have simplified experimentation, and many firms are running pilot programs, proofs of concept, and full-blown initiatives to roll out AI applications across a business unit or the enterprise.

A problem is lurking, though: reference data. Predictive and generative AI models will work with whatever data you feed them, but if the reference data within those data sets is not fit for purpose, the models simply won’t perform.

Consider this. A bank or insurance company wants to develop a next-best-recommendation model that offers clients products or services to enhance their experience with the institution. If the product and service data fed from multiple systems is inconsistent, with different descriptive attributes or even different terms, conditions, and language, the recommendations will be sub-optimal. Can the AI models still provide value? Yes: with large volumes of transaction data, historic decisions, customer behavior, and risk models, reasonable recommendations can be made. However, fundamental discrepancies in the core reference data will keep the models’ accuracy well short of their true potential for value.

Limitations in reference data are nothing new to regulated industries. Programs have existed for years to address weaknesses and inconsistencies in the reference data required by risk and regulatory standards. However, those standards don’t require institutions to address all data, which is why careful consideration must be given to each AI initiative.

The limitations found in reference data are endless. Here are three data problems lurking in reference data that impact AI programs today.

Similar data, separate systems

Many AI initiatives are focused on the behavior of customers. Large organizations typically have customer information spread across many different systems by product line, business unit, geography, or even function. While efforts ranging from master data management to enterprise data governance programs may exist to standardize and promote common customer data across the enterprise, many limitations remain. Are the addresses and locations correct? Are the customers categorized or classified consistently? Are individuals vs. entities treated the same across systems? Is demographic information tracked the same? Are records updated based upon real world events (e.g., death, relocation) on the same timeline?

Consider a large bank with divisions including commercial banking, retail banking, credit cards, mortgage, custody, asset management, and wealth management. Are the attributes describing the same customer across these divisions consistent enough to support an AI initiative? The same is true for a healthcare provider with many different departments and modalities for treating patients and administering care. Is the data collected during a patient interaction consistent across such a wide variety of providers and situations? In any of these industries, inconsistencies always exist.

The data doesn’t have to be perfect, but resolving inconsistencies in even one or two key attributes may dramatically improve the reliability of predictive AI models.
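The cross-system consistency questions above can be sketched in a few lines of code. This is a minimal, hypothetical example; the field names ("cust_id", "country", "segment") and the two source systems are illustrative assumptions, not a prescribed implementation:

```python
# Hypothetical sketch: flag customer reference data that disagrees between two
# systems. Normalization first, so cosmetic differences don't count as conflicts.

def normalize(value: str) -> str:
    """Lowercase and strip whitespace before comparing."""
    return value.strip().lower()

def find_conflicts(system_a: dict, system_b: dict, keys: list[str]) -> dict:
    """Return the attributes whose normalized values disagree between systems."""
    conflicts = {}
    for key in keys:
        a = normalize(system_a.get(key, ""))
        b = normalize(system_b.get(key, ""))
        if a != b:
            conflicts[key] = (system_a.get(key), system_b.get(key))
    return conflicts

# Same customer as seen by two (assumed) divisions of the bank.
retail = {"cust_id": "C100", "country": "United States", "segment": "Individual"}
wealth = {"cust_id": "C100", "country": "USA", "segment": "individual"}

print(find_conflicts(retail, wealth, ["country", "segment"]))
# "segment" agrees after normalization; "country" is a true conflict to remediate.
```

In practice this kind of check runs at scale inside a master data management or data refinement environment, but the principle is the same: separate cosmetic noise from genuine reference data conflicts before the data reaches a model.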


NLP / domain-specific language models

Large language models (LLMs) are all the rage. However, most industries have their own unique language to describe the way businesses transact. Curating data sets to derive LLMs that truly represent the language of an industry is essential to delivering reliable AI outcomes.

Let’s use a simple phrase: “head and shoulders”

  • If you’re a healthcare provider, you may be describing the location of pain requiring treatment.
  • If you’re a retailer, you might be referring to your most popular shampoo product.
  • If you’re an investment manager, you may be referencing a technical trading strategy.
  • If you’re simply on the internet, "head and shoulders" may just be part of the idiom "head and shoulders above," describing something clearly superior.

Looking at highly regulated financial markets, there is a language for fixed income markets and even for the municipal sub-market. Similarly, equities, with all of their venues and markets for trading publicly or privately, demand another language. The funds market requires yet another, as asset managers create complex documentation describing all of the terms and conditions of their investment strategies and fund structures.

Each of these distinct use cases requires reference data that fits the predictive or generative use cases being pursued. Extracting content from documents and training NLP models to interpret terms and conditions, so that industry-relevant data can be identified, is a key task in building reliable reference data.
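The "head and shoulders" ambiguity above can be made concrete with a toy domain glossary: before an ambiguous term reaches a downstream model, resolve it against the industry's own vocabulary. The glossary contents and function name here are illustrative assumptions, not part of any real NLP pipeline:

```python
# Hypothetical sketch: resolve an ambiguous term against a domain-specific
# glossary before it reaches a downstream model. Glossary entries are assumed.

DOMAIN_GLOSSARY = {
    "healthcare": {"head and shoulders": "anatomical location (site of pain or injury)"},
    "retail":     {"head and shoulders": "shampoo product line"},
    "investing":  {"head and shoulders": "technical chart pattern signaling a trend reversal"},
}

def resolve_term(term: str, domain: str) -> str:
    """Look up a term's domain-specific sense; flag it for review if unknown."""
    sense = DOMAIN_GLOSSARY.get(domain, {}).get(term.lower())
    return sense if sense else f"UNRESOLVED: '{term}' needs review in domain '{domain}'"

print(resolve_term("Head and Shoulders", "investing"))
# prints "technical chart pattern signaling a trend reversal"
```

A real system would use a trained, domain-specific language model rather than a lookup table, but the design point is the same: the interpretation of a term depends on curated, industry-specific reference data, and unknown terms should be surfaced for review rather than silently guessed.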


Missing data / enrichment and remediation

An AI initiative may have the perfect use case, the strongest support, and incredibly valuable data, and still struggle for one simple reason: missing data. The systems most valuable for training AI models have typically been in use for many years, which means their data can show the history and trends of important behavior. However, the older the data, the higher the probability of missing attributes. An attribute may have been added 5 years ago, which means it doesn’t exist for the 10 years prior. Or perhaps governance was lacking until it was improved 4 years ago, and data was entered manually in a very inconsistent manner over the 20 years before that.

Missing data is simply the reality of most enterprise systems. As AI projects look to leverage data, understanding where data may be missing is hugely important. Inconsistently populated fields are not always detrimental to a model; however, a significant remediation or enrichment effort is often required to populate them.
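Understanding where data is missing usually starts with profiling missing-attribute rates by record vintage, which makes it obvious when a field was introduced or when governance improved. A minimal sketch, assuming illustrative field names ("industry_code", "email") and a tiny sample of records:

```python
# Hypothetical sketch: profile missing-attribute rates by record vintage.
# Records and field names are illustrative assumptions.

from collections import defaultdict

records = [
    {"year": 2008, "industry_code": None,   "email": None},
    {"year": 2012, "industry_code": None,   "email": "a@example.com"},
    {"year": 2021, "industry_code": "6021", "email": "b@example.com"},
]

def missing_rates_by_year(rows, fields):
    """Fraction of records per year in which each field is unpopulated."""
    # year -> field -> [missing_count, total_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for row in rows:
        for field in fields:
            stats = counts[row["year"]][field]
            stats[0] += row.get(field) is None
            stats[1] += 1
    return {year: {f: m / t for f, (m, t) in fs.items()}
            for year, fs in counts.items()}

print(missing_rates_by_year(records, ["industry_code", "email"]))
# Older vintages show high missing rates; attributes added later are fully
# absent before their introduction, signaling where enrichment is needed.
```

A profile like this tells the project team, before any model is trained, which attributes need enrichment for older vintages and which can be used as-is.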


Data Refinement

To deal with the data issues threatening their top AI initiatives, many firms are considering a new refinement approach. Data refinement enables data scientists to leverage environments where they can conduct analyses and apply large-scale enrichments to missing or inconsistent reference data. Additional capabilities exist to harden NLP models and develop industry-specific LLMs that rely on industry-specific reference data. Lastly, firms are leveraging capabilities to conduct targeted remediation efforts that improve the most important reference data, which in turn improves the reliability of their AI initiatives.


If you would like to learn more about Kingland's approach to data refinement for AI, please contact us at outreach@kingland.com.
