What if you had access to AI models to assist in making enterprise data management decisions? That was the final challenge Kingland tackled as we worked through the Text Analytics Platform Suite and its personas.
As Kingland built its Text Analytics Suite, we were intrigued by the idea of software that could take in large volumes of annotated content and make decisions about enterprise data management.
Working through the aspects of our Text Analytics Suite, we landed on the CEO persona - our toughest challenge to date. Our goal at the outset was to use technical capabilities such as MapReduce and Hadoop to group content from different sources/inputs that were communicating the same thing.
It was a good start, but the initial implementation had less insight into the semantics of the content than into its surface makeup. We still faced duplicate information because of the diversity of content and the many ways the same thing can be discussed by separate pieces of content.
The initial model was fairly basic, looking at the word frequencies in the text without worrying much about the order of the words. This is known as the "bag of words" approach. Based on all of the content in the "bag," could the model tell us (leveraging a corporate actions data management use case, for example) whether an article contained an actionable event? At this point, the answer was more "kind of" than "yes" or "no."
During this time, every piece of text we looked at was ultimately appraised by a human. We were building a large collection of text snippets, each carrying a "yes" or "no" annotation from a human expert, and we used this collection to train a model over the content of the text.
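The core idea above can be sketched in a few lines of Python. This is a minimal illustration, not Kingland's implementation: the training examples and the scoring rule are hypothetical, and a real system would use a proper classifier, but it shows how a bag-of-words model discards word order and leans entirely on frequencies plus human "yes"/"no" annotations.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; word order is discarded,
    # only word frequencies survive -- the "bag of words" idea.
    return Counter(text.lower().split())

# Hypothetical human-annotated training set: (text, "yes"/"no" label).
training = [
    ("board approves special dividend payable in march", "yes"),
    ("company announces two for one stock split", "yes"),
    ("quarterly newsletter highlights office relocation", "no"),
    ("ceo interviewed about industry trends", "no"),
]

# Aggregate word counts per label.
label_counts = {"yes": Counter(), "no": Counter()}
for text, label in training:
    label_counts[label].update(bag_of_words(text))

def score(text):
    """Crude score: fraction of words seen more often in 'yes' examples."""
    words = bag_of_words(text)
    votes = sum(
        count for w, count in words.items()
        if label_counts["yes"][w] > label_counts["no"][w]
    )
    total = sum(words.values())
    return votes / total if total else 0.0
```

A score near 1.0 suggests an actionable event, near 0.0 suggests none, and values in between are exactly the "kind of" answers described above.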
Small Step Forward
We took a step forward using what is, by today's standards, a very basic model: it looked at the text we were analyzing and judged it on a scale from zero to 100. A score of 100 meant the model was highly confident that there was, for example, an actionable corporate event. We added a feedback loop from users and used that feedback to train a feed-forward neural network to further analyze and score the content.
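To make the scoring-plus-feedback loop concrete, here is a minimal sketch: a single-layer feed-forward scorer over a tiny fixed vocabulary, nudged toward user judgments one gradient step at a time. The vocabulary, examples, and class name are invented for illustration; the actual network had more layers and features.

```python
import math

class TinyScorer:
    """Single-layer feed-forward scorer that outputs 0-100.

    User feedback (a yes/no judgment) is fed back as one gradient
    step per example, gradually tuning the weights.
    """
    def __init__(self, vocab):
        self.vocab = vocab
        self.weights = [0.0] * len(vocab)
        self.bias = 0.0

    def _features(self, text):
        words = text.lower().split()
        return [1.0 if v in words else 0.0 for v in self.vocab]

    def score(self, text):
        x = self._features(text)
        z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return 100.0 / (1.0 + math.exp(-z))  # logistic, squashed to 0-100

    def feedback(self, text, is_event, lr=0.5):
        # One gradient step toward the user's yes/no judgment.
        x = self._features(text)
        p = self.score(text) / 100.0
        err = (1.0 if is_event else 0.0) - p
        self.weights = [w + lr * err * xi for w, xi in zip(self.weights, x)]
        self.bias += lr * err

vocab = ["dividend", "split", "merger", "newsletter", "interview"]
scorer = TinyScorer(vocab)
for _ in range(50):
    scorer.feedback("special dividend declared", True)
    scorer.feedback("company newsletter interview", False)
```

After the feedback loop runs, text resembling the confirmed events scores high and text resembling the rejected examples scores low, which is the behavior the paragraph describes.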
We added more features to help identify when separate content pieces were discussing the same event, enhancing the accuracy and strength of the Suite. Initially this worked, but we still experienced duplicate information, and we lacked a robust way of communicating that a new content piece had been grouped with the others. New challenges popped up as well.
Take, for example, the issue of collecting data from mutual fund prospectus documents. Instead of answering "yes" or "no" to an event, each document - or each mutual fund security instrument in a document - would have 300 or more "little" questions to answer. There were lots of mutual fund data elements to find, and we discovered that we would often get multiple candidate answers for one question. For a "portfolio turnover rate" data element, we might get percentages of both 17 and 45 (as an example), because the language processing part of the Text Analytics Scholar persona couldn't fully resolve the value on its own. The Scholar would simply identify both numbers as a portfolio turnover rate belonging to some mutual fund. The question we wanted the CEO to answer was: for a specific mutual fund, which of the percentages is more likely to be correct?
If the CEO were given the choice between "no" and selecting a, b, or c, could it discern which answer was most likely correct?
Using Naive Bayes classifiers and pulling in other data elements to review, we were able to model the statistical distribution of a data element throughout its history.
We knew we were on to something when we tuned the CEO to review these types of candidate answers, accurately relate them back to an entity (e.g. a mutual fund) and the history of the attribute, and take all of that into consideration to provide an answer.
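The candidate-selection idea can be sketched with a simple Gaussian likelihood, which is the continuous-attribute flavor of the Naive Bayes approach mentioned above. Everything here is hypothetical - the history values, the 3-sigma rejection rule, and the function names are illustrative assumptions, not the production model.

```python
import math

def gaussian_pdf(x, mean, std):
    # Likelihood of x under a normal distribution with the given parameters.
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical history of "portfolio turnover rate" values for one fund.
history = [15.0, 18.0, 16.5, 17.2, 19.0]
mean = sum(history) / len(history)
var = sum((v - mean) ** 2 for v in history) / len(history)
std = math.sqrt(var)

def pick_candidate(candidates):
    """Return the candidate most probable under the attribute's history,
    or None (the 'no' answer) if even the best candidate is implausible."""
    scored = [(gaussian_pdf(c, mean, std), c) for c in candidates]
    best_p, best = max(scored)
    # Reject candidates more than three standard deviations from the mean.
    if abs(best - mean) > 3 * std:
        return None
    return best
```

Given the 17-versus-45 example from the prospectus discussion, a fund whose turnover has historically hovered around 17 percent makes 17 the far more probable candidate, and an answer set containing only implausible values falls through to "no."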
Context is King
In 2018, the improvements made to our upstream processing methods drastically increased the reliability of using Named Entity Recognition and Event Detection to establish semantic equivalence across content. The normalization and tagging of the data allowed more sophisticated extraction to occur. For example, the CEO can interpret statements as attribute values intended for a data model schema, and it can help associate information tagged by the Scholar with the appropriate business context. We're moving toward the CEO representing the intuition of a subject matter expert, augmenting analysis and decision making for clients.
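The tagged-spans-to-schema step can be illustrated with a short mapping function. The span format, label names, and schema fields below are invented for the example; they stand in for whatever the Scholar actually emits and whatever the client's data model defines.

```python
# Hypothetical Scholar output: text spans tagged with entity/attribute labels.
tagged = [
    {"text": "Acme Growth Fund", "label": "FUND_NAME"},
    {"text": "17%", "label": "PORTFOLIO_TURNOVER_RATE"},
    {"text": "March 1, 2018", "label": "EFFECTIVE_DATE"},
]

# Target data model schema: label -> (normalized field name, converter).
SCHEMA = {
    "FUND_NAME": ("fund_name", str),
    "PORTFOLIO_TURNOVER_RATE": ("portfolio_turnover_rate",
                                lambda s: float(s.rstrip("%"))),
}

def to_record(spans):
    """Fold tagged spans into a normalized schema record.

    Labels with no schema mapping are dropped; mapped values are
    converted to the schema's expected type.
    """
    record = {}
    for span in spans:
        if span["label"] in SCHEMA:
            field, convert = SCHEMA[span["label"]]
            record[field] = convert(span["text"])
    return record
```

The point of the sketch is the separation of concerns: the Scholar tags and normalizes, while the CEO-style layer decides how tagged values land in the business context of a data model.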
These improvements to the CEO, combined with the data we now have, will help us continue to improve how efficiently clients get insight from unstructured content and data. Perhaps the CEO of the Text Analytics Suite tackles credit risk monitoring use cases next, comprehending what's transpiring because it understands the specific events mentioned within content and the entities and people involved, where relevant. After automating the assessment of that information, the Text Analytics Suite could then send out a credit risk monitoring alert.
The credit risk use cases and others keep us pursuing new ways to create decision-making intuition in AI software.
Learn more about the Text Analytics Suite and stay tuned for an announcement later this month.