The Challenge of Understanding Context in Content

Mike grabbed his book.

A simple sentence for anyone reading it, unless the reader is a machine. As humans, we understand that the word 'his' refers to Mike. A machine may not.

We're talking about anaphora, a form of co-reference in which one expression depends on another for its meaning. In "Mike grabbed his book," the challenge for language processing is resolving the pronoun 'his' back to the name Mike.
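To make the resolution problem concrete, here is a deliberately naive sketch in Python. It is not Kingland's approach or either winning team's method; the function name `resolve_pronouns`, the small pronoun list, and the "nearest preceding capitalized word" heuristic are all assumptions chosen for illustration.

```python
import re

# Hypothetical, intentionally small pronoun list for this illustration.
PRONOUNS = {"he", "him", "his", "she", "her", "hers", "they", "them", "their"}


def resolve_pronouns(sentence):
    """Link each pronoun to the nearest preceding capitalized, non-pronoun word.

    This is a toy heuristic, not a real co-reference resolver.
    """
    tokens = re.findall(r"[A-Za-z']+", sentence)
    links = []
    last_name = None
    for token in tokens:
        if token.lower() in PRONOUNS:
            # Only link when a candidate antecedent has already been seen.
            if last_name is not None:
                links.append((token, last_name))
        elif token[0].isupper():
            # Assumption: a capitalized word is a name.
            last_name = token
    return links


print(resolve_pronouns("Mike grabbed his book."))  # [('his', 'Mike')]
```

The heuristic handles the example sentence but breaks quickly: it treats any capitalized word (including a sentence-initial "The") as a name, it ignores gender and number, and it cannot handle a pronoun that appears before the noun it refers to. Those gaps are what make the problem hard in practice.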

At Kingland, we believe connecting concepts will be important as text analytics continues to evolve. "Aliases and references back to other things provide context and better understanding," says Chief Technology Evangelist Matt Good.

"We feel it's important that we understand what the content is talking about," he says, "especially as we improve how well we process entities, people and events from various sources." This includes identifying and extracting those data domains from the content that's processed.

With that in mind, Good and the Kingland Text Analytics team hosted a machine learning competition, open to students from Iowa State University, that ran from December 2018 to April 2019. The problem was to find a solution that could recover mappings from pronoun to antecedent/postcedent with the highest possible F1 score (the harmonic mean of precision and recall, a common measure of how accurate a model's predictions are).
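For readers unfamiliar with the metric, the sketch below shows one way an F1 score could be computed for pronoun mappings. The competition's exact scoring rules are not described here, so the representation is an assumption: each mapping is a hypothetical (pronoun position, referent position) pair, and a prediction counts as correct only if the exact pair appears in the gold set.

```python
def f1_score(predicted, gold):
    """Return F1, the harmonic mean of precision and recall, over two sets of pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are right
    recall = true_positives / len(gold)          # share of gold mappings that were found
    return 2 * precision * recall / (precision + recall)


# Made-up example data: (pronoun token index, referent token index).
gold = {(2, 0), (7, 5), (11, 5)}
predicted = {(2, 0), (7, 5), (11, 9)}

print(round(f1_score(predicted, gold), 3))  # 0.667
```

In this made-up example, two of three predictions are correct and two of three gold mappings are recovered, so precision and recall are both two thirds and F1 is as well.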

The problem set proved challenging: very few teams completed the competition, and two teams were announced as winners.

"We saw lots of interest and many inquiries, but it's challenging when you put this concept into action and try to build something real with it," says Good.

The two winners approached the challenge differently. One team leveraged prior research that had already been completed and wrapped a good solution around it with great documentation. The other team ... lots and lots of custom code. "They did a shotgun blast of different algorithmic approaches to solve the problem with different levels of accuracy. They then tested out multiple theories of trying to pick the best one and provide code for that," according to Good.

The two approaches were so close in accuracy that choosing a definitive winner wasn't simple. Overall, this competition was a great recruiting opportunity for Kingland to work with potential candidates and continue our partnership with ISU.

The company plans to continue collaborating with ISU to further the evolution of artificial intelligence and text analytics.

Kingland's Anaphora Challenge and Machine Learning View

Lead AI Designer Kyle Hansen believes the anaphora challenge is all about teaching machines world knowledge - non-linguistic information that helps someone interpret the meanings of words and sentences.

Hansen says people have come up with ways to label parts of a sentence based on semantic roles. For example, in a merger/acquisition corporate action, 'this' is the acquirer and 'this' is the company being acquired. Interpreting each 'this' correctly depends on the expertise of the person teaching the software.

"You have to have the best subject matter experts (SME) with the right domain experience to close the gap with rules," he says. That person can set thresholds on continuous variables that range from zero to 100, because the individual understands and sees patterns manifest in the real world.

Hansen says SME knowledge can eliminate potentially large classes of problems. For example, an SME could identify an attribute that is difficult to collect. She knows enough about the business and industry that if the attribute has changed, it should definitely be examined by a researcher before it goes into production with the customer. Without that SME knowledge, you could unknowingly put that attribute out there without understanding the potential impact of getting it wrong. That's a red flag that an SME can foresee.

But it's not necessarily one that a machine will accurately capture.

You could solve the problem by having an expert try to fill a database with facts, formalizing this knowledge as some sort of data structure, but that's a tall order, says Hansen. "You'll work on that forever and it'll never be even close to complete enough." He says the difficulty lies in how much detail you'll really need and whether the information being entered will be relevant for a specific industry or evolving use cases.

"My interest is in building world knowledge in an automated fashion and taking advantage of big data," says Hansen.

That's why Kingland tackles problems such as getting machines to accurately co-reference pronouns with nouns.

"It takes a commitment to overcome the challenges and it takes passion for how we can use it for actually solving a use case," says Good. "We don't create AI just for the sake of cranking out a great model or algorithm. We actually try to build something valuable for our clients."