How The Scholar Scaled from Processing Thousands to Millions of Entities

Everyone thinks their baby is beautiful.

It was 2010, and machine learning, artificial intelligence and natural language processing were buzzworthy terms. IBM Watson was about to make history on Jeopardy and Kingland was dipping its toe into the waters of this technology.

At Kingland, "we were working on a text analytics solution the company internally named Eureka," says Chief Technology Evangelist Matt Good. "The AI team hadn't yet created personas to describe how different aspects of the suite worked, but were focusing on annotating things in an unbiased fashion to allow downstream decision making."

Good's team eventually settled on calling this persona The Scholar. He likens The Scholar to someone in a legal office who highlights important parts of a document and then hands off the relevant information to a decision maker.

At the time, The Scholar used string-matching algorithms to find text. "We stripped out all the HTML formatting from the content and used sub-string matching to find event keywords and company names," says Good.
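To make that first approach concrete, here is a minimal sketch of HTML stripping followed by sub-string matching. It is an illustration only, not Kingland's code: the names TextExtractor, strip_html and find_mentions are hypothetical, and the case-insensitive comparison is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags around them (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are discarded.
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def find_mentions(text, names):
    """Return every name that occurs as a sub-string of text.

    Assumption: matching ignores case. One scan of the text per name.
    """
    lowered = text.lower()
    return [name for name in names if name.lower() in lowered]


page = "<p>Acme Corp announced a <b>merger</b> with Globex.</p>"
text = strip_html(page)
mentions = find_mentions(text, ["Acme Corp", "Globex", "Initech"])
```

Each name costs one pass over the document, so the work grows in step with the size of the name list. That is the CTRL+F quality that made this approach hard to scale.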

The Scholar was tasked with searching for 400,000 legal entities (e.g. companies) in every document, a herculean task for a human and a demanding one even for a machine. Searching for nearly half a million entity names with an approach you might liken to CTRL+F was time consuming and difficult to scale.

"I remember being told from the analysis side of the team that 400,000 entity names is nice, but we want it to do 5 million," says Lead AI Designer Kyle Hansen. The algorithmic challenge of searching for millions of names is significant.

Hansen recalls the team spinning up memory to gain CPU. "Basically, we were building a huge machine, feeding it a document and if anything in the document matched any of the pre-determined entity names - the machine would light up. It's like reading a document to a room full of people who each listen for their company's name, and if they hear it, they raise their hand to record a mention."

Improving the way The Scholar processed data was important to reach the goal of 5 million names. The team augmented sub-string matching with a 'sliding window' approach, identifying sentences within paragraphs. The Scholar searched for any text that began with an upper-case letter and ended in a period, question mark or exclamation mark, followed by a space and an upper-case letter. This was an improvement on the initial approach, but it was limited in its ability to obtain insightful results from RSS feeds and HTML documents. It also increased the amount of 'noisy data', which in turn increased the time end users took to review and further process the data.
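The sentence rule described above can be sketched as a single regular expression. This is an assumed reconstruction of the rule, not the team's code: the names SENTENCE and split_sentences are hypothetical, and accepting end-of-text as a sentence boundary is an addition so that the final sentence is not lost.

```python
import re

# A sentence starts with an upper-case letter and runs to the first period,
# question mark or exclamation mark that is followed by a space and another
# upper-case letter. Assumption: end of text also closes a sentence.
SENTENCE = re.compile(r"[A-Z].*?[.?!](?= [A-Z]|$)", re.DOTALL)


def split_sentences(text):
    """Return the sentences found by the punctuation-and-capital rule."""
    return [match.group(0) for match in SENTENCE.finditer(text)]


clean = split_sentences(
    "The deal closed Monday. Shares rose 4 percent! Will regulators object?"
)

# Abbreviations satisfy the same rule, so they are split in the wrong place.
noisy = split_sentences("Acme Inc. Chairman resigns.")
```

The second call returns "Acme Inc." and "Chairman resigns." as separate sentences, a small example of how a purely typographic rule can produce the kind of noisy output that reviewers then have to clean up.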

At the time, the team was experiencing a 35 percent accuracy rate, but believed they could increase accuracy to 55 percent by implementing an update to how the suite segmented sentences. Segmentation moved the company closer to creating something that not only read content, but understood the context. This also solved the RSS feed and HTML challenge.

The team used named entity recognition (NER) to tackle scale and accuracy challenges. NER involves locating named entity mentions in unstructured text and classifying them into pre-defined categories such as people, organizations, locations, dates, and more.

The AI and analysis teams figured out that instead of looking for 5 million pre-determined names, "we can just look for entities and then see how those relate to the domains we're interested in," says Hansen. "That was a game changer because it means you can find events for companies that maybe you weren't looking for, but probably should've been."
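The shift Hansen describes, finding entities first and relating them to known names second, can be sketched as follows. This is a toy under stated assumptions: the capitalized-word pattern is a crude stand-in for a trained NER model, and CANDIDATE, extract_candidates and link_entities are hypothetical names, not part of Kingland's suite.

```python
import re

# Crude stand-in for a trained NER model: runs of capitalized words.
# A real model would use context, not just capitalization.
CANDIDATE = re.compile(r"[A-Z][A-Za-z&]*(?: [A-Z][A-Za-z&]*)*")


def extract_candidates(text):
    """Return candidate entity mentions found in the text."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]


def link_entities(text, known_entities):
    """Split candidates into names already tracked and names that are new.

    known_entities is a set, so each lookup is a constant-time check
    regardless of how many names are being tracked.
    """
    known, unknown = [], []
    for candidate in extract_candidates(text):
        if candidate in known_entities:
            known.append(candidate)
        else:
            unknown.append(candidate)
    return known, unknown


tracked = {"Acme Corp"}
known, unknown = link_entities(
    "Acme Corp fell after Globex Holdings disclosed a rival bid.", tracked
)
```

Here the work follows the length of the document rather than the length of the name list, and "Globex Holdings" surfaces even though nobody asked for it, which is the "companies you weren't looking for" effect in the quote.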

With no prior ideas about what to search for, The Scholar was now retrieving information and highlighting entities and people in unstructured content with greater accuracy. Insight started to come in from domains that clients and teams hadn’t previously paid attention to.

"We've seen stock language models, which are state of the art, grade out with an accuracy rate around 59 percent when tested on corporate/finance news and event content," says Hansen. With careful training, Kingland's custom language models have achieved almost 90 percent accuracy, he says.

"We started with corporate actions, moved to attribute detection and now we're creating fertile ground for a much broader domain of potential use cases we want to solve," says Good. "In fraud detection for example, The Scholar could provide a lot of data points without making any judgments and another module could identify a suspicious pattern in this data."

The Scholar can now detect content formats, classify the languages used throughout a document, extract isolated text and classify what type of text it is, and execute pre-processing such as OCR, text isolation and table isolation. Those abilities have improved how others see the AI team's baby.

Stay tuned for a look back at the final persona of the Kingland Text Analytics Platform Suite - The CEO - next month.