The Scholar and its Role in the Kingland Text Analytics Platform Suite

Matt Good
7/10/18 11:10 AM

My Dad’s favorite advice to me growing up was “Be a Gentleman and a Scholar.”  If I didn’t hear that daily, I heard it at least once a week. During the 22 years of my kindergarten-through-college student life, it was a very quick way of saying:

  • “Be polite and good to other people.” (i.e. Gentleman)
  • “Work your butt off on your studies in school.” (i.e. Scholar)

After diving deeper into the Cognitive Collector persona and components from our Cognitive Computing Platform suite, let's transition to the Cognitive Scholar persona and components. The Scholar was inspired by the insatiable quest for knowledge inherent in the best lifelong students. Its role within our Cognitive Platform is to read and "study" all unstructured content (data) that has been collected by our Collector components.

[Image: The Scholar persona from the Kingland Cognitive Suite]

Before we get into the details of the specific AI and Machine Learning (ML) techniques the Scholar leverages for reading and studying content, let's take a look at the process flow and scalability we'd expect from this persona. The needs of the most studious Scholars are quite simple: reliable access to a near-infinite amount of content to study, and then the capacity to read, understand, and otherwise "process" that content. Treating these simple needs as our essential non-functional requirements, we can see how the architecture and AWS cloud optimization take shape:

  • The Collector’s data lake is leveraged as the input and output source of the Scholar. The data lake provides input to the Scholar such as raw content, basic metadata, any pre-existing tags and annotations on the content (e.g. from prior Scholar processing), and ML training data – all for the Scholar’s processing benefit. Output to the lake includes additional tags, annotations, derived data and metadata resulting from the Scholar’s processing of the content. This output is immensely beneficial to the data discovery, re-training and decision-making processes of our downstream Inventor and CEO personas in our Cognitive Platform's pipeline processing, as well as to the iterative process of improving upon prior Scholar processing.
  • AWS Elastic Container Service (ECS) is leveraged to orchestrate the individual Docker containers that perform Scholar processing functions. Utilizing Docker containers allows the Scholar to compartmentalize, compose, and extend the different capabilities of its specific AI and ML techniques for processing the content (more on this below). AWS ECS provides the management, scalability, and performance we need in orchestrating the individual Scholar containers. A minimal sketch of this dispatch-and-annotate flow follows this list.
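
To make the flow above concrete, here is a minimal sketch (in Python, using boto3) of how a dispatcher might hand a piece of collected content to a Scholar container on ECS, and how a Scholar might write its annotations back to the data lake. The bucket, cluster, task definition, container, and key names are hypothetical placeholders for illustration, not the platform's actual configuration:

    import json
    import boto3

    # Hypothetical names for illustration only; the platform's actual bucket,
    # cluster, and task definition names are not described in this post.
    DATA_LAKE_BUCKET = "cognitive-data-lake"
    SCHOLAR_CLUSTER = "scholar-cluster"
    SCHOLAR_TASK_DEF = "scholar-processing:1"

    s3 = boto3.client("s3")
    ecs = boto3.client("ecs")

    def read_content(content_key):
        """Fetch raw collected content from the data lake."""
        obj = s3.get_object(Bucket=DATA_LAKE_BUCKET, Key=content_key)
        return obj["Body"].read()

    def dispatch_scholar_task(content_key):
        """Start one containerized Scholar task on ECS, telling it where the
        content lives via environment variable overrides."""
        ecs.run_task(
            cluster=SCHOLAR_CLUSTER,
            taskDefinition=SCHOLAR_TASK_DEF,
            overrides={
                "containerOverrides": [{
                    "name": "scholar",
                    "environment": [
                        {"name": "CONTENT_BUCKET", "value": DATA_LAKE_BUCKET},
                        {"name": "CONTENT_KEY", "value": content_key},
                    ],
                }]
            },
        )

    def write_annotations(content_key, annotations):
        """Write the Scholar's tags and annotations back to the data lake for
        downstream Inventor/CEO processing and iterative re-training."""
        s3.put_object(
            Bucket=DATA_LAKE_BUCKET,
            Key="annotations/{}.json".format(content_key),
            Body=json.dumps(annotations).encode("utf-8"),
        )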

Of course, not all Scholar processing is considered equal. Even with access to content (the Collector's data lake) and processing capacity (AWS ECS) covered, the extent to which each Scholar container conducts processing can vary widely. Imagine pitting Will from Good Will Hunting against a high school sophomore with a 3.8 GPA to process the same content. There would be quite a disparity in their respective areas of success. Following this inspiration leads us not only to the AI and ML techniques leveraged for Scholar processing, but also to the pre-processing techniques utilized to enable AI/ML success.

Examples of Scholar processing include:

  • HTML, PDFs, and scanned document images are the most common formats of content collected by the Collector, and they generally require pre-processing. HTML is the easiest to process, followed by PDFs and then scanned images in increasing order of complexity. For PDFs, we leverage custom Kingland Cognitive Platform technology paired with open source capabilities such as PDFMiner to extract content. For scanned document images, we have to perform optical character recognition (OCR) to extract the content, again pairing custom Cognitive Platform technology with open source capabilities such as Tesseract. For all formats, the goal is to pre-process and extract the content into a common format for downstream Scholar processing (see the first sketch after this list).
  • Large documents are sectioned into usable chunks, where appropriate. We've found this to be especially important in processing regulatory filings, where logical chunks of content can be sectioned from large documents to enable more efficient and accurate processing. One example involves a mutual fund prospectus filed with the SEC for Voya Target Retirement Funds – this single ~300-page document can be sectioned per fund to better prepare for associating extracted data with the correct Voya Fund (see the sectioning sketch after this list).
  • Natural Language Processing (NLP) and Named Entity Recognition (NER) – core capabilities within the AI and ML realm – are executed on the pre-processed textual content to gain the kind of Scholarly intelligence we'd expect from reading and studying the content. For NLP, we leverage custom Cognitive Platform technology paired with spaCy to get all of the programmatic access to textual content that we need, including reliable part-of-speech tagging, dependency parses, tokenization, sentence segmentation, and more. NER follows suit with a pairing between our Cognitive Platform and spaCy, this time leveraging spaCy's companion tool Prodigy to train language models tailored for our Kingland Platform data domains related to Entities, People, and Products (see the NLP/NER sketch after this list).
  • Table processing within content presents a particular challenge. While this type of semi-structured content is easy for the human eye to discern and consume, it can trip up even the most sophisticated NLP and NER execution. Table processing within HTML can be a little more straightforward, since native table tagging within the HTML source gives us hints about the content within. Without such native tagging in PDFs and images, we're left to a novel Cognitive Scholar approach – training ML models to become adept at identifying where tables appear in a document, enabling further data extraction. We've had success training convolutional neural nets (CNNs) on color-coded images of documents that highlight the occurrence and structure of tables, allowing us to better pinpoint the data within (see the CNN sketch after this list).
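
First, a minimal sketch of the open source side of the extraction pairing described above, using pdfminer.six for digital PDFs and Tesseract (via pytesseract) for scanned images. The file names are placeholders, and the custom Cognitive Platform pieces are of course not shown:

    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    from PIL import Image                         # pip install pillow
    import pytesseract                            # pip install pytesseract

    def extract_pdf_text(path):
        """Extract the embedded text layer from a digital PDF."""
        return extract_text(path)

    def extract_scanned_text(path):
        """OCR a scanned document image with Tesseract."""
        return pytesseract.image_to_string(Image.open(path))

    # Whatever the source format, the goal is one common textual
    # representation for downstream Scholar processing.
    text = extract_pdf_text("prospectus.pdf")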
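
Next, a simplified illustration of sectioning a large filing into per-fund chunks. A production sectioner would be far more robust; this sketch (with made-up fund names and text) simply locates each fund heading's first occurrence and slices the text between occurrences:

    import re

    def section_by_fund(text, fund_names):
        """Split one large filing into per-fund sections by heading position."""
        # Locate the first occurrence of each fund's heading in the filing.
        starts = []
        for name in fund_names:
            match = re.search(re.escape(name), text)
            if match:
                starts.append((match.start(), name))
        starts.sort()

        # Slice the text from each heading to the next (or to the end).
        sections = {}
        for i, (start, name) in enumerate(starts):
            end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
            sections[name] = text[start:end]
        return sections

    filing = (
        "Voya Target Retirement 2030 Fund\nFees and expenses...\n"
        "Voya Target Retirement 2040 Fund\nFees and expenses...\n"
    )
    chunks = section_by_fund(filing, ["Voya Target Retirement 2030 Fund",
                                      "Voya Target Retirement 2040 Fund"])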
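
For the NLP/NER step, a short spaCy example shows the kind of programmatic access described above. A stock English model stands in here for the domain-tailored models trained with Prodigy:

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Voya Investment Management filed a prospectus with the SEC.")

    # Sentence segmentation, tokenization, part-of-speech tags, dependencies.
    for sent in doc.sents:
        for token in sent:
            print(token.text, token.pos_, token.dep_, token.head.text)

    # Named entities recognized by the model.
    for ent in doc.ents:
        print(ent.text, ent.label_)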
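
Finally, a generic sketch of the table-detection idea, written with PyTorch. The actual architecture, input encoding, and color-coding scheme used by the Scholar are not described in this post; this only illustrates a small CNN classifying page-image patches as table vs. not-table:

    import torch
    import torch.nn as nn

    class TableDetector(nn.Module):
        """A small CNN that classifies fixed-size page-image patches as
        'table' vs. 'not table'. Generic sketch, not Kingland's model."""

        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 32 * 32, 2),  # assumes 128x128 input patches
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = TableDetector()
    patch = torch.randn(1, 3, 128, 128)  # one RGB page-image patch
    logits = model(patch)                # scores for [not-table, table]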

Compartmentalizing each of these difficult techniques allows the Scholar containers to focus on (and improve in) their areas of expertise, rather than getting overwhelmed by the totality of reading and studying vast amounts of content. To close with a fun parallel back to Will in Good Will Hunting: there are excellent examples throughout the movie that put his nearly unprecedented Scholarly capabilities on display. However, anyone who has seen the movie knows that his capabilities needed the proper guidance and influence around him to reach a new level of rewarding potential. Similarly, our Scholar persona and components need additional guidance from the Cognitive Platform's Inventor and CEO. Join me next time as we dive deeper into the next component persona of our Cognitive Platform – The Inventor!
