The Inventor and its Role in the Kingland Text Analytics Platform Suite

Matt Good
7/17/18 6:45 AM

One of the greatest movies about invention over the past 35 years has to be “Back to the Future.” Who could forget time travel via the flux-capacitor-equipped DeLorean in this pop-culture classic? The thrill of discovery and persistence through initial failures drove Doc Brown’s pursuit of his invention throughout the entire movie.

The Inventor Persona of the Kingland Cognitive Suite

The Inventor, our third text analytics persona in this series, was inspired by that same thrill of discovery evident in Doc Brown’s passion for invention, with even more relevant Kingland inspiration coming from our passion for data. The Inventor’s role within our Text Analytics Platform is to discover anything of interest in a “mountain” of data: trends, relationships, quality issues, and more.

When compared to the other Kingland text analytics personas and components, the Inventor is a bit of an outlier within the Text Analytics Platform, in the sense that it can often be an out-of-process data discovery mechanism separate from a traditional Collector->Scholar->CEO content processing pipeline. It can also be considered an outlier in the sense of Malcolm Gladwell’s book “Outliers,” and more specifically its 10,000-hour rule of practice for becoming world-class in a field. In our scenario of data discovery with the Inventor, the 10,000 hours equate to data: the more data the Inventor has to “practice” over, the better its results will be, and the more opportunity it will have to turn its data discoveries into yet another module within the Text Analytics Platform’s hardened processing pipeline.

How Does The Inventor Work?

Pairing Kingland’s best data analysts, scientists, and engineers with the right techniques and technology gives us the best Inventor results to leverage within our Text Analytics Platform. In the past, the “technology” usually meant complex SQL executed against a relational database. Today and going forward, it means applying the AI and ML techniques and Python libraries best suited to the discovery needs at hand. In some cases, third-party data science and ML platforms, such as KNIME and the new AWS SageMaker platform, can accelerate discovery.

As an example, let’s take a look at a client data management use case near and dear to Kingland’s heart – Data Quality Remediation:

  • Data Quality Remediation, in the Past:  A client came to us with a data quality challenge over their 20+ million customer/counterparty data records. With all of our prior data quality remediation experience (albeit on smaller data sets), we paired our best data quality experts with our best and most complex SQL engineering and got to work on the data. It wasn’t long before we were finding data quality issues, trends in missing data, and lots of duplicate data. The quality remediation process was highly iterative, leading to a series of data quality discovery “inventions” that would accelerate remediation in future iterations and on future data sets. However, there are limits to what traditional SQL engineering can remediate over this larger volume of records, which leads us to...
  • Data Quality Remediation, Today and in the Future:  Our expertise and experience from the past have led to the Inventor-style “inventions” that continue to become more commonplace within our data processing architecture and pipelines. Additionally, more accessible machine learning techniques and technology have enabled a deeper level of data quality remediation than was possible in the past. Instead of relying on complex SQL engineering over the client’s 20+ million data records, we can use a combination of machine learning capabilities to go further, for example:
  • Unsupervised machine learning models and algorithms such as general clustering, dimensionality reduction, and autoencoders (all leveraging Python libraries) help data quality remediation efforts by finding the data quality trends requiring additional focus, beyond what was possible with traditional SQL engineering (see the first sketch after this list).
  • Matching and de-duplication remediation efforts can be enhanced with a novel combined approach of bag-of-suffixes vectorization and k-nearest-neighbors clustering, again leveraging available Python libraries (see the second sketch after this list). Results from this enhanced approach can be compared with traditional, SQL-based matching/de-duplication, providing the best opportunity for accurate and continually improving results.
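
To make the first bullet concrete, here is a minimal sketch of clustering per-record quality features, assuming scikit-learn (one of the Python libraries alluded to above) and a hypothetical input file of pre-derived numeric features; the file name, feature set, and cluster count are all illustrative, not Kingland's production configuration:

```python
# Minimal sketch: clustering quality features to surface remediation targets.
# Assumes a CSV of hypothetical, pre-derived numeric features per record,
# e.g. missing-field counts, format-violation flags, field-length stats.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

records = pd.read_csv("quality_features.csv")  # hypothetical input file

# Standardize, then reduce dimensionality (keeping ~95% of the variance)
# so clusters reflect broad quality patterns, not one noisy feature.
X = StandardScaler().fit_transform(records)
X_reduced = PCA(n_components=0.95).fit_transform(X)

# Cluster the records; unusually small clusters often mark systemic
# quality issues worth a focused remediation pass.
records["quality_cluster"] = KMeans(
    n_clusters=8, n_init=10, random_state=42
).fit_predict(X_reduced)
print(records["quality_cluster"].value_counts())
```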
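
And for the second bullet, a minimal sketch of the k-nearest-neighbors side of the de-duplication approach; character n-gram TF-IDF stands in here for the suffix-based vectorization described above, and the sample names are invented for illustration:

```python
# Minimal sketch: duplicate-candidate generation via character-level
# vectorization plus k-nearest neighbors (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

names = [  # invented sample counterparty names
    "Acme Holdings LLC",
    "ACME Holdings, L.L.C.",
    "Globex Corporation",
    "Globex Corp",
]

# Vectorize by character n-grams so spelling and punctuation variants
# land near each other in vector space.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(names)

# For each record, find its nearest neighbor; small cosine distances
# flag candidate duplicates for review or automated matching rules.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors(vectors)
for i, (dist, idx) in enumerate(zip(distances[:, 1], indices[:, 1])):
    print(f"{names[i]!r} ~ {names[idx]!r} (distance {dist:.2f})")
```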

In another example, more closely related to our Text Analytics Platform’s processing of unstructured content, we paired engineers with NLP techniques such as topic modeling via Latent Dirichlet Allocation (LDA) to find similarities across thousands of university research papers. We were able to highlight the topics the papers had in common, group the papers by topical similarity, and open the opportunity to find common reference citations. This type of topic modeling has also proven valuable in use cases such as news monitoring and risk surveillance, where the model can provide a quick breakdown of the topics within any given document or article, driving efficiency into the monitoring and surveillance processes. In our Text Analytics Platform’s pipeline of unstructured content processing, this type of content similarity analysis via topic modeling can serve as a valuable module for classifying content.
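
A minimal sketch of this kind of topic modeling, using scikit-learn's LDA implementation over a toy corpus (a real pipeline would run over thousands of documents with tuned vocabulary and topic counts):

```python
# Minimal sketch: LDA topic modeling over a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # invented stand-ins for paper abstracts or news articles
    "Neural networks improve image classification accuracy",
    "Deep learning models advance image recognition benchmarks",
    "Monetary policy shifts influence inflation and interest rates",
    "Central bank interest rate decisions respond to inflation",
]

# LDA expects raw term counts, not TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(counts)  # per-document topic mixture

# Print the top words for each discovered topic; a document's dominant
# topic can then drive grouping or content classification downstream.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```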

In both examples, we see a theme of pairing Kingland team members with the techniques and technology best suited to a given data discovery task. Our best Inventor results, like the discoveries from these examples, can become hardened modules within our overall Text Analytics Platform processing pipeline, provided they have the time and data needed to realize their greatest benefit. The greatest inventions are no different; they need time and data to become production-ready. (I’m still waiting for a reliable hoverboard, let alone a time-traveling DeLorean!) A key theme that resonates throughout our Text Analytics Platform’s personas is how they work together and complement each other to form a complete solution.

Join me next time as we dive into the final component persona of our Text Analytics Platform, the one responsible for making the big decisions – The CEO!
