Building Hyper-Focused Unstructured Data Libraries for Clients

Books, magazines and other documents fill libraries across the globe. Staff at these libraries seek out new, interesting literary content and collect the work within their walls for use by students, scholars and professionals. Extracting relevant information and storing it in an accessible, centralized location was the challenge for the Text Analytics team at Kingland. And they wanted a name that described exactly how it worked.

Developing a solution or product name is difficult. What should it convey to users? Should you use real words with a twist? With an estimated one million English words to choose from, Chief Technology Evangelist Matt Good and team had to select the best words or phrases to simply describe how the Kingland Text Analytics Suite worked.

For Kingland, the team's decision to create personas came about organically. Good wanted names that would easily resonate with clients and reinforce the key elements of the Text Analytics suite.

Using a persona-based approach has been a successful and simple way to describe how the suite solves complex enterprise data use cases.

From Harvester to Collector
"We were originally working on the architecture of our Text Analytics offering and the logic behind it," Good says. "We wanted to answer questions such as what does the architecture look like, what would the data processing flow look like? And in the early days it became obvious that we should initially call it Harvester because we were bringing in content to get it ready for further processing."

The Harvester started out as more of a concept: a set of logical parts that the team put together. Hard coding made it work, but the suite wasn't yet using a data lake architecture efficiently, and the team knew this part of the suite did more than simply harvest content. At first it was less expansive than the team had planned, crawling only specific websites to scan for hints of corporate actions.

But the Harvester name didn't truly point to its potential capabilities. According to Good, the harvesting analogy didn't quite fit. In farming, crops are harvested and processed, then used or delivered elsewhere before they are eventually consumed. The end consumer in that analogy doesn't have access to what was originally gathered - only the end results of the process. The Text Analytics suite the team was creating works differently: it keeps all the gathered content for further processing and analysis. Ultimately, proper access to all data is important. A recent Forbes article noted that a 10 percent increase in data accessibility will result in more than $65 million of additional net income for a typical Fortune 1000 company.

Despite struggling to come up with a better term, the team continued to work through the architecture and began to notice logical groups of capabilities or componentry that would support bringing content in, processing it, and delivering it to consumers. In this instance, consumers could be client systems, UIs, other callers of APIs, etc.

The team added features such as the ability to add a new entity for monitoring, or the new sources required for that entity, within 24 hours. It also improved the corpus to include sources that cover all entities of interest. These features and more - reading data from RSS feeds, processing more challenging content within PDFs, and even reading data in an image format - morphed what started out as The Harvester into The Collector.

The Collector name was better because it logically fit what the team was trying to do by bringing in content and leveraging a fully-built data lake architecture for content management. "Think of the lake as a vault," Good says. "You have a place where you can reliably manage and store your unstructured data and metadata in a secure environment." In many ways, this is similar to a bank where money is reliably and securely stored, managed and organized.

Kingland expanded its capabilities by fully embracing the cloud for its limitless capacity, deployment speeds and reliable and auditable security capabilities. By optimizing the Text Analytics suite for cloud deployment, Kingland enhanced its ability to bring any amount of content in and manage it in a way that's needed for further downstream processing. This cloud-optimized ability to bring content in for reliable management continues to be a critical need, as one-third of all data will exist in or pass through the cloud by 2020, according to reports.

Growing Importance of Unstructured Data
IDC predicts 80 percent of worldwide data will be unstructured by 2025. One improvement in The Collector over the earlier Harvester generation is that it enables a downstream, data lake-aware component design. These components take compatible input documents and group them into chunks of text or cleansed sentences that Natural Language Processing modules can use as input. This capability also supports the necessary conversion of HTML, text, and PDF files.
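To make the idea of "chunks of text or cleansed sentences" more concrete, here is a minimal Python sketch of that kind of preparation step. It is not Kingland's implementation. The function names (cleanse, split_sentences, chunk_sentences), the regex-based tag stripping, the naive sentence splitting and the max_chars limit are all assumptions chosen for illustration. A production pipeline would likely use a real HTML parser and a proper sentence tokenizer.

```python
import html
import re

# Hypothetical helpers for illustration only; not part of any Kingland product.
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def cleanse(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return WHITESPACE_RE.sub(" ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Naively split on whitespace that follows '.', '!' or '?'.

    Abbreviations such as "Inc." will be split incorrectly; this is a sketch.
    """
    return [s for s in SENTENCE_END_RE.split(text) if s]


def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        # Account for the joining space when the chunk already has content.
        added = len(sentence) + (1 if current else 0)
        if current and length + added > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
            added = len(sentence)
        current.append(sentence)
        length += added
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    page = "<p>Acme &amp; Sons announced a merger.</p>\n<p>Shares   rose!</p>"
    for chunk in chunk_sentences(split_sentences(cleanse(page)), max_chars=40):
        print(chunk)
```

Keeping whole sentences together inside each chunk is one reasonable design choice here, since downstream NLP modules generally work better on complete sentences than on text cut at arbitrary character offsets.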

PDFs in particular can be troublesome. Captions, figures, headers, tables and footer data are all features that cause problems for extracting data from PDFs. The Collector can leverage the downstream data lake-aware components for converting PDF documents into machine-readable form, enabling further processing and text extraction by another persona - The Scholar. As part of this hand-off to downstream components in the processing architecture, OCR can be leveraged as well to help identify data and tables that are traditionally more difficult for software than humans to find.

"If you think about The Collector, it doesn't really care what the content is," Good says. "It just wants to bring it in, store it, manage and secure it." What's important is that it's designed and implemented in a way that can always associate metadata with content, enabling powerful search and retrieval of the right content, at the right time.
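The principle of always associating metadata with content can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not a description of how The Collector or its data lake is built. The ContentVault class, its store and find methods, and the metadata keys (source, entity, media_type) are all hypothetical names invented for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredDocument:
    """A piece of content that always travels with its metadata."""
    doc_id: str
    content: bytes
    metadata: dict = field(default_factory=dict)


class ContentVault:
    """Hypothetical in-memory stand-in for a content store with metadata."""

    def __init__(self) -> None:
        self._documents: dict[str, StoredDocument] = {}

    def store(self, content: bytes, **metadata: str) -> str:
        """Store raw bytes of any type and return a content-derived ID.

        The store does not inspect the content. Storing identical bytes
        again replaces the earlier entry and its metadata.
        """
        doc_id = hashlib.sha256(content).hexdigest()
        self._documents[doc_id] = StoredDocument(doc_id, content, dict(metadata))
        return doc_id

    def find(self, **criteria: str) -> list[StoredDocument]:
        """Return every document whose metadata matches all given criteria."""
        return [
            doc
            for doc in self._documents.values()
            if all(doc.metadata.get(key) == value for key, value in criteria.items())
        ]


if __name__ == "__main__":
    vault = ContentVault()
    vault.store(b"%PDF-1.4 sample bytes", source="rss", entity="Acme",
                media_type="application/pdf")
    vault.store(b"<html>sample page</html>", source="crawler", entity="Acme",
                media_type="text/html")
    for doc in vault.find(entity="Acme", media_type="text/html"):
        print(doc.doc_id[:12], doc.metadata)
```

The store treats content as opaque bytes, which mirrors the point that the collecting layer does not care what the content is. Retrieval relies entirely on the metadata recorded at ingestion time.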

Essentially, The Collector wants to retrieve and hold as much content as possible - like a library - for hyper-focused use cases.

Look for the story behind The Scholar persona next.