Using Text Analytics to Reveal Collusion in the Classroom

A large data project came to Kingland in a roundabout way a while back.

Just off the corner of the Iowa State University campus, Professor Brad Shrader was talking with team members from Kingland and other ISU faculty about research projects. His team was working on two projects: a collusion study involving numerous interviews, and a governance research project that required organizing data from 10,000 research papers.

When Shrader mentioned how much time this work could take, Kingland Chief Strategy Officer Tony Brownlee's eyes widened as he gave a nod to CEO David Kingland and said, "I think we can help." Brownlee shared how Kingland was using Text Analytics software to automatically and quickly discover specific words or data sets.

The company has used its AI and Text Analytics capabilities to mine and update data from approximately 27,000 securities, covering more than 5 million cumulative data points, with an accuracy rate above 99 percent.

The Project
Shrader was interested in Latent Dirichlet Allocation (LDA). LDA is a Text Analytics technique that groups words that tend to appear together into topics, treating each document as a mixture of those topics. For example, say you have 50 documents and ask for 10 topics: the analysis places the corresponding words or phrases in each topic, or bucket. Going further, the technique can tell you how strongly each document pertains to each topic. For the collusion project, Shrader wanted to discover the underlying themes from the text of 26 student interviews.
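To make the idea concrete, here is a minimal sketch of LDA fitted with a small collapsed Gibbs sampler in plain Python. It is only an illustration, not the software Kingland used: the function name fit_lda, the hyperparameters alpha and beta, and the four toy "documents" are all assumptions invented for this example, not text from the interviews.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # words per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly resample each word's topic given all the other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # How strongly each document pertains to each topic (each row sums to 1).
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    # The most frequent words currently assigned to each topic.
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(num_topics)
    ]
    return theta, top_words


# Invented toy documents, already split into words (hypothetical data).
docs = [
    "shared answers copied homework group".split(),
    "copied answers shared group chat".split(),
    "studied alone library notes exam".split(),
    "library notes studied exam alone".split(),
]

theta, top_words = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
for d, row in enumerate(theta):
    print("document", d, [round(p, 2) for p in row])
```

The two outputs correspond to the two things described above: top_words shows which words landed in each topic, and each row of theta shows how strongly one document pertains to each topic. A real project would more likely rely on an established library than on a hand-written sampler.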

The biggest hurdle Shrader faced was extracting and analyzing unstructured data.

"Individuals have to go through and pull out sections of text," he said. The average person can read a 400-page document in seven hours, but the person's retention of information within such a document would certainly be questionable. Add thousands of documents and it's easy to see this is "very labor intensive, and we'd also have to assemble the information to our liking, which could take months or longer."

On top of that, Kingland had to decipher the differences in language between the colluders and non-colluders and settle on a number of topics. This is difficult because the topic structure is unknown in advance. The algorithm sees only the documents and their words, and it must infer the topics from those words.

As the algorithm works through the documents, the topics it produces have to be checked for coherence. The grouping rests on probability rather than on explicit identification.
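One common way to put a number on coherence is the UMass measure, which rewards topics whose top words tend to appear in the same documents. The source does not say which check Kingland used, so the sketch below is only an assumption about how such a check could look; the function name umass_coherence and the toy documents are invented for illustration.

```python
import math


def umass_coherence(top_words, docs):
    """UMass coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs.

    D(...) counts the documents that contain all of the given words.
    Every word in top_words is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Invented toy documents, already split into words (hypothetical data).
docs = [
    "shared answers copied homework group".split(),
    "copied answers shared group chat".split(),
    "studied alone library notes exam".split(),
    "library notes studied exam alone".split(),
]

# Words that occur together should score higher than words drawn from unrelated documents.
coherent_score = umass_coherence(["copied", "answers", "shared"], docs)
mixed_score = umass_coherence(["copied", "library", "exam"], docs)
print(coherent_score > mixed_score)
```

Scores like this can be compared across runs with different numbers of topics, which is one way a team could settle on a topic count when the structure is not known in advance.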

Fortunately for Shrader and Kingland, the team determined the most relevant words that were exclusive to five topics. The algorithm correctly identified who had and hadn't colluded by sorting phrases into the selected topics. And what did Shrader discover?

The Findings

He said one of the big findings has to do with teams. Whether it's in academia or in business, "there is a dark side. You have to be careful when creating them. There's potential for teams to become confused and not do their own work when they're supposed to." He added that perhaps, in business and in academia, we're creating a context that fosters cheating.

For Kyle Hansen, a lead engineer on the AI team at Kingland, the business of topic modeling is still relatively new. He said there's potential for topic modeling to be used in the retail space, for example. He explained that retailers could use the technique to understand whether a customer is talking about a specific company promotion, by pulling text from social media posts that relate to the targeted promotion. And for other industries such as finance, it's a way to classify documents, from annual reports to a 10-K, 10-Q and more.

With more than 2.5 quintillion bytes of data created each day, Hansen believes that discovering abstract topics in this sea of information is the key to uncovering hidden semantic structures and to building technology that can accurately connect and distinguish words with similar or multiple meanings.