Analyzing Unstructured Text with Probabilistic Topic Models and Applications to Computational Psychotherapy Research

Mark Steyvers, David Atkins, Zac Imel, and Padhraic Smyth

Abstract

Automatically understanding large collections of unstructured text is a challenge across many disciplines. Consider being given a large set of emails, reports, web pages, or transcripts and wanting to quickly grasp the key information in these documents. What are these documents about? Who is mentioning which topics? Which topics are typical of certain groups of people associated with the documents? Unsupervised learning techniques, such as statistical topic models, make it possible to extract this information automatically from sets of documents without using predefined categories. In this talk, I'll review the basic topic model and a variety of extensions, such as the Labeled Latent Dirichlet Allocation model (Labeled LDA), a machine learning approach for multi-label document classification. This model allows us to analyze the connections between the words in documents and the sets of content labels associated with those documents. We demonstrate the utility of the model on a large collection of transcripts from psychotherapy sessions between patients and therapists. Each session is associated with labels describing the subject of conversation as well as the symptoms displayed by the patient. We assess the predictive accuracy of the model in assigning labels to new sessions. We also assess the ability of the model to find (local) talk turns in the dialogue that are representative of the (global) session-level label assignments. Overall, this computational approach to psychotherapy research has the potential to scale up the study of psychotherapy to thousands of sessions at a time.