Multimodal Automatic Coding of Client Behavior in Motivational Interviewing

This study develops multimodal machine learning models to automatically detect client behavioral codes in motivational interviewing sessions using BERT and VGGish encoders. The research achieves an F1-score of 0.72 for three-class classification of client utterances and explores associations between in-session language patterns and subsequent behavioral outcomes.

Published on October 22, 2020

Authors

Leili Tavabi, Kalin Stefanov, Larry Zhang, Brian Borsari, Joshua D Woolley, Stefan Scherer, Mohammad Soleymani


Abstract

Motivational Interviewing (MI) is defined as a collaborative conversation style that evokes the client’s own intrinsic reasons for behavioral change. In MI research, the clients’ attitude (willingness or resistance) toward change, as expressed through language, has been identified as an important indicator of their subsequent behavior change. Automated coding of these indicators provides systematic and efficient means for the analysis and assessment of MI therapy sessions. In this paper, we study and analyze behavioral cues in client language and speech that bear indications of the client’s behavior toward change during a therapy session, using a database of dyadic motivational interviews between therapists and clients with alcohol-related problems. Deep language and voice encoders, i.e., BERT and VGGish, trained on large amounts of data are used to extract features from each utterance. We develop a neural network to automatically detect the MI codes using both the clients’ and therapists’ language and clients’ voice, and demonstrate the importance of semantic context in such detection. Additionally, we develop machine learning models for predicting alcohol-use behavioral outcomes of clients through language and voice analysis. Our analysis demonstrates that we are able to estimate MI codes using clients’ textual utterances along with preceding textual context from both the therapist and client, reaching an F1-score of 0.72 for a speaker-independent three-class classification. We also report initial results for using the clients’ data for predicting behavioral outcomes, which outlines the direction for future work.

Key Findings

Methodology

The researchers utilized real-world motivational interviewing sessions from 219 college students dealing with alcohol-related issues, collected through IRB-approved clinical datasets with participant consent. The study employed a comprehensive multimodal approach combining advanced deep learning encoders with behavioral analysis techniques.

Data Collection and Preprocessing: The dataset consisted of audio recordings, manual transcriptions, and MISC (Motivational Interviewing Skill Code) annotations for both therapist and client utterances. Sessions averaged 49.85 minutes in length, and the dataset as a whole contains 41,494 client utterances and 51,802 therapist utterances. Speechmatics was used to align the transcripts with the audio and produce timestamps automatically.

Feature Extraction Architecture: For textual analysis, the team employed BERT (Bidirectional Encoder Representations from Transformers) to generate 768-dimensional embeddings per utterance, alongside LIWC (Linguistic Inquiry and Word Count) for interpretable psychological feature analysis. Speech processing utilized VGGish, a deep convolutional neural network pre-trained on AudioSet, generating one 128-dimensional embedding for every 0.96 seconds of audio, combined with eGeMAPS acoustic features for interpretability.
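
To make these dimensions concrete, the following sketch computes the feature shapes one utterance would produce. It is only an illustration under stated assumptions: VGGish windows are taken to be non-overlapping, a trailing partial window is dropped, and the `Utterance` container, the `feature_shapes` function, and the example utterance are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass

BERT_DIM = 768          # per-utterance text embedding size (from the summary)
VGGISH_DIM = 128        # per-frame audio embedding size (from the summary)
VGGISH_HOP_SEC = 0.96   # one audio embedding per 0.96 seconds of audio


@dataclass
class Utterance:
    """Hypothetical container for one time-aligned client utterance."""
    text: str
    start_sec: float
    end_sec: float


def feature_shapes(utt: Utterance) -> dict:
    """Return the expected embedding shapes for one utterance.

    Assumption: windows do not overlap and a trailing partial window is
    dropped, so the frame count is floor(duration / 0.96).
    """
    duration = max(0.0, utt.end_sec - utt.start_sec)
    # The small epsilon guards against floating-point error at exact multiples.
    n_frames = int(duration / VGGISH_HOP_SEC + 1e-9)
    return {"text": (BERT_DIM,), "audio": (n_frames, VGGISH_DIM)}


# Made-up example: a 3-second utterance.
utt = Utterance("I could cut back on weekends.", start_sec=12.0, end_sec=15.0)
shapes = feature_shapes(utt)
print(shapes)
```

The text side is a single fixed-size vector, while the audio side is a sequence whose length depends on the utterance duration.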

Model Development: The core architecture featured separate encoders for text (fully connected layers) and speech (single-layer GRU for sequence processing) that mapped BERT and VGGish embeddings to fixed-size 256-dimensional representations. For contextual modeling, three preceding client-therapist dialogue turns were encoded separately and concatenated with current utterance representations.
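
The data flow described above can be sketched as a shape-level forward pass. This is a minimal NumPy illustration with random weights, not the authors' implementation. Several details are assumptions made for the sketch: a single ReLU layer for the text encoder, using the final GRU state as the speech representation, omitting bias terms, representing each context turn by one BERT embedding, and reusing the same text encoder for the context turns. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
BERT_DIM, VGGISH_DIM, HID = 768, 128, 256  # sizes taken from the summary


def dense(in_dim, out_dim):
    """Random weights standing in for trained parameters."""
    return rng.normal(0.0, 0.02, (in_dim, out_dim))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


W_TEXT = dense(BERT_DIM, HID)
# One GRU layer: input weights W and recurrent weights U for the update (z),
# reset (r), and candidate (h) computations. Biases are omitted for brevity.
GRU = {g: {"W": dense(VGGISH_DIM, HID), "U": dense(HID, HID)}
       for g in ("z", "r", "h")}


def encode_text(bert_vec):
    """Fully connected text encoder: 768-d BERT embedding -> 256-d vector."""
    return np.maximum(0.0, bert_vec @ W_TEXT)


def encode_speech(frames):
    """Single-layer GRU over VGGish frames; returns the final 256-d state."""
    h = np.zeros(HID)
    for x in frames:
        z = sigmoid(x @ GRU["z"]["W"] + h @ GRU["z"]["U"])
        r = sigmoid(x @ GRU["r"]["W"] + h @ GRU["r"]["U"])
        cand = np.tanh(x @ GRU["h"]["W"] + (r * h) @ GRU["h"]["U"])
        h = (1.0 - z) * h + z * cand
    return h


def fuse(current_bert, current_frames, context_berts):
    """Concatenate current text, current speech, and context representations."""
    parts = [encode_text(current_bert), encode_speech(current_frames)]
    parts += [encode_text(c) for c in context_berts]
    return np.concatenate(parts)


# Random stand-ins for real embeddings: one utterance with 4 audio frames
# and three preceding context turns.
current_bert = rng.normal(size=BERT_DIM)
current_frames = rng.normal(size=(4, VGGISH_DIM))
context = [rng.normal(size=BERT_DIM) for _ in range(3)]
fused = fuse(current_bert, current_frames, context)
print(fused.shape)
```

Under these assumptions the fused vector has 5 x 256 = 1280 dimensions (current text, current speech, and three context turns). A classification head over the three MI codes would sit on top of it.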

Training and Evaluation Protocol: The study employed leave-one-subject-out cross-validation across the 219 sessions to ensure speaker independence. Class imbalance was addressed using a weighted cross-entropy loss with weights inversely proportional to class frequency. Models were optimized using the Adam optimizer with learning rates of 10^-3 for unimodal models and 10^-5 for fusion models, with a 10% validation holdout for model selection.
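
Two parts of this protocol, the inverse-frequency class weighting and the subject-wise splits, can be sketched in plain Python. This is an illustration under assumptions, not the paper's code. The summary only says the weights are inversely proportional to class frequency, so the specific normalisation `total / (n_classes * count)` is one common choice picked for the sketch, and the function names and toy data are hypothetical. The optimizer and the 10% validation holdout are not shown.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count): rarer classes weigh more."""
    counts = Counter(labels)
    total, k = len(labels), len(counts)
    return {c: total / (k * n) for c, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy data: six utterances, three classes (0 is the majority), three subjects.
labels = [0, 0, 0, 0, 1, 2]
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
weights = inverse_frequency_weights(labels)

# A classifier that always predicts the uniform distribution scores log(3).
uniform = [[1 / 3, 1 / 3, 1 / 3]] * len(labels)
loss = weighted_cross_entropy(uniform, labels, weights)

folds = list(leave_one_subject_out(subjects))
print(weights, round(loss, 4), folds[0])
```

Because every utterance from the held-out subject lands in the test fold, no speaker appears in both training and test data, which is what makes the reported scores speaker-independent.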

Behavioral Outcome Analysis: For predicting alcohol-related outcomes (Change in Typical Blood Alcohol Content and Change in Alcohol-Related Problems), the researchers extracted sequences of client utterances by MI code type and developed GRU-based models to capture temporal patterns within sessions.
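
The sequence-extraction step can be sketched as a simple grouping that keeps the within-session order. This is a hedged illustration only: the function name is hypothetical, the code labels "CT", "ST", and "FN" are placeholders because this summary does not name the three classes, and the strings "u1" to "u4" stand in for per-utterance feature vectors.

```python
from collections import defaultdict


def sequences_by_code(utterances):
    """Group one session's client utterances by MI code, preserving order.

    `utterances` is an iterable of (code, features) pairs in session order.
    """
    grouped = defaultdict(list)
    for code, features in utterances:
        grouped[code].append(features)
    return dict(grouped)


# Placeholder session: labels and features are made up for illustration.
session = [("FN", "u1"), ("CT", "u2"), ("ST", "u3"), ("CT", "u4")]
seqs = sequences_by_code(session)
print(seqs)
```

Each resulting list is a variable-length sequence in temporal order, which is the kind of input a GRU-based outcome model would consume.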

Impact

This research represents a significant advancement in automated therapeutic assessment with broad implications for both clinical practice and computational behavior analysis. The work demonstrates the feasibility of using advanced NLP and speech processing techniques to objectively assess therapy effectiveness and client engagement patterns.

Clinical Practice Innovation: The automated MI code detection system offers therapists and researchers an objective, efficient alternative to manual coding, which is traditionally time-consuming and subject to inter-rater variability. With an F1-score of 0.72 on the three-class task, the system is a promising step toward automated behavioral coding and could eventually support real-time feedback during therapy sessions.

Scalable Therapy Assessment: The speaker-independent model architecture allows for deployment across different therapeutic settings without requiring therapist-specific training, making it suitable for large-scale implementation in clinical practice, training programs, and research studies.

Methodological Contributions: The study’s integration of contextual information and its demonstration that the text modality outperforms speech provide useful insights for future therapeutic AI development. The finding that preceding dialogue context improves classification performance underlines the importance of conversational flow in understanding client behavior change patterns.

Research Infrastructure Development: By successfully applying state-of-the-art deep learning models (BERT, VGGish) to therapeutic interaction analysis, the work establishes a methodological framework that can be adapted to other therapy modalities and behavioral health applications.

Future Therapeutic AI Applications: The research lays groundwork for developing real-time therapy assistance tools, automated quality assurance systems for therapy training, and objective outcome prediction models that could enhance treatment planning and intervention strategies. The modest success in behavioral outcome prediction highlights important challenges and directions for future research incorporating personal and environmental factors.
