Authors
Leili Tavabi, Kalin Stefanov, Larry Zhang, Brian Borsari, Joshua D Woolley, Stefan Scherer, Mohammad Soleymani
Abstract
Motivational Interviewing (MI) is defined as a collaborative conversation style that evokes the client’s own intrinsic reasons for behavioral change. In MI research, the clients’ attitude (willingness or resistance) toward change, as expressed through language, has been identified as an important indicator of their subsequent behavior change. Automated coding of these indicators provides a systematic and efficient means for the analysis and assessment of MI therapy sessions. In this paper, we study and analyze behavioral cues in client language and speech that bear indications of the client’s behavior toward change during a therapy session, using a database of dyadic motivational interviews between therapists and clients with alcohol-related problems. Deep language and voice encoders, i.e., BERT and VGGish, trained on large amounts of data are used to extract features from each utterance. We develop a neural network to automatically detect the MI codes using both the clients’ and therapists’ language and clients’ voice, and demonstrate the importance of semantic context in such detection. Additionally, we develop machine learning models for predicting alcohol-use behavioral outcomes of clients through language and voice analysis. Our analysis demonstrates that we are able to estimate MI codes using clients’ textual utterances along with preceding textual context from both the therapist and client, reaching an F1-score of 0.72 for a speaker-independent three-class classification. We also report initial results for using the clients’ data for predicting behavioral outcomes, which outlines the direction for future work.
Key Findings
- Superior Text Performance: Text-based models significantly outperformed speech-based models for MI code prediction, with BERT embeddings proving more effective than traditional linguistic features, achieving F1-scores up to 0.721 compared to 0.531 for speech models.
- Context Importance: Including historical context from preceding client and therapist utterances yielded a statistically significant improvement in classification performance over using only the current client utterance, demonstrating the importance of conversational flow in understanding client behavior.
- Three-Class MI Code Classification: Successfully automated the detection of Change Talk (willingness to change), Sustain Talk (resistance to change), and Follow/Neutral utterances with an overall F1-score of 0.72, outperforming previous baseline work that achieved 0.566.
- Behavioral Outcome Prediction Challenges: Predicting subsequent alcohol-related behavioral outcomes showed only marginal improvement over the chance baseline, indicating the complexity of linking in-session behavior to real-world outcomes and highlighting the need for additional personal and contextual factors.
- Multimodal Fusion Limitations: While multimodal approaches combining text and speech showed promise, they slightly underperformed text-only models (F1-score of 0.714 vs. 0.721), likely due to low-quality audio recordings and noise interference affecting speech feature extraction.
Methodology
The researchers utilized real-world motivational interviewing sessions with 219 college students dealing with alcohol-related issues, collected in IRB-approved clinical studies with participant consent. The study employed a comprehensive multimodal approach combining advanced deep learning encoders with behavioral analysis techniques.
Data Collection and Preprocessing: The dataset consisted of audio recordings, manual transcriptions, and MISC (Motivational Interviewing Skill Code) annotations for both therapist and client utterances. Sessions averaged 49.85 minutes in length with 41,494 client utterances and 51,802 therapist utterances. Speechmatics was used for automatic timestamp alignment between audio and text.
Feature Extraction Architecture: For textual analysis, the team employed BERT (Bidirectional Encoder Representations from Transformers) to generate 768-dimensional embeddings per utterance, alongside LIWC (Linguistic Inquiry and Word Count) for interpretable psychological feature analysis. Speech processing utilized VGGish, a deep convolutional neural network pre-trained on AudioSet, generating 128-dimensional embeddings every 0.96 seconds, combined with eGeMAPS acoustic features for interpretability.
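BERT produces one 768-dimensional vector per token and VGGish one 128-dimensional vector per 0.96-second frame, so fixed-size per-utterance inputs must be derived from these sequences. A minimal NumPy sketch of the shapes involved; mean pooling over tokens is an illustrative choice here, not necessarily the paper's exact reduction:

```python
import numpy as np

def utterance_embedding(token_embs):
    """Mean-pool per-token BERT vectors (n_tokens, 768) into one utterance vector."""
    return token_embs.mean(axis=0)

# A 12-token utterance and ~10 s of audio (one VGGish frame per 0.96 s).
text_vec = utterance_embedding(np.random.randn(12, 768))
audio_frames = np.random.randn(10, 128)  # kept as a sequence for a recurrent encoder
```

The audio frames are deliberately left as a sequence, since the speech side is consumed by a recurrent encoder rather than pooled.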
Model Development: The core architecture featured separate encoders for text (fully connected layers) and speech (single-layer GRU for sequence processing) that mapped BERT and VGGish embeddings to fixed-size 256-dimensional representations. For contextual modeling, three preceding client-therapist dialogue turns were encoded separately and concatenated with current utterance representations.
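The encoder arrangement described above can be sketched in plain NumPy, with randomly initialised weights standing in for trained parameters (all function and variable names here are illustrative): a fully connected text encoder, a single-layer GRU speech encoder, and concatenation of the current-utterance encoding with three preceding turn encodings.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def text_encoder(bert_emb, w, b):
    """Fully connected layer mapping a 768-d BERT embedding to 256 dims."""
    return np.maximum(bert_emb @ w + b, 0.0)

def speech_encoder(frames, wz, uz, wr, ur, wh, uh):
    """Single-layer GRU over VGGish frames (n_frames, 128); return last 256-d state."""
    h = np.zeros(uz.shape[0])
    for x in frames:
        z = sigmoid(x @ wz + h @ uz)                      # update gate
        r = sigmoid(x @ wr + h @ ur)                      # reset gate
        h = (1 - z) * h + z * np.tanh(x @ wh + (r * h) @ uh)
    return h

# Randomly initialised weights stand in for trained parameters.
w_t, b_t = rng.normal(scale=0.1, size=(768, 256)), np.zeros(256)
gru_w = [rng.normal(scale=0.1, size=(128, 256)) for _ in range(3)]
gru_u = [rng.normal(scale=0.1, size=(256, 256)) for _ in range(3)]

current = text_encoder(rng.normal(size=768), w_t, b_t)
context = [text_encoder(rng.normal(size=768), w_t, b_t) for _ in range(3)]
speech = speech_encoder(rng.normal(size=(10, 128)),
                        gru_w[0], gru_u[0], gru_w[1], gru_u[1], gru_w[2], gru_u[2])

# Current-utterance encoding concatenated with three preceding turn encodings.
fused = np.concatenate([current] + context)
```

A classification head over `fused` (and, for the multimodal variant, over `speech` as well) would produce the three MI-code logits; that head is omitted here for brevity.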
Training and Evaluation Protocol: The study employed leave-one-subject-out cross-validation across the 219 sessions to ensure speaker independence. Class imbalance was addressed using weighted cross-entropy loss with weights inversely proportional to class frequency. Models were optimized using the Adam optimizer with learning rates of 10^-3 for unimodal and 10^-5 for fusion models, with a 10% validation holdout for model selection.
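The two training choices above, inverse-frequency class weights in the cross-entropy loss and speaker-independent folds, can be sketched as follows (function names are illustrative):

```python
import numpy as np

def class_weights(labels, n_classes=3):
    """Weights inversely proportional to class frequency."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts.sum() / (n_classes * counts)

def weighted_cross_entropy(logits, label, weights):
    """Softmax cross-entropy for one example, scaled by its class weight."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -weights[label] * np.log(p[label])

def leave_one_subject_out(session_ids):
    """Yield (train, test) index lists, holding out one subject's session per fold."""
    for sid in sorted(set(session_ids)):
        train = [i for i, s in enumerate(session_ids) if s != sid]
        test = [i for i, s in enumerate(session_ids) if s == sid]
        yield train, test

labels = np.array([0, 0, 0, 1, 1, 2])                    # imbalanced toy labels
w = class_weights(labels)                                # rare classes weighted up
loss = weighted_cross_entropy(np.array([2.0, 0.5, -1.0]), 2, w)
folds = list(leave_one_subject_out(["s1", "s1", "s2", "s2", "s3", "s3"]))
```

With 219 sessions this yields 219 folds, each testing on one unseen speaker, which is what makes the reported F1-scores speaker-independent.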
Behavioral Outcome Analysis: For predicting alcohol-related outcomes (Change in Typical Blood Alcohol Content and Change in Alcohol-Related Problems), the researchers extracted sequences of client utterances by MI code type and developed GRU-based models to capture temporal patterns within sessions.
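One way to assemble the per-code utterance sequences described above, assuming each utterance is already a fixed-size embedding (the helper name is hypothetical):

```python
import numpy as np

def code_sequence(utterance_embs, codes, target_code):
    """Ordered embeddings of one MISC code type (e.g. Change Talk) within a session."""
    seq = [e for e, c in zip(utterance_embs, codes) if c == target_code]
    return np.stack(seq) if seq else np.empty((0, utterance_embs[0].shape[0]))

embs = [np.random.randn(768) for _ in range(4)]
codes = ["CT", "ST", "CT", "FN"]          # Change Talk, Sustain Talk, Follow/Neutral
change_talk = code_sequence(embs, codes, "CT")  # fed to a GRU in session order
```

Each such sequence preserves within-session order, which is what lets a GRU capture the temporal patterns used for outcome prediction.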
Impact
This research represents a significant advancement in automated therapeutic assessment with broad implications for both clinical practice and computational behavior analysis. The work demonstrates the feasibility of using advanced NLP and speech processing techniques to objectively assess therapy effectiveness and client engagement patterns.
Clinical Practice Innovation: The automated MI code detection system offers therapists and researchers an objective, efficient alternative to manual coding, which is traditionally time-consuming and subject to inter-rater variability. With F1-scores reaching 0.72, the system performs well enough to support practical behavioral coding, potentially enabling real-time feedback during therapy sessions.
Scalable Therapy Assessment: The speaker-independent model architecture allows for deployment across different therapeutic settings without requiring speaker-specific retraining, making it suitable for large-scale implementation in clinical practice, training programs, and research studies.
Methodological Contributions: The study’s integration of contextual information and demonstration of text modality superiority provides crucial insights for future therapeutic AI development. The finding that historical context significantly improves classification performance establishes the importance of conversational flow in understanding client behavior change patterns.
Research Infrastructure Development: By successfully applying state-of-the-art deep learning models (BERT, VGGish) to therapeutic interaction analysis, the work establishes a methodological framework that can be adapted to other therapy modalities and behavioral health applications.
Future Therapeutic AI Applications: The research lays groundwork for developing real-time therapy assistance tools, automated quality assurance systems for therapy training, and objective outcome prediction models that could enhance treatment planning and intervention strategies. The modest success in behavioral outcome prediction highlights important challenges and directions for future research incorporating personal and environmental factors.