Authors
Larry Zhang, Joshua Driscol, Xiaotong Chen, Reza Hosseini Ghomi
Abstract
Depression affects hundreds of millions of individuals worldwide. With the prevalence of depression increasing, the economic costs of the illness are growing significantly. The AVEC 2019 Detecting Depression with AI (Artificial Intelligence) Sub-Challenge provides an opportunity to use novel signal processing, machine learning, and artificial intelligence technology to predict the presence and severity of depression in individuals through digital biomarkers such as vocal acoustics, linguistic content of speech, and facial expression. In our analysis, we point out key factors to consider during pre-processing and modeling to effectively build voice biomarkers for depression. We additionally verify the dataset for balance in demographic and severity score distribution to evaluate the generalizability of our results.
Key Findings
- Dataset Imbalance Issues: Identified significant bias in both the AVEC 2017 and 2019 datasets toward non-MDD participants, with non-MDD to MDD ratios of 5:2 and 3:1 respectively, raising concerns about model generalizability and the models' ability to adequately represent depressed participants.
- Audio Contamination Problems: Discovered that the robotic interviewer Ellie's voice was included in acoustic feature extractions, potentially introducing noise and bias. Models performed similarly whether Ellie's voice was included or excluded, though participant-only audio showed marginally better performance in some cases.
- Linguistic Feature Correlations: Found meaningful correlations between depression severity and linguistic patterns, including first-person singular pronoun usage (r = 0.16 with PHQ-8), negative-valence word frequency (r > 0.42), and third-person pronoun usage (r = 0.31-0.36 in the development set).
- Model Performance: Random Forest models consistently outperformed logistic regression across both datasets, achieving the best results on the AVEC 2019 test set with a Mean Absolute Error of 5.77 and an RMSE of 6.78, though overall prediction error remained high (roughly ±5.8 PHQ-8 points from actual scores).
- Methodological Recommendations: Emphasized the importance of participant-level rather than question-level modeling to avoid interdependencies, proper handling of out-of-vocabulary items, and the need for larger, more balanced datasets to improve depression detection accuracy.
Methodology
The researchers conducted a comprehensive evaluation of the AVEC 2019 Detecting Depression with AI Sub-Challenge dataset, implementing rigorous preprocessing and modeling approaches for both acoustic and linguistic features.
Dataset Analysis and Preprocessing: The study utilized data from 275 participants drawn from military veteran populations and the general public, with PHQ-8 depression severity scores and PTSD co-morbidity measures. Audio preprocessing used transcript timestamps to isolate participant-only speech segments, producing two versions of each dataset: one with and one without the robotic interviewer Ellie's voice.
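To make the preprocessing step concrete, here is a minimal sketch of timestamp-based participant isolation, assuming DAIC-WOZ-style transcripts with speaker, start_time, and stop_time columns; the file names and column names are illustrative rather than the challenge's exact layout.

```python
import numpy as np
import pandas as pd
import soundfile as sf

def extract_participant_audio(wav_path, transcript_path):
    """Keep only the participant's speech, dropping Ellie's turns."""
    audio, sr = sf.read(wav_path)
    # Transcript assumed: one row per speaker turn, times in seconds.
    turns = pd.read_csv(transcript_path, sep="\t")
    segments = [
        audio[int(row.start_time * sr):int(row.stop_time * sr)]
        for row in turns.itertuples()
        if row.speaker == "Participant"
    ]
    return np.concatenate(segments), sr

# Usage: write a participant-only copy alongside each session recording.
# clean, sr = extract_participant_audio("session_AUDIO.wav", "session_TRANSCRIPT.csv")
# sf.write("session_PARTICIPANT_ONLY.wav", clean, sr)
```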
Acoustic Feature Extraction: For the AVEC 2017 data, researchers used the COVAREP toolbox to extract 79 base features with 20 ms windows shifted every 10 ms, then calculated higher-order statistics (mean, maximum, minimum, median, standard deviation, skewness, kurtosis), yielding 553 final features (79 features × 7 statistics). For the AVEC 2019 data, they employed the eGeMAPS parameter set via the openSMILE toolkit, extracting 88 acoustic features with 1-second windows.
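The statistics aggregation maps cleanly to a few lines of NumPy/SciPy. The sketch below assumes the frame-level COVAREP features are already loaded as an (n_frames × 79) array; the seven statistics over 79 base features reproduce the 553-dimensional session vector.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def session_level_features(frames: np.ndarray) -> np.ndarray:
    """Collapse frame-level features (n_frames x 79 for COVAREP)
    into one fixed-length vector of higher-order statistics:
    7 statistics x 79 base features = 553 dimensions."""
    stats = [
        frames.mean(axis=0),
        frames.max(axis=0),
        frames.min(axis=0),
        np.median(frames, axis=0),
        frames.std(axis=0),
        skew(frames, axis=0),
        kurtosis(frames, axis=0),
    ]
    return np.concatenate(stats)

# e.g. frames = np.loadtxt("session_COVAREP.csv", delimiter=",")  # 10 ms hop
# x = session_level_features(frames)  # shape (553,)
```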
Linguistic Analysis Framework: Developed theory-driven linguistic features, including absolutist word frequency, negative-valence word usage from General Inquirer categories, the ratio of first-person singular to first-person plural pronouns to model intrapersonal distress, and normalized word counts. Used both frequency vectors and Doc2Vec embeddings for text representation.
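A simplified sketch of the frequency-based features described above; the word lists here are toy placeholders standing in for the General Inquirer categories and full pronoun inventories actually used.

```python
import re

# Illustrative word lists only; the study drew its negative-valence
# category from the General Inquirer lexicon, not this toy set.
FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
FIRST_PLURAL = {"we", "us", "our", "ours", "ourselves"}
NEGATIVE_VALENCE = {"sad", "tired", "hopeless", "worthless", "alone"}

def linguistic_features(transcript: str) -> dict:
    """Frequency-style features normalized by total word count."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    n = max(len(tokens), 1)  # guard against empty transcripts
    fps = sum(t in FIRST_SINGULAR for t in tokens)
    fpp = sum(t in FIRST_PLURAL for t in tokens)
    return {
        "word_count": len(tokens),
        "first_singular_rate": fps / n,
        "negative_valence_rate": sum(t in NEGATIVE_VALENCE for t in tokens) / n,
        # Ratio used to model intrapersonal vs. interpersonal focus.
        "singular_to_plural_ratio": fps / max(fpp, 1),
    }
```

For the Doc2Vec representation, the same tokenized transcripts would instead be fed to a document-embedding model (e.g., gensim's Doc2Vec) to produce dense vectors alongside these counts.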
Statistical and Modeling Approaches: Applied Pearson’s and Spearman’s correlation analyses to identify feature-target relationships. For predictive modeling, used participant-level sampling (163 training, 56 development, 56 test samples) with traditional machine learning methods including logistic regression with L1 regularization, Random Forest with hyperparameter tuning, AdaBoost, and Ridge regression. Performance evaluation used RMSE and MAE metrics with k-fold cross-validation.
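A compressed scikit-learn sketch of the participant-level pipeline: correlation screening followed by a tuned Random Forest scored with MAE/RMSE under cross-validation. The hyperparameter grid is illustrative rather than the paper's, and the logistic regression, AdaBoost, and Ridge variants would slot in the same way.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.metrics import mean_absolute_error, mean_squared_error

def screen_features(X, y):
    """Pearson (linear) and Spearman (rank) correlation of each
    feature column against the PHQ-8 target."""
    return [(pearsonr(X[:, j], y)[0], spearmanr(X[:, j], y)[0])
            for j in range(X.shape[1])]

def evaluate_rf(X, y):
    """Nested cross-validation: tune a Random Forest inside each fold,
    then score MAE/RMSE on held-out participants. X has one row per
    participant, never per question, to avoid within-subject dependence."""
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},  # illustrative grid
        scoring="neg_mean_absolute_error",
        cv=5,
    )
    preds = cross_val_predict(search, X, y, cv=5)
    mae = mean_absolute_error(y, preds)
    rmse = np.sqrt(mean_squared_error(y, preds))
    return mae, rmse
```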
Data Quality Assessment: Conducted extensive visualization and statistical analysis of demographic distributions, PHQ-8 score distributions, and gender balance across training/development/test splits to evaluate dataset representativeness and potential biases affecting model generalizability.
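The split-level balance checks reduce to a few pandas group-bys. The sketch below assumes a hypothetical labels.csv with one row per participant and split, gender, and phq8 columns, and uses the standard PHQ-8 >= 10 screening cutoff to define the MDD class.

```python
import pandas as pd

# labels.csv is a hypothetical file: one row per participant with
# split, gender, and phq8 columns (names illustrative).
df = pd.read_csv("labels.csv")

# PHQ-8 >= 10 is the usual screening cutoff for major depression.
df["mdd"] = df["phq8"] >= 10

print(df.groupby("split")["mdd"].mean())        # MDD proportion per split
print(df.groupby(["split", "gender"]).size())   # gender balance per split
print(df.groupby("split")["phq8"].describe())   # score distribution per split
```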
Impact
This research provides critical insights into the challenges and considerations necessary for developing robust digital biomarkers for depression detection, with significant implications for both research methodology and clinical applications.
Methodological Contributions: The study establishes important best practices for depression detection research, including the necessity of isolating participant-only audio, implementing participant-level rather than question-level modeling to avoid statistical dependencies, and conducting thorough dataset bias analysis before model development. These methodological insights are crucial for researchers working with similar datasets.
Dataset Quality Assessment: By systematically identifying and quantifying dataset limitations—including severe class imbalance, audio contamination, and distributional biases between training and development sets—the research provides a blueprint for critically evaluating mental health datasets and highlights the importance of data quality in achieving reliable clinical applications.
Clinical Translation Challenges: The findings reveal significant obstacles to clinical deployment of AI-based depression detection systems, including the high prediction errors (±5.8 PHQ-8 points) that may limit clinical utility. The research emphasizes that meaningful clinical applications require substantially larger, more representative datasets and more sophisticated modeling approaches.
Signal Processing Insights: The identification of meaningful acoustic and linguistic correlates of depression (e.g., spectral flux, pitch characteristics, pronoun usage patterns, negative word frequency) provides empirical support for voice and language as depression biomarkers while highlighting the complexity of extracting reliable signals from naturalistic clinical data.
Future Research Directions: The study’s recommendations for improving dataset quality, including better demographic balance, individual symptom-level analysis rather than aggregate PHQ scores, and accounting for co-morbidities like PTSD and anxiety, provide a roadmap for developing more clinically relevant depression detection systems. The emphasis on traditional machine learning approaches for small datasets also offers practical guidance for researchers with limited computational resources.