Authors
Larry Zhang, Radhika Duvvuri, Kiranmayi K. L. Chandra, Theresa Nguyen, Reza H. Ghomi
Abstract
Importance: Depression is an illness affecting a large percentage of the world’s population across the lifespan. To date, there is no available biomarker for depression, and tracking of symptoms relies on patient self‐report.
Objective: To explore and validate features extracted from recorded voice samples of depressed subjects as digital biomarkers for suicidality, psychomotor disturbance, and depression severity.
Design: We conducted a cross‐sectional study over the course of 12 months using a frequently visited web-form version of the PHQ9 hosted by Mental Health America (MHA), inviting subjects to provide anonymous voice samples via a separate web form hosted by NeuroLex Laboratories. Subjects were asked to provide demographics, answers to the PHQ9, and two voice samples.
Setting: Online only.
Participants: Users of the MHA website.
Main Outcomes and Measures: Performance of statistical models using extracted voice features to predict psychomotor disturbance, suicidality, and depression severity as indicated by the PHQ9.
Results: Voice features extracted from recorded audio of depressed subjects were able to predict PHQ9 question 9 and total scores with an area under the curve of 0.821 and a mean absolute error of 4.7, respectively. Psychomotor disturbance prediction was less powerful, with an area under the curve of 0.61.
Conclusion and Relevance: Automated voice analysis using short recordings of patient speech may be used to augment depression screening and symptom management.
Key Findings
- Strong Suicidality Prediction: Voice features successfully predicted suicidal ideation (PHQ9 question 9) with an area under the curve of 0.821, demonstrating robust predictive power for this critical mental health indicator.
- Depression Severity Assessment: Achieved a mean absolute error of 4.7 for predicting overall depression severity, comparable to or better than existing benchmark datasets such as DAIC-WOZ, suggesting that brief voice samples can be as effective as clinical interviews.
- Feature-Specific Performance: The best-performing models used linguistic N-gram TF-IDF features for suicidality prediction (AUC 0.821), acoustic features for overall depression severity assessment, and prosodic features for psychomotor disturbance detection (AUC 0.61).
- Crowdsourced Data Collection Feasibility: Demonstrated that anonymous, online voice collection through existing web traffic can gather clinically relevant data from 222 participants over 10 months without active recruitment or incentives.
- Voice Feature Correlations: Identified specific voice characteristics associated with depression, including changes in pitch variability, spectral characteristics (MFCC features), pause patterns, speech timing, and linguistic content patterns that correlated with symptom severity.
Methodology
The researchers conducted a 12-month cross-sectional study in partnership with Mental Health America, leveraging its high-traffic depression screening website to collect anonymous voice samples. The study used an innovative crowdsourced approach in which users completing the PHQ9 depression questionnaire were invited to donate voice samples through a separate NeuroLex Laboratories web application.
Data Collection Protocol: Participants provided two voice samples: (1) reading the phrase “The quick brown fox jumps over the lazy dog,” a pangram containing every letter of the alphabet and eliciting a broad range of phonemes, and (2) giving a 30-second free speech sample for richer linguistic content. Data collection was limited to these tasks to minimize participant dropout while maintaining anonymity.
Voice Feature Extraction: Implemented three comprehensive feature extraction approaches: (1) Acoustic features using the Extended Geneva Minimalistic Acoustic Parameterization Set (eGeMAPS) with 88 features including F0, harmonics, loudness, and spectral characteristics, (2) Prosodic features measuring speech timing, pause patterns, and rhythm using custom webRTC-based analysis, and (3) Linguistic features through manual transcription and vectorization using count vectorization and N-gram TF-IDF methods.
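As an illustration of the third approach, n-gram TF-IDF vectorization of transcripts can be sketched with scikit-learn. This is a minimal sketch, not the study's pipeline; the transcripts below are invented examples, and the `ngram_range` choice is an assumption (the paper does not specify its n-gram orders here).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-in transcripts; the study used manual transcriptions
# of the 30-second free-speech samples.
transcripts = [
    "i have been feeling tired and hopeless lately",
    "today was a good day i went for a walk",
    "i cannot sleep and nothing feels worth doing",
]

# Word uni- and bigram TF-IDF, analogous to the paper's N-gram TF-IDF
# linguistic features (ngram_range is an assumed setting).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(transcripts)

# One row per transcript, one column per distinct uni-/bigram.
print(X.shape)
```

The resulting sparse matrix is what downstream classifiers would consume in place of, or alongside, the acoustic and prosodic feature sets.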
Data Preprocessing and Validation: Applied rigorous quality control including minimum file size requirements (353 KB), voice activity detection to remove unvoiced samples, and noise reduction using second-order Butterworth band-pass filtering (300 Hz - 3.4 kHz). The final dataset included 390 valid audio files from 222 unique participants.
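The band-pass stage described above can be sketched with SciPy. This is a minimal sketch under stated assumptions: a second-order Butterworth design at 300 Hz - 3.4 kHz as the paper reports, with zero-phase filtering (`sosfiltfilt`) chosen here for illustration; the study's exact filter implementation is not specified.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_300_3400(audio, sr):
    """Second-order Butterworth band-pass, 300 Hz - 3.4 kHz (telephone band).

    Note: sosfiltfilt applies the filter forward and backward, which
    doubles the effective order; a single-pass sosfilt would match the
    nominal order exactly.
    """
    sos = butter(2, [300.0, 3400.0], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, audio)

# Demo: 50 Hz hum (out of band) plus a 1 kHz tone (in band).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
filtered = bandpass_300_3400(signal, sr)  # hum attenuated, tone retained
```

The 300 Hz - 3.4 kHz band matches the passband of telephone speech, which preserves intelligibility while discarding low-frequency rumble and high-frequency noise.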
Statistical Modeling: Used gradient-boosted tree models with ElasticNet regularization for binary classification of symptoms and regression for depression severity prediction. Employed five-fold cross-validation with SMOTE up-sampling to address class imbalances, and evaluated performance using area under the curve (AUC) for classification and mean absolute error (MAE) for regression tasks.
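A minimal sketch of the classification pipeline using scikit-learn on synthetic data: the paper's exact model configuration (including its ElasticNet regularization settings) is not reproduced, scikit-learn's `GradientBoostingClassifier` stands in for the gradient-boosted tree model, and plain random over-sampling of the minority class stands in for SMOTE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the voice-feature matrix, with an imbalanced
# label (as with PHQ9 question 9 suicidality responses).
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

rng = np.random.default_rng(0)
aucs = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Up-sample the minority class inside the training fold only
    # (the paper used SMOTE; duplication is a simplification).
    maj = np.where(y_tr == 0)[0]
    mino = np.where(y_tr == 1)[0]
    up = rng.choice(mino, size=len(maj), replace=True)
    idx = np.concatenate([maj, up])

    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_tr[idx], y_tr[idx])
    prob = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], prob))

print(round(float(np.mean(aucs)), 3))  # mean AUC across the five folds
```

Resampling inside each training fold, never on the held-out fold, is what keeps the cross-validated AUC an honest estimate; up-sampling before splitting would leak duplicated minority samples into the test folds.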
Impact
This research represents a pioneering approach to depression assessment through digital biomarkers, with significant implications for mental health screening, monitoring, and accessibility. The study demonstrates the feasibility of using brief voice samples for objective depression assessment, potentially addressing limitations of traditional self-report measures.
Clinical Assessment Innovation: The ability to predict suicidality with AUC 0.821 using just 30 seconds of speech represents a breakthrough for suicide risk assessment, offering a potentially more objective and accessible screening method than traditional questionnaires. This could enable earlier intervention and more frequent monitoring of at-risk individuals.
Scalable Mental Health Screening: The crowdsourced data collection approach demonstrates potential for large-scale, cost-effective depression screening through existing web platforms. With Mental Health America’s website receiving tens of thousands of PHQ9 completions monthly, this method could facilitate population-level mental health surveillance.
Digital Therapeutics Foundation: The identification of specific voice biomarkers for depression symptoms establishes groundwork for digital therapeutic applications, including smartphone-based monitoring tools and voice-enabled mental health applications that could provide continuous, unobtrusive assessment of symptom changes.
Research Methodology Advancement: The study’s online-only, anonymous data collection methodology offers a model for mental health research that overcomes traditional barriers including geographical limitations, stigma concerns, and recruitment challenges, potentially enabling more diverse and representative research populations.
Future Clinical Integration: The comparable performance to clinical interview-based assessments (DAIC-WOZ) suggests potential for integrating voice biomarkers into clinical workflows, electronic health records, and telemedicine platforms to augment traditional depression screening and monitoring approaches.