
Automatic Depression Detection Using an Interpretable Audio-Textual Multi-modal Transformer-based Model

This repository contains the code and resources for our multi-modal Transformer-based framework that detects depression from audio and text modalities. Our approach leverages the self-attention mechanism to improve diagnostic accuracy and provide interpretability by identifying the features (tokens in text or acoustic cues in audio) that contribute most strongly to the model’s predictions.


TABLE OF CONTENTS

  1. Overview
  2. Key Features
  3. Model Architecture
  4. Installation
  5. Usage
  6. Datasets
  7. Results
  8. Interpretability
  9. Contributors
  10. Citation

  1. OVERVIEW

Depression is a common mental disorder characterized by persistent low mood, loss of interest, and fatigue. Automated systems for early detection can encourage timely clinical intervention. In this project, we propose:
  • A multi-modal Transformer model that fuses text and audio features.
  • An interpretable architecture that provides attention-based insights into what drives classification decisions.

  2. KEY FEATURES
  • Multi-modal Fusion: Combines audio (mel-spectrogram + NetVLAD) and text (BERT embeddings) signals; a feature-extraction sketch follows this list.
  • Transformer-based: Leverages self-attention to capture long-range dependencies in both modalities.
  • High Accuracy: Evaluated on benchmark datasets, achieving state-of-the-art or near state-of-the-art performance.
  • Explainable: Visualizes attention weights in the text modality to highlight which words or phrases are most influential for the final prediction.
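The snippet below is a minimal sketch of the kind of feature extraction described above, assuming librosa for the mel-spectrogram and Hugging Face transformers for the BERT embeddings; the parameter choices (16 kHz sampling, 80 mel bins, bert-base-uncased) are illustrative assumptions, not values taken from this repository.

```python
# Illustrative feature extraction for the two modalities (assumed libraries:
# librosa, transformers; parameter values are not taken from this repository).
import librosa
import torch
from transformers import BertModel, BertTokenizer


def audio_features(wav_path, sr=16000, n_mels=80):
    """Load a recording and return a log mel-spectrogram (n_mels x frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)  # downstream, NetVLAD pools this into a 128-d vector


def text_features(transcript, model_name="bert-base-uncased"):
    """Return contextual BERT embeddings, one vector per token."""
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (num_tokens, 768)
```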

  3. MODEL ARCHITECTURE
  1. Text Encoding
    • Uses a BERT tokenizer and embedding layer to generate contextual embeddings of the text transcript.
  2. Audio Encoding
    • Converts audio signals into mel-spectrograms, then encodes them via NetVLAD to produce 128-dimensional feature vectors.
  3. Transformer Encoder
    • Processes each modality (text/audio) through a Transformer encoder, capturing contextual relationships and long-range dependencies.
  4. Fusion
    • Concatenates the encoded audio and text embeddings into a joint representation.
  5. Classification
    • Passes the fused representation through a feed-forward network to predict depressed or not depressed.

A schematic of the approach might look like this:

Audio --> NetVLAD ----------> Transformer Encoder ---\
                                                       --> Concatenate --> Feed-Forward --> Output
Text  --> BERT Embeddings --> Transformer Encoder ----/
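A minimal PyTorch sketch of the fusion path in this schematic, assuming 768-dimensional BERT embeddings and the 128-dimensional NetVLAD audio vectors described above; the number of layers, heads, hidden size, and the mean-pooling are assumptions, and the repository's actual implementation may differ.

```python
import torch
import torch.nn as nn


class MultimodalDepressionClassifier(nn.Module):
    """Sketch: one Transformer encoder per modality, concatenation, feed-forward head."""

    def __init__(self, text_dim=768, audio_dim=128, n_heads=8, n_layers=2, hidden=256):
        super().__init__()
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=audio_dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Feed-forward classifier over the concatenated pooled representations.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # depressed / not depressed
        )

    def forward(self, text_emb, audio_emb):
        # text_emb: (B, num_tokens, 768) BERT embeddings
        # audio_emb: (B, num_segments, 128) NetVLAD vectors (num_segments may be 1)
        t = self.text_encoder(text_emb).mean(dim=1)    # pool over tokens
        a = self.audio_encoder(audio_emb).mean(dim=1)  # pool over audio segments
        fused = torch.cat([t, a], dim=-1)              # concatenation fusion
        return self.classifier(fused)                  # logits
```

In this sketch, batched versions of the text and audio features from the previous section would feed the two branches.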


  4. INSTALLATION
  1. Clone this repository:

     git clone https://github.com/your-username/Depression-Detection.git
     cd Depression-Detection

  2. Create a virtual environment (recommended) and install dependencies:

     python -m venv venv
     source venv/bin/activate    # On Windows: venv\Scripts\activate
     pip install -r requirements.txt

  3. Download any required checkpoints (e.g., pre-trained BERT weights) as prompted by the code or from links provided in the repository.
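If the code expects Hugging Face weights, they can be fetched and cached ahead of time; which checkpoint the repository actually uses is not stated here, so bert-base-uncased below is only an assumption.

```python
# Pre-download and cache a BERT checkpoint (assumed: bert-base-uncased).
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
```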


  5. USAGE
  1. Prepare Data

    • Organize your dataset (audio + text files) in the structure expected by the scripts in data_preprocessing/.
    • Ensure the audio and text transcripts are correctly aligned.
  2. Run Training

     python train.py --config configs/config.json

    • Edit hyperparameters (e.g., learning rate, batch size) in configs/config.json; an illustrative config sketch follows this list.
  3. Run Evaluation

     python evaluate.py --config configs/config.json --checkpoint path_to_checkpoint

    • Reports accuracy, F1-score, and other metrics on the test set.
  4. Visualize Attention

     python visualize_attention.py --checkpoint path_to_checkpoint --sample_text "Your sample text here"

    • Saves attention heatmaps to the visualizations/ directory.
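As a rough illustration of step 2, a configs/config.json might contain entries like the ones below; every key name and value here is a hypothetical placeholder, not the repository's actual schema.

```json
{
  "learning_rate": 1e-5,
  "batch_size": 16,
  "num_epochs": 30,
  "text_encoder": "bert-base-uncased",
  "audio_feature_dim": 128,
  "checkpoint_dir": "checkpoints/"
}
```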

  6. DATASETS

We tested our framework on two datasets:
  1. DAIC-WOZ

    • Clinical interviews for diagnosing psychological distress, including depression, anxiety, and PTSD.
    • Contains both audio recordings and transcripts, with depression labels (PHQ-8).
  2. EATD Corpus

    • Emotional audio-textual dataset for automatic depression detection (labels based on the Self-Rating Depression Scale, SDS).
    • Contains shorter recordings (Chinese) and corresponding transcriptions.

Preprocessing scripts for both datasets can be found in the data_preprocessing/ folder. These scripts handle tasks such as audio segmentation, text tokenization, and alignment of modalities.
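As a rough sketch of what modality alignment involves for an interview-style dataset such as DAIC-WOZ, the snippet below slices the participant's speech out of the session audio using transcript timestamps; the column names and tab-separated layout are assumptions, and the scripts in data_preprocessing/ remain the authoritative reference.

```python
# Sketch: pair each participant transcript turn with its audio slice.
# Column names (speaker, start_time, stop_time, value) and the tab-separated
# layout are assumptions about a DAIC-WOZ-style transcript file.
import librosa
import pandas as pd


def align_session(audio_path, transcript_path, sr=16000):
    y, _ = librosa.load(audio_path, sr=sr)
    rows = pd.read_csv(transcript_path, sep="\t")
    segments = []
    for _, row in rows.iterrows():
        if str(row["speaker"]).strip().lower() != "participant":
            continue  # keep only the interviewee's turns
        start, stop = int(row["start_time"] * sr), int(row["stop_time"] * sr)
        segments.append({"text": row["value"], "audio": y[start:stop]})
    return segments  # list of aligned (text, audio) pairs
```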


  7. RESULTS

EATD:

Model                                   Modality     F1 Score
Bi-LSTM & GRU + Attention (Shen 2022)   Text+Audio   0.71
Proposed Transformer                    Text+Audio   0.82

DAIC-WOZ:

Model                                    Modality     Accuracy (%)
Topic-Attentive Transformer (Guo 2022)   Text+Audio   73.9
Proposed Transformer                     Text+Audio   75.8

Our multi-modal Transformer outperforms single-modal baselines and other state-of-the-art methods on both DAIC-WOZ and EATD corpora.
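For reference, the two headline metrics (F1 on EATD, accuracy on DAIC-WOZ) follow their standard definitions, which scikit-learn computes as shown below; this is purely illustrative and not code from this repository.

```python
# Standard metric computation from binary predictions (illustrative only).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1]  # ground-truth labels (1 = depressed)
y_pred = [0, 1, 0, 0, 1]  # model predictions
print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.8
print("F1 score:", f1_score(y_true, y_pred))        # 0.8
```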


  8. INTERPRETABILITY

To interpret the model’s decisions, we visualize the self-attention weights from the Transformer’s text encoder.
  • Attention Heatmaps: Provide insights into which tokens (e.g., “smart”, “very”, “uh”) the model focuses on.
  • Multi-head Attention: Each head may capture different semantics. Some heads may focus on emotion-related words, others on linguistic structure.
  • Future Work: Extending attention-based interpretability to audio features (currently encoded as a single vector) would provide more granular insights into acoustic cues.

Example self-attention heatmap (text modality):

Token Index:   1     2       3       ...  N
Token:         [I]   [feel]  [very]  ...  [PAD]
Head0_Attn:    0.2   0.8     0.7     ...  0.0
Head1_Attn:    0.0   0.1     0.1     ...  0.8
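A minimal matplotlib sketch of how a per-head heatmap like the one above can be rendered, assuming the attention weights for the chosen tokens have already been extracted from the text encoder; visualize_attention.py is the repository's actual entry point, and this sketch makes no claim about its internals.

```python
# Render per-head attention over tokens as a heatmap (illustrative sketch).
import matplotlib.pyplot as plt
import numpy as np


def plot_head_attention(attn, tokens, out_path="visualizations/attention.png"):
    """attn: (num_heads, num_tokens) attention weights for one query position."""
    fig, ax = plt.subplots(figsize=(0.6 * len(tokens), 0.6 * attn.shape[0]))
    im = ax.imshow(attn, cmap="viridis", aspect="auto")
    ax.set_xticks(np.arange(len(tokens)), labels=tokens, rotation=45, ha="right")
    ax.set_yticks(np.arange(attn.shape[0]), labels=[f"Head{i}" for i in range(attn.shape[0])])
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    fig.savefig(out_path)
```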


  9. CONTRIBUTORS

  10. CITATION

If you find this repository helpful in your research or projects, please cite our work:

NOTE: This project is intended for research purposes only. It is not a substitute for professional mental health diagnosis or treatment. Always seek advice from qualified healthcare professionals.
