- Organize code for prompt design, model fine-tuning, and inference
- Provide hyperparameters for the experiments
- Release model weights to the Hugging Face Hub (upon acceptance)
This is the repository for the paper Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data. An updated version of this paper is under review.
In this work, we present the first comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously.
Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 by 10.9% and the best of GPT-4 by 4.8% on balanced accuracy, and perform on par with the state-of-the-art task-specific language model.
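For readers unfamiliar with the metric, the sketch below shows how balanced accuracy differs from plain accuracy on an imbalanced label set. It assumes the standard definition (the mean of per-class recall); the function name and the toy labels are illustrative only and are not taken from our evaluation code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")
    total = defaultdict(int)    # number of examples per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example (made-up labels): a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A majority-class predictor looks strong on plain accuracy but scores only chance level on balanced accuracy, which is why the latter is a more informative metric when labels are imbalanced.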
We have publicly released our fine-tuned model weights on the Hugging Face Hub. The use of both sets of model weights is limited to research purposes only:
- Mental-Alpaca: https://huggingface.co/NEU-HAI/mental-alpaca
- Mental-FLAN-T5: https://huggingface.co/NEU-HAI/mental-flan-t5-xxl
You can find sample code for loading both models directly in the repositories above. Details about the prompts, training process, and evaluations can be found in our paper. Loading Mental-Alpaca and Mental-FLAN-T5 requires 27GB and 44GB of GPU memory, respectively, and inference requires additional GPU memory.
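As a rough illustration, the sketch below checks the memory figures quoted above and loads a model with the `transformers` Auto classes. It assumes that both repositories resolve with `AutoTokenizer`, `AutoModelForCausalLM` (Mental-Alpaca is decoder-only) and `AutoModelForSeq2SeqLM` (Mental-FLAN-T5 is encoder-decoder); the helper names and dictionary keys are hypothetical, and the sample code on the model pages takes precedence.

```python
# Hugging Face repositories listed above.
MODEL_IDS = {
    "mental-alpaca": "NEU-HAI/mental-alpaca",
    "mental-flan-t5": "NEU-HAI/mental-flan-t5-xxl",
}

# GPU memory (GB) needed just to load each model, as stated in this README.
MIN_LOAD_GB = {"mental-alpaca": 27, "mental-flan-t5": 44}


def fits_for_loading(name, available_gb):
    """Return True if available GPU memory covers the stated loading requirement."""
    return available_gb >= MIN_LOAD_GB[name]


def load_model(name):
    """Load tokenizer and model (assumption: the Auto classes resolve both repos)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
    )

    repo = MODEL_IDS[name]
    model_cls = AutoModelForSeq2SeqLM if name == "mental-flan-t5" else AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = model_cls.from_pretrained(repo)
    return tokenizer, model


if __name__ == "__main__":
    # Example: a single 40GB GPU is enough to load Mental-Alpaca but not Mental-FLAN-T5.
    for name in MODEL_IDS:
        print(name, fits_for_loading(name, available_gb=40))
```

Note that the check only covers loading; inference needs headroom beyond these figures.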
- Dreaddit: This dataset contains Reddit posts from ten subreddits across five domains (abuse, social, anxiety, PTSD, and financial). We used this dataset for post-level binary stress prediction (Task 1).
- DepSeverity: This dataset uses the same posts collected in Dreaddit but focuses on depression. We employed this dataset for two post-level tasks: binary depression prediction (i.e., whether a post showed at least mild depression, Task 2) and four-level depression prediction (Task 3).
- SDCNL: This dataset also contains Reddit posts, including posts from r/SuicideWatch and r/Depression. We employed this dataset for post-level binary suicide ideation prediction (Task 4).
- CSSRS-Suicide: This dataset contains posts from 15 mental health-related subreddits. We used this dataset for two user-level tasks: binary suicide risk prediction (i.e., whether a user showed at least a suicide indicator, Task 5) and five-level suicide risk prediction (Task 6).
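The six tasks described above can be summarized as a small registry. This is only an illustrative sketch: the `Task` class, its field names, and the `tasks_for` helper are hypothetical and do not correspond to code in this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: int      # task number used in this README
    dataset: str      # source dataset
    target: str       # what is being predicted
    level: str        # "post" or "user" level prediction
    num_classes: int  # 2 for binary tasks


TASKS = [
    Task(1, "Dreaddit", "stress", "post", 2),
    Task(2, "DepSeverity", "depression", "post", 2),
    Task(3, "DepSeverity", "depression", "post", 4),
    Task(4, "SDCNL", "suicide ideation", "post", 2),
    Task(5, "CSSRS-Suicide", "suicide risk", "user", 2),
    Task(6, "CSSRS-Suicide", "suicide risk", "user", 5),
]


def tasks_for(dataset):
    """Return all tasks defined on the given dataset."""
    return [t for t in TASKS if t.dataset == dataset]


for task in TASKS:
    kind = "binary" if task.num_classes == 2 else f"{task.num_classes}-level"
    print(f"Task {task.task_id}: {task.level}-level {kind} {task.target} ({task.dataset})")
```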
- Alpaca-7b
- Alpaca-LoRA
- FLAN-T5-XXL
- GPT-3.5
- GPT-4
More results can be found in the paper.
- MentalRoBERTa (Baseline)
  - For each dataset, we convert the original text labels into ascending numbers starting from 0.
  - num_train_epochs=3, per_device_train_batch_size=4, gradient_accumulation_steps=16, per_device_eval_batch_size=8, learning_rate=5e-5, warmup_steps=500, weight_decay=0.01, logging_steps=8, fp16=False
- Mental-Alpaca
  - We mostly use the same fine-tuning hyperparameters provided here, with minor changes to accommodate our computing resources.
- Mental-FLAN-T5
  - max_len=1024, target_max_len=128, per_device_train_batch_size=2, per_device_eval_batch_size=1, gradient_accumulation_steps=2, learning_rate=1e-4, num_train_epochs=2
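The settings above can be captured as plain dictionaries, as in the sketch below. It also illustrates the label conversion and the effective batch size implied by gradient accumulation. The helper names and the example label strings are hypothetical, and whether each key is passed to `transformers.TrainingArguments` or to a custom training script is an assumption; only the values come from the list above.

```python
def encode_labels(labels, order):
    """Map text labels to integers 0..K-1, following the given ascending order."""
    mapping = {name: index for index, name in enumerate(order)}
    return [mapping[label] for label in labels], mapping


# Values copied from the hyperparameter list above.
MENTAL_ROBERTA_ARGS = dict(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps=8,
    fp16=False,
)

MENTAL_FLAN_T5_ARGS = dict(
    max_len=1024,
    target_max_len=128,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    num_train_epochs=2,
)


def effective_batch_size(args, num_devices=1):
    """Examples contributing to each optimizer step."""
    return (
        args["per_device_train_batch_size"]
        * args["gradient_accumulation_steps"]
        * num_devices
    )


# Hypothetical label names, used only to illustrate the conversion.
ids, mapping = encode_labels(["mild", "none", "severe"], order=["none", "mild", "severe"])
print(ids)                                        # [1, 0, 2]
print(effective_batch_size(MENTAL_ROBERTA_ARGS))  # 64
print(effective_batch_size(MENTAL_FLAN_T5_ARGS))  # 4
```

The `num_devices` default of 1 is an assumption; the number of GPUs used for training is not stated here.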
@article{xu2023mentalllm,
title={Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data},
author={Xuhai Xu and Bingsheng Yao and Yuanzhe Dong and Saadia Gabriel and Hong Yu and James Hendler and Marzyeh Ghassemi and Anind K. Dey and Dakuo Wang},
year={2023},
eprint={2307.14385},
archivePrefix={arXiv},
primaryClass={cs.HC}
}