Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
Abstract
Recordings gathered with child-worn devices promised to revolutionize both fundamental and applied speech sciences by allowing the effortless capture of children's naturalistic speech environment and language production. This promise hinges on speech technologies that can transform the sheer mounds of data thus collected into usable information. This paper demonstrates several obstacles blocking progress by summarizing three years' worth of experiments aimed at improving one fundamental task: Voice Type Classification. Our experiments suggest that improvements in representation features, architecture, and parameter search contribute to only marginal gains in performance. More progress is made by focusing on data relevance and quantity, which highlights the importance of collecting data with appropriate permissions to allow sharing.
This repo contains the scripts needed to train the Whisper-VTC models, perform inference on a set of audio files, and evaluate the models against ground-truth annotations.
Ensure that you have uv installed on your system.
Clone the repo and setup dependencies:
git clone git@github.com:LAAC-LSCP/VTC-IS-25.git
cd VTC-IS-25
uv sync
The audio files for inference simply need to be placed in a single directory; the inference script will load them automatically.
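For example, a flat layout like the following is all that is needed (the file names are illustrative; the directory name matches the --wavs argument used below):
audios/
├── recording_01.wav
└── recording_02.wav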
Before anything, you'll need to download the weights of the pre-trained Whisper small model using the save_load_whisper.py script.
uv run scripts/save_load_whisper.py --model small
Inference is done from a model checkpoint, together with the corresponding config file used for training and the directory of audio files to run inference on.
uv run scripts/infer.py \
--config model/config.yml \
--wavs audios \
--checkpoint model/best.ckpt \
--output predictions
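The predictions are written to the output folder as RTTM files, the format consumed by merge_segments.py below. As a rough sketch, assuming the label set of the original Voice Type Classifier (e.g. KCHI for the key child) and an illustrative file name, a predicted line looks like:
SPEAKER recording_01 1 12.34 1.50 <NA> <NA> KCHI <NA> <NA>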
Simply specify the input folder and the output folder. For more fine-grained tuning, use the min-duration-on-s and min-duration-off-s parameters, which follow the pyannote.audio convention: respectively, the minimum duration a speech segment must have to be kept, and the minimum duration of a silence between segments (shorter silences are merged away). An example with these flags follows the basic command below.
uv run scripts/merge_segments.py \
--folder rttm_folder \
--output rttm_merged
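For instance, assuming the two thresholds are passed as command-line flags of the same name (the 0.1 s values below are purely illustrative, not recommended defaults):
uv run scripts/merge_segments.py \
--folder rttm_folder \
--output rttm_merged \
--min-duration-on-s 0.1 \
--min-duration-off-s 0.1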
To perform inference and speech segment merging in a single step (see merge_segments.py for help, or the corresponding pyannote.audio description), a single bash script is provided.
Simply set the correct variables in the script and run it:
sh scripts/run.sh
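The variables to set mirror the arguments of infer.py and merge_segments.py. As a hypothetical sketch (the names below are illustrative; check scripts/run.sh for the actual variable names):
CONFIG=model/config.yml
CHECKPOINT=model/best.ckpt
WAVS=audios
OUTPUT=predictions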
@inproceedings{kunze25_interspeech,
title = {{Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier}},
author = {Tarek Kunze and Marianne Métais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
year = {2025},
booktitle = {{Interspeech 2025}},
pages = {2845--2849},
doi = {10.21437/Interspeech.2025-1962},
issn = {2958-1796},
}
This work uses the segma library, which is heavily inspired by pyannote.audio.
The first version of the Voice Type Classifier is available here.