This repository contains R scripts for analyzing audio files using the BirdNET classifier (Kahl et al., 2021), validating the results, visualizing classification accuracy, and generating a curated occurrence dataset in the Darwin Core Standard.
analyze_audio_files.R
: Analyzes raw audio files using the BirdNET classifier and formats the output with additional attributes (column headers) for easier data manipulation.

retrieve_sample_audio.R
: Retrieves and samples audio files based on different criteria for manual expert validation.

accuracy_and_plot.R
: Calculates and visualizes BirdNET classification accuracy for each species, following Sethi et al. (2024), based on expert manual validation.

RF_probability_modelling.R
: Trains and evaluates a Random Forest model to predict the probability of correctness for non-validated records (i.e., the likelihood that the BirdNET classification is correct).

create_final_dataset.R
: Produces and exports the final occurrence dataset in Darwin Core Standard.
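To illustrate the kind of per-species summary that the accuracy step produces, here is a minimal base R sketch that computes the share of expert-validated detections confirmed as correct. It is not taken from accuracy_and_plot.R and may differ from the procedure in Sethi et al. (2024); the function name `species_accuracy`, the column names `species` and `is_correct`, and the toy records are all hypothetical.

```r
# Hypothetical helper: proportion of validated detections that the expert
# marked as correct, per species. Column names are assumptions.
species_accuracy <- function(validation) {
  n_validated <- tapply(validation$is_correct, validation$species, length)
  n_correct <- tapply(validation$is_correct, validation$species, sum)
  data.frame(
    species = names(n_validated),
    n_validated = as.integer(n_validated),
    n_correct = as.integer(n_correct),
    accuracy = as.numeric(n_correct / n_validated),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy validation records (illustrative only, not real results).
toy_validation <- data.frame(
  species = c(rep("Parus major", 4), rep("Pica pica", 2)),
  is_correct = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

accuracy_table <- species_accuracy(toy_validation)
print(accuracy_table)
```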
The recommended directory structure for the above scripts to run as expected is as follows:
Project/
├── README.md
├── config.yaml  # Configuration file with paths and parameters
├── script/
└── data/
    ├── species_data/
    │   ├── species_list.txt  # Custom/Global species list: "<scientific name>_<common name>"
    │   └── species_taxon_data.xlsx  # Taxon attributes (e.g., from Artportalen)
    ├── site_metadata.csv  # CSV containing site metadata (e.g., lat, long, site ID, site name)
    ├── raw_audio/  # Directory containing raw WAV audio files
    │   └── Survey-001/  # Survey/Round/Event ID
    │       └── Site-001/  # Site ID
    │           └── 20240421/  # Day/Date
    │               └── 20240421_083800.wav
    ├── BirdNET_raw_results/  # Raw CSV results from the BirdNET analysis
    │   └── Survey-001/
    │       └── Site-001/
    │           └── 20240421/
    │               └── 20240421_083800.BirdNET.results.r.csv
    ├── validation_data/  # Sample audio files for expert manual validation
    │   └── Carduelis_carduelis/  # Each species has a separate folder
    │       ├── Carduelis carduelis.xlsx  # Validation sheet
    │       └── Survey-001_Site-001_20240421_072400_start_sec_18_end_sec_21_confidence_0.8658_Cardueli_carduelis.wav
    └── output/  # Final results generated from processing
        ├── occurrence.txt
        ├── species_accuracy_and_misclassification.txt
        └── species.txt
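The validation clip names in the tree above appear to encode the survey, site, recording timestamp, clip offsets, BirdNET confidence, and species label. As a rough sketch under that assumption, the fields can be recovered with base R. The function `parse_clip_name`, the output column names, and the reading of the date and time parts as the recording start are hypothetical and inferred only from the single example file name.

```r
# Hypothetical parser for validation clip names of the assumed form
# <survey>_<site>_<YYYYMMDD>_<HHMMSS>_start_sec_<s>_end_sec_<e>_confidence_<c>_<species>.wav
parse_clip_name <- function(file_name) {
  file_name <- basename(file_name)  # accept full paths as well
  pattern <- paste0(
    "^([^_]+)_([^_]+)_([0-9]{8})_([0-9]{6})",
    "_start_sec_([0-9.]+)_end_sec_([0-9.]+)",
    "_confidence_([0-9.]+)_(.+)[.]wav$"
  )
  m <- regmatches(file_name, regexec(pattern, file_name))[[1]]
  if (length(m) == 0) {
    stop("File name does not match the assumed pattern: ", file_name)
  }
  data.frame(
    survey_id = m[2],
    site_id = m[3],
    recording_date = as.Date(m[4], format = "%Y%m%d"),
    recording_time = m[5],
    start_sec = as.numeric(m[6]),
    end_sec = as.numeric(m[7]),
    confidence = as.numeric(m[8]),
    species_label = m[9],
    stringsAsFactors = FALSE
  )
}

clip <- parse_clip_name(
  "Survey-001_Site-001_20240421_072400_start_sec_18_end_sec_21_confidence_0.8658_Cardueli_carduelis.wav"
)
print(clip)
```

Anchoring on the literal `_start_sec_`, `_end_sec_`, and `_confidence_` markers keeps the species label intact even though it contains an underscore itself.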
The dataset produced using the above scripts includes 239,597 occurrence records of 61 species from April 21 to June 16, 2024, across 30 sites in central Gothenburg, Sweden.
The dataset is available to download at https://zenodo.org/records/15490818.
Detailed documentation of the dataset, including URI identifiers, definitions, and examples for all attributes (column headers), is available at https://smog-chalmers.github.io/BirdMonitoringGothenburg/.
Depending on the use case, users may want to filter the dataset based on the occurrenceProbability attribute using the following recommended thresholds (see the paper for more details) to:
- Balance sensitivity and specificity (an optimal threshold determined based on Youden's J statistic):
    occurrence %>% filter(occurrenceProbability >= 0.75)
- Maximize specificity (a stricter threshold that minimizes false positives):
    occurrence %>% filter(occurrenceProbability >= 0.83)
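For readers unfamiliar with Youden's J statistic, the sketch below shows one common way to pick a threshold with it: for each candidate threshold, compute J = sensitivity + specificity - 1 and keep the threshold with the largest J. This is a generic base R illustration, not the authors' code; the function `youden_threshold`, its arguments, and the toy probabilities are hypothetical, and the thresholds quoted above come from the paper rather than from this example.

```r
# Hypothetical helper: choose the probability threshold that maximizes
# Youden's J = sensitivity + specificity - 1.
# prob:       predicted probability that a detection is correct
# is_correct: expert verdict for the same detections (logical)
youden_threshold <- function(prob, is_correct) {
  stopifnot(length(prob) == length(is_correct),
            any(is_correct), any(!is_correct))
  candidates <- sort(unique(prob))  # observed probabilities as cut points
  j <- vapply(candidates, function(t) {
    predicted <- prob >= t
    sensitivity <- sum(predicted & is_correct) / sum(is_correct)
    specificity <- sum(!predicted & !is_correct) / sum(!is_correct)
    sensitivity + specificity - 1
  }, numeric(1))
  best <- which.max(j)  # the first maximum wins when there are ties
  list(threshold = candidates[best], j = j[best])
}

# Toy validated records (illustrative only, not real results).
prob <- c(0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.40, 0.30)
is_correct <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

result <- youden_threshold(prob, is_correct)
print(result)
```

Ties are resolved in favour of the lowest threshold here, which leans toward sensitivity; a stricter rule would pick the highest tied threshold instead.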
If you use this dataset in your work, please cite as:
@misc{eldesoky2025birdspecies,
author = {Eldesoky, A. H. and Gil, J. and Kindvall, O. and Stavroulaki, I. and Jonasson, L. and Bennet, D. and Yang, W. and Martínez, A. and Lichter, R. and Petrou, F. and Berghauser Pont, M.},
title = {A bird species occurrence dataset from passive audio recordings across dense urban areas in Gothenburg, Sweden},
year = {2025},
note = {Data set},
publisher = {Zenodo},
doi = {10.5281/zenodo.15490818},
url = {https://zenodo.org/records/15490818}
}
MIT Licence