This is the official repository of the paper: "DiffProb: Data Pruning for Face Recognition" (accepted at the LFA Workshop, FG 2025)
Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to training computational cost and data storage, as well as potential privacy concerns regarding managing large face datasets. This paper presents DiffProb, the first data pruning approach for face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes the ones with identical or close prediction probability values, as they are likely reinforcing the same decision boundaries and thus contribute little new information. We further enhance this process with an auxiliary cleaning mechanism to eliminate mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or even, in some settings, improving the verification accuracies. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
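To make the pruning rule concrete, below is a minimal sketch of the core idea, assuming the prediction probability of the ground-truth class is already available for every training sample. All names and the gap threshold `eps` are illustrative assumptions, not the repository's API; `eval_simprobs.py` contains the actual implementation.

```python
# Hedged sketch of the DiffProb rule: within each identity, samples whose
# ground-truth prediction probabilities are (nearly) identical are assumed
# to reinforce the same decision boundary, so only one of them is kept.
from collections import defaultdict

def diffprob_prune(sample_ids, labels, probs, eps=1e-3):
    """Return the indexes of the samples to keep."""
    by_identity = defaultdict(list)
    for sid, label, p in zip(sample_ids, labels, probs):
        by_identity[label].append((p, sid))

    kept = []
    for items in by_identity.values():
        items.sort()  # order the samples of this identity by probability
        last_kept = None
        for p, sid in items:
            # Keep a sample only if its probability differs from the last
            # kept one by more than eps; otherwise prune it as redundant.
            if last_kept is None or p - last_kept > eps:
                kept.append(sid)
                last_kept = p
    return kept
```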
You can request access to the files containing the indexes of the kept samples for each pruning strategy applied in this work here.
You can download the CASIA-WebFace dataset here.
## DynUnc Pruning
1. Run `train_everything.py` to train the original model (set `config.is_original_train=True` in `config/config.py`), whose predictions will be used to perform the pruning (in the paper, ResNet-50 + CosFace loss). This script automatically generates the files necessary to perform DynUnc pruning.
2. Run `coreset_dynunc.py` to generate the kept sample list for the selected pruning percentage (see the scoring sketch after this list).
3. Run `label_mapping.py` if you want to confirm that the number of ids has not been altered (this step is not mandatory).
4. Run `train_everything.py` under the desired settings.
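For orientation, here is a hedged sketch of a dynamic-uncertainty score in the spirit of step 2, assuming the per-epoch ground-truth probabilities saved in step 1 can be loaded as an array of shape `(num_epochs, num_samples)`. The window size and the keep-most-uncertain rule are assumptions; `coreset_dynunc.py` is the authoritative implementation.

```python
import numpy as np

def dynunc_scores(probs: np.ndarray, window: int = 10) -> np.ndarray:
    """Average the standard deviation of the ground-truth probability over
    sliding windows of epochs; assumes num_epochs >= window."""
    num_epochs, _ = probs.shape
    stds = [probs[s:s + window].std(axis=0)
            for s in range(num_epochs - window + 1)]
    return np.mean(stds, axis=0)

# Keeping e.g. 50% of the samples, preferring the most uncertain ones:
# scores = dynunc_scores(probs)
# kept_indexes = np.argsort(scores)[-int(0.5 * len(scores)):]
```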
## Rand Pruning
Note: Rand can be applied before performing step 1, as it does not rely on the pre-trained model's predictions.
2. Run `coreset_rand.py` to generate the kept sample list for the selected pruning percentage (see the sketch after this list).
3. Run `label_mapping.py` if you want to confirm that the number of ids has not been altered (this step is not mandatory).
4. Run `train_everything.py` under the desired settings.
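A minimal sketch of what a Rand selection could look like, assuming pruning is applied per identity so that no identity disappears entirely; whether `coreset_rand.py` samples per identity or globally should be checked against the script itself.

```python
import random
from collections import defaultdict

def rand_prune(sample_ids, labels, keep_ratio=0.5, seed=42):
    """Randomly keep `keep_ratio` of the samples of each identity."""
    rng = random.Random(seed)  # fixed seed for a reproducible kept list
    by_identity = defaultdict(list)
    for sid, label in zip(sample_ids, labels):
        by_identity[label].append(sid)

    kept = []
    for ids in by_identity.values():
        rng.shuffle(ids)
        kept.extend(ids[:max(1, int(len(ids) * keep_ratio))])  # >= 1 per id
    return kept
```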
## DiffProb Pruning
1. Run `eval_trainset.py` to generate, for each training sample, the prediction probability that the pre-trained FR model assigns to its ground-truth class (see the sketch below).
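The sketch below illustrates what this step could produce, assuming a data loader that yields `(images, labels, idx)` batches and a model that outputs per-class logits; both are placeholder assumptions, and `eval_trainset.py` is the authoritative implementation.

```python
import torch

@torch.no_grad()
def ground_truth_probs(model, loader, device="cuda"):
    """Record the softmax probability the pre-trained model assigns to
    each training sample's ground-truth class."""
    model.eval()
    all_idx, all_probs = [], []
    for images, labels, idx in loader:
        logits = model(images.to(device))  # assumed to output class logits
        p = torch.softmax(logits, dim=1).cpu()
        all_probs.append(p[torch.arange(len(labels)), labels])
        all_idx.append(idx)
    return torch.cat(all_idx), torch.cat(all_probs)
```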
### Without Cleaning
2. Run `eval_simprobs.py` to generate the kept sample list for the selected pruning percentage (see the threshold sketch after this list).
3. Run `label_mapping.py` if you want to confirm that the number of ids has not been altered (this step is not mandatory).
4. Run `train_everything.py` under the desired settings.
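Since the pruning percentage is selected by the user while the rule in the sketch near the top of this README is threshold-based, one simple way (an assumption, not necessarily the repository's approach) to hit a target ratio is to bisect the threshold:

```python
def eps_for_ratio(sample_ids, labels, probs, keep_ratio, tol=1e-3):
    """Bisect the gap threshold `eps` of diffprob_prune (sketched above)
    until the kept fraction matches the requested ratio."""
    lo, hi = 0.0, 1.0  # probabilities live in [0, 1], so gaps do too
    for _ in range(50):
        eps = (lo + hi) / 2
        frac = len(diffprob_prune(sample_ids, labels, probs, eps)) / len(sample_ids)
        if abs(frac - keep_ratio) < tol:
            break
        if frac > keep_ratio:
            lo = eps  # too many samples kept: prune more aggressively
        else:
            hi = eps  # too few samples kept: relax the threshold
    return eps
```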
### With Cleaning
2. Run `clean_trainset.py` to apply our auxiliary cleaning mechanism and generate the kept sample list for the selected pruning percentage.
3. Run `generate_label_dict.py` to generate a dictionary associating each identity (class label) with the indexes of its samples.
4. Run `label_mapping.py` to confirm the new number of ids and to generate a label map, as some identities might be eliminated entirely by the cleaning step (this step is mandatory; see the re-mapping sketch after this list).
5. Run `train_everything.py` under the desired settings.
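A minimal sketch of the label re-mapping this step performs, assuming identity labels are plain integers; the on-disk format of the map is an assumption, and `label_mapping.py` is the authoritative implementation.

```python
def build_label_map(kept_labels):
    """Map the surviving identity labels to a contiguous range 0..N-1,
    which is required when whole identities were removed by cleaning."""
    label_map = {old: new for new, old in enumerate(sorted(set(kept_labels)))}
    return label_map

# Usage: remapped_labels = [label_map[l] for l in kept_labels]
```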
## Evaluation
Run `eval_ijbc.py` to perform the IJB-C evaluation.
## Citation
If you use any of the code, pruned datasets, or models provided in this repository, please cite the following paper:
```bibtex
@misc{caldeira2025diffprobdatapruningface,
  title={DiffProb: Data Pruning for Face Recognition},
  author={Eduarda Caldeira and Jan Niklas Kolf and Naser Damer and Fadi Boutros},
  year={2025},
  eprint={2505.15272},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.15272},
}
```
## License
This project is licensed under the terms of the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

Copyright (c) 2025 Fraunhofer Institute for Computer Graphics Research IGD Darmstadt