Active Learning with a Noisy Annotator

This repository contains the official code implementation for the paper:
"Active Learning with a Noisy Annotator" [arXiv].

Overview

Active Learning (AL) is a powerful approach to reduce labeling costs by selecting the most informative examples for annotation. However, in real-world settings, annotators may introduce label noise, which can significantly harm AL performance — especially in the low-budget regime.

This project presents Noise-Aware Active Sampling (NAS) — a novel meta-algorithm that enhances coverage-based AL strategies to make them more robust to noisy labels. NAS integrates a lightweight noise-filtering step into the AL loop to identify and avoid mislabeled regions.

Noise-Aware Active Sampling (NAS): A visual overview of our NAS framework. NAS (illustrated by the dashed orange line) takes as input a query selection strategy 𝒮 and a noise-filtering algorithm 𝒜. It alternates between selecting 𝑏 samples using 𝒮, sending them for annotation, and filtering noisy samples with 𝒜 before proceeding to the next round.
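
The alternation described above can be sketched as a short loop. This is an illustrative sketch only, not the repository's implementation: the function names, the callback signatures, and the choice to pass the set of flagged samples back into the selection strategy are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple


def nas_loop(
    pool: List[int],
    select: Callable[[List[int], Dict[int, int], Set[int], int], List[int]],
    annotate: Callable[[int], int],
    filter_noise: Callable[[Dict[int, int]], Set[int]],
    b: int,
    rounds: int,
) -> Tuple[Dict[int, int], Set[int]]:
    """Alternate between querying b samples and re-running the noise filter."""
    labeled: Dict[int, int] = {}  # pool index -> (possibly noisy) label
    flagged: Set[int] = set()     # indices currently predicted to be mislabeled
    for _ in range(rounds):
        unlabeled = [i for i in pool if i not in labeled]
        if not unlabeled:
            break
        # Assumption: the strategy receives the flagged set so it can discount it.
        for i in select(unlabeled, labeled, flagged, b):
            labeled[i] = annotate(i)
        # Re-run the filter on everything labeled so far before the next round.
        flagged = filter_noise(labeled)
    return labeled, flagged


# Toy stand-ins (all hypothetical): the true label is i % 2, and the annotator
# mislabels every index divisible by 5.
def toy_select(unlabeled, labeled, flagged, b):
    return unlabeled[:b]


def toy_annotate(i):
    return 1 - (i % 2) if i % 5 == 0 else i % 2


def toy_filter(labeled):
    # Oracle filter for illustration; a real filter has no access to true labels.
    return {i for i, y in labeled.items() if y != i % 2}


labels, noisy = nas_loop(list(range(20)), toy_select, toy_annotate, toy_filter, b=4, rounds=3)
print(len(labels), sorted(noisy))  # 12 [0, 5, 10]
```

In this toy run the strategy ignores the flagged set; a coverage-based strategy would presumably use it to avoid counting flagged samples as covering their neighborhood.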

Repository Structure (partial)

.
├── deep-al/
│   ├── configs/                 # YAML config files for different experiments
│   ├── pycls/
│   │   ├── datasets/            # Dataset classes for all datasets supported in this repo
│   │   │   └── utils/
│   │   │       └── features.py  # Mapping for SSL embeddings for each dataset
│   │   ├── al/                  # Active Learning strategies (query selection methods)
│   │   └── lnl/                 # Learning with Noisy Labels (LNL) methods
│   └── tools/
│       └── train_al.py          # Main training + active learning loop
└── README.md                    # This file

Setup

Installation

git clone https://github.com/nettashafir/active_learning_with_label_noise.git
cd active_learning_with_label_noise
pip install -r requirements.txt

Pretrained Features

Typicality-based and representation-based strategies (such as ProbCover, MaxHerding, and their NAS-enhanced versions) require self-supervised learning (SSL) features (e.g., SimCLR, DINOv2). These features should be stored as PyTorch tensors or NumPy arrays of shape (N, d), where N is the number of samples and d is the embedding dimension.

Feature file paths are mapped in: deep-al/pycls/datasets/utils/features.py
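
Before pointing the mapping at a feature file, it can help to confirm that the array really has shape (N, d). The snippet below is a minimal sketch: the helper name check_features, the .npy file format, and the float32 cast are assumptions for illustration, not requirements stated by this repository.

```python
import os
import tempfile

import numpy as np


def check_features(features, num_samples):
    """Validate that `features` is a 2-D array with one row per dataset sample."""
    arr = np.asarray(features)
    if arr.ndim != 2:
        raise ValueError(f"expected shape (N, d), got {arr.shape}")
    if arr.shape[0] != num_samples:
        raise ValueError(f"expected {num_samples} rows, got {arr.shape[0]}")
    return arr.astype(np.float32)  # assumption: float32 embeddings are acceptable


# Round-trip a toy embedding matrix (N=10, d=4) through a temporary .npy file.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 4))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_features.npy")  # hypothetical file name
    np.save(path, toy_embeddings)
    loaded = check_features(np.load(path), num_samples=10)
print(loaded.shape)  # (10, 4)
```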

To train SSL features, you can use popular libraries such as SCAN or solo-learn.

Usage

Run the NPC (ProbCover+NAS) query selection strategy with the LowBudgetAUM noise filter on CIFAR100 with 50% symmetric noise, using a ResNet18:

cd deep-al/tools
# --al             query selection algorithm
# --initial_delta  ProbCover's delta hyperparameter
# --initial_size   size of the initial randomly selected labeled set
# --budget         labeling budget for each round
# --num_episodes   number of active learning rounds
# --lnl            noise filtering algorithm
python train_al.py --cfg ../configs/cifar100/al/RESNET18.yaml \
                   --al npc \
                   --initial_delta 0.65 \
                   --initial_size 0 \
                   --budget 100 \
                   --num_episodes 5 \
                   --noise_type sym \
                   --noise_rate 0.5 \
                   --lnl aum

Or, equivalently, using --samples_per_class in place of --budget and --num_episodes:

python train_al.py --cfg ../configs/cifar100/al/RESNET18.yaml \
                   --al npc \
                   --initial_delta 0.65 \
                   --initial_size 0 \
                   --samples_per_class 1 2 3 4 5 \
                   --noise_type sym \
                   --noise_rate 0.5 \
                   --lnl aum
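
The two commands match only if the values passed to --samples_per_class are read as cumulative per-class counts, which is an inference from the example rather than something documented here. Under that assumption, and with the 100 classes of CIFAR-100, the arithmetic works out as follows (helper names are hypothetical):

```python
def cumulative_budgets(samples_per_class, num_classes):
    """Total number of labels after each round, given cumulative per-class counts."""
    return [k * num_classes for k in samples_per_class]


def per_round_budgets(cumulative):
    """Number of new labels requested in each round."""
    budgets, previous = [], 0
    for total in cumulative:
        budgets.append(total - previous)
        previous = total
    return budgets


totals = cumulative_budgets([1, 2, 3, 4, 5], num_classes=100)
print(totals)                     # [100, 200, 300, 400, 500]
print(per_round_budgets(totals))  # [100, 100, 100, 100, 100]
```

Five rounds of 100 new labels each correspond to --budget 100 with --num_episodes 5.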

Note that by default, the noise filtering algorithm given in --lnl is used both in the internal mechanism of NAS (if used) and to filter noisy samples before training.

Another example: run the MaxHerding+NAS query selection strategy with the CrossValidation noise filter on Clothing1M (a real-world noisy dataset with ~38% noise), using linear probing over SSL features and training the model on all labeled samples rather than only those predicted to be clean:

cd deep-al/tools
python train_al.py --cfg ../configs/clothing1m/al/RESNET18.yaml \
                   --al maxherding_nas \
                   --initial_size 0 \
                   --samples_per_class 1 2 3 4 5 \
                   --lnl cv \
                   --use_linear_model \
                   --train_on_all_labeled_data

You can control most settings either via the config files or by passing them as arguments to train_al.py.
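
For readers unfamiliar with the term, symmetric label noise (as in --noise_type sym with --noise_rate 0.5 above) usually means that each label is replaced, with probability equal to the noise rate, by a class drawn uniformly at random. The sketch below draws the replacement from the other classes only; conventions differ on whether the true class may be redrawn, and this is not necessarily how train_al.py injects noise. The function name and seed handling are illustrative assumptions.

```python
import random


def add_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a uniformly chosen different class with prob. noise_rate."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Draw from the num_classes - 1 other classes, skipping the true one.
            other = rng.randrange(num_classes - 1)
            noisy.append(other if other < y else other + 1)
        else:
            noisy.append(y)
    return noisy


clean = [i % 100 for i in range(10_000)]
noisy_labels = add_symmetric_noise(clean, num_classes=100, noise_rate=0.5)
flipped = sum(c != n for c, n in zip(clean, noisy_labels)) / len(clean)
print(f"fraction of changed labels: {flipped:.2f}")
```

With this convention, the fraction of changed labels should land close to the requested noise rate.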

Citation

If you use this code, please cite our work:

@article{shafir2025active,
  title={Active Learning with a Noisy Annotator},
  author={Shafir, Netta and Hacohen, Guy and Weinshall, Daphna},
  journal={arXiv preprint arXiv:2504.04506},
  year={2025}
}

Acknowledgments

This project builds on Deep-AL.
We also use official implementations from:

Contact

Feel free to open an issue or reach out to netta.shafir@mail.huji.ac.il with any questions!
