mislabeled

Model-probing detection of mislabeled examples in machine learning datasets

A ModelProbingDetector assigns trust_scores to training examples $(x, y)$ from a dataset by probing an ensemble of machine learning models.
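To make the idea concrete, here is a minimal, dependency-free sketch of model probing. It is not the library's implementation: the names `confidence_probe` and `toy_trust_scores`, as well as the two toy models, are hypothetical and chosen only for illustration. Each "model" maps an input to class probabilities, the probe returns the probability given to the example's label, and the per-model scores are averaged, so a low score marks an example whose label the ensemble does not support.

```python
from statistics import mean


def confidence_probe(model, X, y):
    """Hypothetical probe: probability the model gives to each example's label."""
    return [model(x)[label] for x, label in zip(X, y)]


def toy_trust_scores(models, probe, X, y, aggregate=mean):
    """Probe every model of the ensemble, then aggregate the scores per example."""
    per_model = [probe(m, X, y) for m in models]  # n_models rows, n_examples columns
    return [aggregate(scores) for scores in zip(*per_model)]


# Two toy binary "models": each returns [P(class 0), P(class 1)] for a scalar x.
def model_a(x):
    p1 = 0.9 if x > 0 else 0.1
    return [1 - p1, p1]


def model_b(x):
    p1 = 0.8 if x > 0 else 0.2
    return [1 - p1, p1]


X = [-1.0, 2.0, 3.0]
y = [0, 1, 0]  # the last label contradicts both toy models

scores = toy_trust_scores([model_a, model_b], confidence_probe, X, y)
least_trusted = scores.index(min(scores))  # index of the most suspicious example
```

Swapping `aggregate` for another function (a variance, a minimum) changes the detector without touching the probe, which is the design the library's probe and ensemble combinations rely on.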

Install

pip install git+https://github.com/orange-opensource/mislabeled

Find suspicious digits in MNIST

1. Train an MLP on MNIST

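from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

X, y = fetch_openml("mnist_784", return_X_y=True, as_frame=False)
y = LabelEncoder().fit_transform(y)
mlp = make_pipeline(MinMaxScaler(), MLPClassifier())
mlp.fit(X, y)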

2. Compute Representer values of the MLP

probe = Representer()
representer_values = probe(mlp, X, y)

3. Inspect your training data

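import matplotlib.pyplot as plt
import numpy as np

suspicious = np.argsort(-representer_values)[:top_k]  # top_k: number of examples to inspect
for i in suspicious:
    plt.imshow(X[i].reshape(28, 28))
    plt.show()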

4. Want the variance of the Representer values during training?

detector = ModelProbingDetector(mlp, Representer(), ProgressiveEnsemble(), "var")
var_representer_values = detector.trust_scores(X, y)
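The "var" aggregation can be pictured as follows. This sketch assumes the probe values were already recorded at a few training checkpoints (the numbers are made up for illustration) and uses only the standard library. Whether the library computes a population or a sample variance is not stated here, so the use of `pvariance` is an assumption.

```python
from statistics import pvariance

# Hypothetical probe values: one row per training checkpoint, one column per example.
probe_values = [
    [0.50, 0.10, 0.90],  # checkpoint 1
    [0.52, 0.80, 0.88],  # checkpoint 2
    [0.51, 0.20, 0.91],  # checkpoint 3
]

# Transpose to get each example's trajectory, then take its variance.
variances = [pvariance(trajectory) for trajectory in zip(*probe_values)]

# Index of the example whose probe value fluctuates most during training.
most_unstable = max(range(len(variances)), key=variances.__getitem__)
```

Here the second example stands out: its probe value swings widely between checkpoints while the other two stay nearly constant.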

Predefined detectors

| Detector | Paper | Code (`from mislabeled.detect.detectors`) |
| --- | --- | --- |
| Area Under the Margin (AUM) | NeurIPS 2020 | `import AreaUnderMargin` |
| Influence | Paper 1974 | `import SelfInfluenceDetector` |
| Cook's Distance | Paper 1977 | `import CookDistanceDetector` |
| Approximate Leave-One-Out | Paper 1981 | `import ApproximateLOODetector` |
| Representer | Paper 1972 | `import RepresenterDetector` |
| TracIn | NeurIPS 2020 | `import TracIn` |
| Forget Scores | ICLR 2019 | `import ForgetScores` |
| VoG | CVPR 2022 | `import FiniteDiffVoG, FiniteDiffVoLG, VoLG` |
| Small Loss | ICML 2018 | `import SmallLoss` |
| CleanLab | JAIR 2021 | `import ConfidentLearning` |
| Consensus (C-Scores) | Applied Intelligence 2011 | `import ConsensusConsistency` |
| AGRA | ECML 2023 | `import AGRA` |

Many other detectors can be built by combining ModelProbingDetector with any probe and any ensemble from the library.
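As an illustration of what two of the listed detectors measure, the sketch below computes an Area Under the Margin value and a count of forgetting events from training histories, following the usual definitions from the AUM and Forget Scores papers. It is a plain-Python approximation, not the library's code: the function names and the logits are hypothetical.

```python
def margin(logits, label):
    """Assigned-class logit minus the largest logit among the other classes."""
    others = [z for k, z in enumerate(logits) if k != label]
    return logits[label] - max(others)


def area_under_margin(logits_per_epoch, label):
    """Average margin of one example over the recorded epochs."""
    margins = [margin(logits, label) for logits in logits_per_epoch]
    return sum(margins) / len(margins)


def forgetting_events(correct_per_epoch):
    """Count epochs where an example goes from correctly to incorrectly classified."""
    pairs = zip(correct_per_epoch, correct_per_epoch[1:])
    return sum(1 for before, after in pairs if before and not after)


# Made-up logits of one 3-class example over three epochs.
history = [[2.0, 1.0, 0.0], [3.0, 0.5, 0.0], [4.0, 0.0, -1.0]]
aum_clean = area_under_margin(history, label=0)    # label agrees with the model
aum_suspect = area_under_margin(history, label=1)  # label the model never favors
forgets = forgetting_events([True, False, True, True, False])
```

A strongly negative average margin, or a high number of forgetting events, is the kind of signal these detectors turn into a low trust score.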

Most of these detectors work for both regression and classification diagnostics.

Tutorials

For more details and examples, check the notebooks!

Paper

If you use this library in a research project, please consider citing the corresponding paper with the following BibTeX entry:

@article{george2024mislabeled,
  title={Mislabeled examples detection viewed as probing machine learning models: concepts, survey and extensive benchmark},
  author={Thomas George and Pierre Nodet and Alexis Bondu and Vincent Lemaire},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=3YlOr7BHkx},
  note={}
}

Development

Formatting and linting are done with ruff, run as a pre-commit hook:

install: pre-commit install
format and lint: pre-commit run --all-files (done automatically before each commit)

Run the tests with uv: uv run pytest
