mislabeled

Model-probing detection of mislabeled examples in machine learning datasets

A ModelProbingDetector assigns trust_scores to training examples $(x, y)$ from a dataset by probing an Ensemble of machine learning models.
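One compact way to read this sentence is as an aggregation of probe values over the members of the ensemble. The notation below is an assumption made for illustration (the symbols $t$, $p$, $\mathcal{E}$ and $\operatorname{agg}$ do not come from the library): $p(m, x, y)$ is the value a probe returns for model $m$ on example $(x, y)$, and $\operatorname{agg}$ is a summary such as a mean or a variance.

```latex
t(x, y) \;=\; \operatorname{agg}_{m \in \mathcal{E}} \; p(m, x, y)
```

With a single model and no aggregation, this reduces to calling the probe directly on the fitted model.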

Install

pip install git+https://github.com/orange-opensource/mislabeled

Find suspicious digits in MNIST

1. Train an MLP on MNIST

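from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
X, y = fetch_openml("mnist_784", return_X_y=True, as_frame=False)
y = LabelEncoder().fit_transform(y)
mlp = make_pipeline(MinMaxScaler(), MLPClassifier())
mlp.fit(X, y)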

2. Compute Representer values of the MLP

probe = Representer()
representer_values = probe(mlp, X, y)

3. Inspect your training data

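import matplotlib.pyplot as plt
import numpy as np
top_k = 10  # number of examples to inspect; pick any value
suspicious = np.argsort(-representer_values)[:top_k]
for i in suspicious:
  plt.imshow(X[i].reshape(28, 28))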
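The ranking in this step negates the scores before calling `np.argsort`, so the examples with the largest values come first. A minimal sketch on made-up scores shows the ordering; the helper name `top_k_indices` and the toy array are assumptions for the example, not part of the library:

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, largest first."""
    # argsort sorts ascending, so negating the scores puts the largest first
    return np.argsort(-scores)[:k]


# Toy scores standing in for representer_values
scores = np.array([0.1, 2.5, -0.3, 1.7, 0.9])
print(top_k_indices(scores, 2))  # prints [1 3]
```

Whether large or small values are the suspicious ones depends on the probe, so it is worth checking a few examples at both ends of the ranking.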

4. Want the variance of the Representer values during training?

detector = ModelProbingDetector(mlp, Representer(), ProgressiveEnsemble(), "var")
var_representer_values = detector.trust_scores(X, y)
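As a rough illustration of the idea behind step 4, the sketch below trains a tiny logistic regression with plain NumPy, records a per-example probe value at every epoch, and takes the variance across epochs. It is not the library's implementation: the per-example log-loss stands in for the Representer probe, and the function name `loss_variance_during_training`, the toy data, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np


def loss_variance_during_training(X, y, n_epochs=50, lr=0.5):
    """Train logistic regression by full-batch gradient descent and return,
    for each example, the variance of its log-loss across epochs."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    eps = 1e-12  # avoids log(0)
    history = np.empty((n_epochs, n_samples))
    for epoch in range(n_epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Probe value of the current snapshot: per-example log-loss
        history[epoch] = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        # Gradient step on the mean log-loss
        grad = p - y
        w -= lr * (X.T @ grad) / n_samples
        b -= lr * grad.mean()
    # Aggregate the snapshots with a variance, one value per example
    return history.var(axis=0)


# Toy data: two Gaussian blobs with a handful of flipped labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
y_noisy = y.copy()
flipped = np.array([0, 1, 2, 100, 101])
y_noisy[flipped] = 1 - y_noisy[flipped]

variances = loss_variance_during_training(X, y_noisy)
```

Comparing `variances[flipped]` with the rest of the array is one way to see how the flipped examples behave differently during training.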

Predefined detectors

| Detector | Paper | Code (from mislabeled.detect.detectors) |
| --- | --- | --- |
| Area Under the Margin (AUM) | NeurIPS 2020 | import AreaUnderMargin |
| Influence | Paper 1974 | import SelfInfluenceDetector |
| Cook's Distance | Paper 1977 | import CookDistanceDetector |
| Approximate Leave-One-Out | Paper 1981 | import ApproximateLOODetector |
| Representer | Paper 1972 | import RepresenterDetector |
| TracIn | NeurIPS 2020 | import TracIn |
| Forget Scores | ICLR 2019 | import ForgetScores |
| VoG | CVPR 2022 | import FiniteDiffVoG, FiniteDiffVoLG, VoLG |
| Small Loss | ICML 2018 | import SmallLoss |
| CleanLab | JAIR 2021 | import ConfidentLearning |
| Consensus (C-Scores) | Applied Intelligence 2011 | import ConsensusConsistency |
| AGRA | ECML 2023 | import AGRA |

Many other combinations are possible by using ModelProbingDetector with any probe and Ensemble from the library.

Most of these detectors work for both regression and classification diagnostics.

Tutorials

For more details and examples, check the notebooks!

Paper

If you use this library in a research project, please consider citing the corresponding paper with the following bibtex entry:

@article{george2024mislabeled,
  title={Mislabeled examples detection viewed as probing machine learning models: concepts, survey and extensive benchmark},
  author={Thomas George and Pierre Nodet and Alexis Bondu and Vincent Lemaire},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=3YlOr7BHkx},
  note={}
}

Development

Formatting and linting are done with ruff through a pre-commit hook:

  • install: pre-commit install,
  • format and lint: pre-commit run --all-files (automatically done before a commit).

Run tests with uv: uv run pytest.
