Commit cdcc082

Merge branch 'develop' into feature/dul-extensions
2 parents c782e5d + 8d560e6 commit cdcc082

147 files changed, with 7066 additions and 3149 deletions

.test_durations

Lines changed: 6 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -1521,8 +1521,8 @@
15211521
"tests/valuation/methods/test_semivalues.py::test_coefficients[BetaShapleyValuation-kwargs1-10]": 0.0016590010000072652,
15221522
"tests/valuation/methods/test_semivalues.py::test_coefficients[BetaShapleyValuation-kwargs2-100]": 0.0022294990000091275,
15231523
"tests/valuation/methods/test_semivalues.py::test_coefficients[BetaShapleyValuation-kwargs2-10]": 0.003863207999984297,
1524-
"tests/valuation/methods/test_semivalues.py::test_coefficients[DataBanzhafValuation-kwargs3-100]": 0.001800666000065121,
1525-
"tests/valuation/methods/test_semivalues.py::test_coefficients[DataBanzhafValuation-kwargs3-10]": 0.0016530420000435697,
1524+
"tests/valuation/methods/test_semivalues.py::test_coefficients[BanzhafValuation-kwargs3-100]": 0.001800666000065121,
1525+
"tests/valuation/methods/test_semivalues.py::test_coefficients[BanzhafValuation-kwargs3-10]": 0.0016530420000435697,
15261526
"tests/valuation/methods/test_semivalues.py::test_coefficients[ShapleyValuation-kwargs4-100]": 0.0018769589999578784,
15271527
"tests/valuation/methods/test_semivalues.py::test_coefficients[ShapleyValuation-kwargs4-10]": 0.0016063749999375432,
15281528
"tests/valuation/methods/test_semivalues.py::test_msr_banzhaf[5]": 9.342398666999998,
@@ -1636,10 +1636,10 @@
16361636
"tests/valuation/scorers/test_classwise.py::test_classwise_scorer[test_data2-expected_scores2]": 0.0025690839999974457,
16371637
"tests/valuation/scorers/test_scorers.py::test_compose_score": 0.0019082069999996065,
16381638
"tests/valuation/scorers/test_scorers.py::test_scorer": 0.001976999999998341,
1639-
"tests/valuation/test_interface.py::test_data_banzhaf_valuation[1]": 0.0836418330000015,
1640-
"tests/valuation/test_interface.py::test_data_banzhaf_valuation[2]": 1.2780167490000025,
1641-
"tests/valuation/test_interface.py::test_data_beta_shapley_valuation[1]": 4.139234666999997,
1642-
"tests/valuation/test_interface.py::test_data_beta_shapley_valuation[2]": 3.603092916999998,
1639+
"tests/valuation/test_interface.py::test_banzhaf_valuation[1]": 0.0836418330000015,
1640+
"tests/valuation/test_interface.py::test_banzhaf_valuation[2]": 1.2780167490000025,
1641+
"tests/valuation/test_interface.py::test_beta_shapley_valuation[1]": 4.139234666999997,
1642+
"tests/valuation/test_interface.py::test_beta_shapley_valuation[2]": 3.603092916999998,
16431643
"tests/valuation/test_interface.py::test_shapley_valuation[1]": 0.27120083299999465,
16441644
"tests/valuation/test_interface.py::test_shapley_valuation[2]": 0.15037520699999618,
16451645
"tests/valuation/test_interface.py::test_data_utility_learning[1]": 0.026216332999993597,
@@ -1781,10 +1781,6 @@
17811781
"tests/value/shapley/test_montecarlo.py::test_linear_montecarlo_with_outlier[owen-kwargs1-scorer0-0.2-2-0-21]": 6.573138832000012,
17821782
"tests/value/shapley/test_montecarlo.py::test_linear_montecarlo_with_outlier[owen_antithetic-kwargs2-scorer0-0.2-2-0-21]": 10.124256999999972,
17831783
"tests/value/shapley/test_montecarlo.py::test_linear_montecarlo_with_outlier[permutation_montecarlo-kwargs0-scorer0-0.2-2-0-21]": 2.7115268339999545,
1784-
"tests/value/shapley/test_montecarlo.py::test_montecarlo_shapley_housing_dataset[12-3-12-combinatorial_montecarlo-kwargs0]": 0.16786966001382098,
1785-
"tests/value/shapley/test_montecarlo.py::test_montecarlo_shapley_housing_dataset[12-3-12-owen-kwargs1]": 17.011920137971174,
1786-
"tests/value/shapley/test_montecarlo.py::test_montecarlo_shapley_housing_dataset[12-3-12-owen_antithetic-kwargs2]": 35.88025256394758,
1787-
"tests/value/shapley/test_montecarlo.py::test_montecarlo_shapley_housing_dataset[12-3-4-group_testing-kwargs3]": 0.25901710899779573,
17881784
"tests/value/shapley/test_montecarlo.py::test_seed[combinatorial_montecarlo-kwargs0-test_game0]": 0.04085670800000685,
17891785
"tests/value/shapley/test_montecarlo.py::test_seed[group_testing-kwargs3-test_game0]": 0.23488145900003587,
17901786
"tests/value/shapley/test_montecarlo.py::test_seed[owen-kwargs1-test_game0]": 0.30296191700003305,

CHANGELOG.md

Lines changed: 10 additions & 1 deletion
@@ -5,6 +5,10 @@
55

66
### Added
77

8+
- Simple memory monitor / reporting
9+
[PR #663](https://github.com/aai-institute/pyDVL/pull/663)
10+
- New stopping criterion `MaxSamples`
11+
[PR #661](https://github.com/aai-institute/pyDVL/pull/661)
812
- Introduced `UtilityModel` and two implementations `IndicatorUtilityModel`
913
and `DeepSetsUtilityModel` for data utility learning
1014
[PR #650](https://github.com/aai-institute/pyDVL/pull/650)
@@ -56,8 +60,10 @@
5660

5761
### Fixed
5862

59-
- Fixed `show_warnings=False` not being respected in subprocesses
63+
- Fixed `show_warnings=False` not being respected in subprocesses. Introduced
64+
`suppress_warnings` decorator for more flexibility
6065
[PR #647](https://github.com/aai-institute/pyDVL/pull/647)
66+
[PR #662](https://github.com/aai-institute/pyDVL/pull/662)
6167
- Fixed several bugs in diverse stopping criteria, including: iteration counts,
6268
computing completion, resetting, nested composition
6369
[PR #641](https://github.com/aai-institute/pyDVL/pull/641)
@@ -83,6 +89,9 @@
8389

8490
### Changed
8591

92+
- Slicing, comparing and setting of `ValuationResult` behave in a more
93+
natural way
94+
[PR #660](https://github.com/aai-institute/pyDVL/pull/660)
8695
- Switched all semi-value coefficients and sampler weights to log-space in
8796
order to avoid overflows
8897
[PR #643](https://github.com/aai-institute/pyDVL/pull/643)
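
The log-space entry above can be illustrated with a minimal sketch. It is not pyDVL's implementation: the function name `log_shapley_coefficient` is hypothetical, and the formula assumed is the standard Shapley weight 1 / (n · C(n-1, k)) for a subset of size k out of n points. The point is that the binomial term is never materialised as a float, so its logarithm stays finite even when the term itself would overflow.

```python
import math


def log_shapley_coefficient(n: int, k: int) -> float:
    """Log of 1 / (n * C(n-1, k)), computed without forming the binomial.

    Hypothetical helper for illustration only; it uses the identity
    log C(n-1, k) = lgamma(n) - lgamma(k+1) - lgamma(n-k).
    """
    if not 0 <= k <= n - 1:
        raise ValueError("k must satisfy 0 <= k <= n - 1")
    log_binom = math.lgamma(n) - math.lgamma(k + 1) - math.lgamma(n - k)
    return -math.log(n) - log_binom


# For small n the log-space value agrees with the direct formula.
direct = 1.0 / (4 * math.comb(3, 1))
assert math.isclose(math.exp(log_shapley_coefficient(4, 1)), direct)

# For large n the binomial no longer fits in a float, but its log does.
large = log_shapley_coefficient(2000, 1000)
assert math.isfinite(large)
```

Working with logarithms means that products of coefficients and sampler weights become sums, which is what keeps intermediate values in range.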

CONTRIBUTING.md

Lines changed: 32 additions & 5 deletions
@@ -15,10 +15,10 @@ If you are interested in setting up a similar project, consider the template
1515

1616
## Local development
1717

18-
This project uses [black](https://github.com/psf/black) to format code and
18+
This project uses [ruff](https://github.com/astral-sh/ruff) to lint and format code and
1919
[pre-commit](https://pre-commit.com/) to invoke it as a git pre-commit hook.
20-
Consider installing any of [black's IDE
21-
integrations](https://black.readthedocs.io/en/stable/integrations/editors.html)
20+
Consider installing any of [ruff's IDE
21+
integrations](https://docs.astral.sh/ruff/editors/setup/)
2222
to make your life easier.
2323

2424
Run the following to set up the pre-commit git hook to run before pushes:
@@ -83,7 +83,7 @@ If you use remote execution, don't forget to exclude data paths from deployment
8383
## Testing
8484

8585
Automated builds, tests, generation of documentation and publishing are handled
86-
by [CI pipelines](#CI). Before pushing your changes to the remote we recommend
86+
by [CI pipelines](#ci). Before pushing your changes to the remote we recommend
8787
to execute `tox` locally in order to detect mistakes early on and to avoid
8888
failing pipelines. tox will:
8989
* run the test suite
@@ -297,6 +297,33 @@ the environment variable `DYLD_FALLBACK_LIBRARY_PATH`:
297297
export DYLD_FALLBACK_LIBRARY_PATH=$DYLD_FALLBACK_LIBRARY_PATH:/opt/homebrew/lib
298298
```
299299

300+
### Automatic API documentation
301+
302+
We use [mkdocstrings](https://mkdocstrings.github.io/) to automatically generate
303+
API documentation from docstrings, following almost verbatim [this
304+
recipe](https://mkdocstrings.github.io/recipes/#automatic-code-reference-pages):
305+
Stubs are generated for all modules on the fly using
306+
[generate_api_docs.py](https://github.com/aai-institute/pyDVL/blob/develop/build_scripts/generate_api_docs.py) thanks to the plugin
307+
[mkdocstrings-gen-files](https://github.com/oprypin/mkdocs-gen-files) and
308+
navigation is generated for
309+
[mkdocs-literate-nav](https://github.com/oprypin/mkdocs-literate-nav).
310+
311+
With some renaming and using
312+
[section-index](https://github.com/oprypin/mkdocs-section-index) `__init__.py`
313+
files are used as entry points for the documentation of a module.
314+
315+
Since very often we re-export symbols in the `__init__.py` files, the automatic
316+
generation of the documentation skips **all** symbols in those files. If you
317+
want to document any in particular you can do so by **overriding
318+
mkdocs_genfiles**: Create a file under `docs/api/pydvl/module/index.md` and add
319+
your documentation there. For example, to document the whole module and every
320+
(re-)exported symbol just add this to the file:
321+
322+
```markdown
323+
::: pydvl.module
324+
```
325+
326+
300327
### Adding new pages
301328

302329
Navigation is configured in `mkdocs.yaml` using the nav section. We use the
@@ -441,7 +468,7 @@ use braces for legibility like in the first example.
441468
### Abbreviations
442469

443470
We keep the abbreviations used in the documentation inside the
444-
[docs_include/abbreviations.md](https://github.com/aai-institute/pyDVL/blob/develop/docs_includes%2Fabbreviations.md) file.
471+
[docs_include/abbreviations.md](https://github.com/aai-institute/pyDVL/blob/develop/docs_includes/abbreviations.md) file.
445472

446473
The syntax for abbreviations is:
447474

README.md

Lines changed: 40 additions & 38 deletions
@@ -161,53 +161,55 @@ lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
161161
The steps required to compute data values for your samples are:
162162

163163
1. Import the necessary packages (the exact ones will depend on your specific
164-
use case).
165-
2. Create a `Dataset` object with your train and test splits.
166-
3. Create an instance of a `SupervisedModel` (basically any sklearn compatible
167-
predictor), and wrap it in a `Utility` object together with the data and a
168-
scoring function.
169-
4. Use one of the methods defined in the library to compute the values. In the
170-
example below, we will use *Permutation Montecarlo Shapley*, an approximate
171-
method for computing Data Shapley values. The result is a variable of type
164+
use case, but most of the interface is exposed through `pydvl.valuation`).
165+
2. Create two `Dataset` objects with your train and test splits. There are
166+
some factories to do this from arrays or scikit-learn toy datasets.
167+
3. Create an instance of a `SupervisedScorer`, with any sklearn scorer and a
168+
"valuation set" over which your model will be scored.
169+
4. Wrap model and scorer in a `ModelUtility`.
170+
5. Use one of the methods defined in the library to compute the values. In the
171+
example below, we use the most basic *Montecarlo Shapley* with uniform
172+
sampling, an approximate method for computing Data Shapley values.
173+
6. Call `fit` in a joblib parallel context. The result is a variable of type
172174
`ValuationResult` that contains the indices and their values as well as other
173-
attributes.
174-
5. Convert the valuation result to a dataframe, and analyze and visualize the
175-
values.
175+
attributes. This object can be sliced, sorted and inspected directly, or you
176+
can convert it to a dataframe for convenience.
176177

177178
The higher the value for an index, the more important it is for the chosen
178179
model, dataset and scorer. Reciprocally, low-value points could be mislabelled,
179180
or out-of-distribution, and dropping them can improve the model's performance.
180181

181182
```python
182-
from sklearn.datasets import load_breast_cancer
183-
from sklearn.linear_model import LogisticRegression
184-
185-
from pydvl.utils import Dataset, Scorer, Utility
186-
from pydvl.value import (MaxUpdates, RelativeTruncation,
187-
permutation_montecarlo_shapley)
188-
189-
data = Dataset.from_sklearn(
190-
load_breast_cancer(),
191-
train_size=10,
192-
stratify_by_target=True,
193-
random_state=16,
194-
)
195-
model = LogisticRegression()
196-
u = Utility(
197-
model,
198-
data,
199-
Scorer("accuracy", default=0.0)
200-
)
201-
values = permutation_montecarlo_shapley(
202-
u,
203-
truncation=RelativeTruncation(u, 0.05),
204-
done=MaxUpdates(1000),
205-
seed=16,
206-
progress=True
207-
)
208-
df = values.to_dataframe(column="data_value")
183+
from joblib import parallel_config
184+
from sklearn.datasets import load_iris
185+
from sklearn.svm import SVC
186+
from pydvl.valuation import Dataset, ShapleyValuation, UniformSampler,\
187+
MinUpdates, ModelUtility, SupervisedScorer
188+
189+
seed = 42
190+
model = SVC(kernel="linear", probability=True, random_state=seed)
191+
192+
train, val = Dataset.from_sklearn(load_iris(), train_size=0.6, random_state=24)
193+
scorer = SupervisedScorer(model, val, default=0.0)
194+
utility = ModelUtility(model, scorer)
195+
sampler = UniformSampler(batch_size=2 ** 6, seed=seed)
196+
stopping = MinUpdates(1000)
197+
valuation = ShapleyValuation(utility, sampler, stopping, progress=True)
198+
199+
with parallel_config(n_jobs=32):
200+
valuation.fit(train)
201+
202+
result = valuation.values()
203+
df = result.to_dataframe(column="shapley")
209204
```
210205

206+
### Deprecation notice
207+
208+
Up until v0.9.2 valuation methods were available through the `pydvl.value`
209+
module, which is now deprecated in favour of the design showcased above,
210+
available under `pydvl.valuation`. The old module will be removed in a future
211+
release.
212+
211213
# Contributing
212214

213215
Please open new issues for bugs, feature requests and extensions. You can read

build_scripts/generate_api_docs.py

Lines changed: 40 additions & 1 deletion
@@ -1,31 +1,70 @@
11
"""Generate the code reference pages."""
22

3+
import logging
4+
import os
35
from pathlib import Path
46

57
import mkdocs_gen_files
68

9+
logger = logging.getLogger(__name__)
10+
11+
EXCLUDES = [("pydvl", "valuation", "methods", "twodshapley")]
12+
713
nav = mkdocs_gen_files.Nav()
14+
doc_root = Path("docs")
815
root = Path("src") # / Path("pydvl")
916
for path in sorted(root.rglob("*.py")):
1017
module_path = path.relative_to(root).with_suffix("")
1118
doc_path = path.relative_to(root).with_suffix(".md")
12-
full_doc_path = Path("api") / doc_path
1319
parts = tuple(module_path.parts)
1420

21+
extra_preamble = None
22+
if parts[:2] == ("pydvl", "value"):
23+
extra_preamble = (
24+
'!!! Danger "Deprecation notice"\n'
25+
" This module is deprecated since v0.10.0"
26+
" in favor of [pydvl.valuation][].\n"
27+
)
28+
full_doc_path = Path("deprecated") / doc_path
29+
elif parts[:2] == ("pydvl", "parallel"):
30+
extra_preamble = (
31+
'!!! Danger "Deprecation notice"\n'
32+
" This module is deprecated since v0.10.0 in favor of"
33+
" joblib's context manager [joblib.parallel_config][].\n"
34+
)
35+
full_doc_path = Path("deprecated") / doc_path
36+
elif parts in EXCLUDES:
37+
logger.info(f"Excluding {module_path}")
38+
continue
39+
else:
40+
full_doc_path = Path("api") / doc_path
41+
42+
extra_args = ""
1543
if parts[-1] == "__init__":
44+
logger.info(f"Excluding all members from {module_path}")
1645
parts = parts[:-1]
1746
doc_path = doc_path.with_name("index.md")
1847
full_doc_path = full_doc_path.with_name("index.md")
48+
extra_args = " options:\n members: []\n"
1949
elif parts[-1] == "__main__":
2050
continue
2151
elif parts[-1].startswith("_"):
2252
continue
2353

2454
nav[parts] = doc_path.as_posix()
2555

56+
if os.path.exists(doc_root / full_doc_path):
57+
logger.info(f"File {full_doc_path} already exists in {doc_root}, skipping.")
58+
continue
59+
2660
with mkdocs_gen_files.open(full_doc_path, "w") as fd:
2761
identifier = ".".join(parts)
62+
if extra_preamble:
63+
fd.write(extra_preamble)
2864
fd.write(f"::: {identifier}")
65+
if extra_args:
66+
fd.write("\n")
67+
fd.write(extra_args)
2968

3069
mkdocs_gen_files.set_edit_path(full_doc_path, path)
3170
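
The routing rules in the script above can be summarised in a small standalone sketch. It assumes nothing beyond the standard library and deliberately leaves out `mkdocs_gen_files`, the navigation object and the check for hand-written files under `docs/`. The function name `route_module` and its return shape are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Mirrors the constants and prefixes used in the diff above.
DEPRECATED_PREFIXES = {("pydvl", "value"), ("pydvl", "parallel")}
EXCLUDES = {("pydvl", "valuation", "methods", "twodshapley")}


def route_module(path: str, root: str = "src") -> Optional[Tuple[str, bool]]:
    """Return (doc page path, hide_members) for a source file, or None if skipped.

    Hypothetical helper: it only reproduces the path decisions of the script.
    """
    rel = PurePosixPath(path).relative_to(root)
    parts = rel.with_suffix("").parts
    doc_path = rel.with_suffix(".md")

    if parts[:2] in DEPRECATED_PREFIXES:
        target = PurePosixPath("deprecated") / doc_path
    elif parts in EXCLUDES:
        return None
    else:
        target = PurePosixPath("api") / doc_path

    hide_members = False
    if parts[-1] == "__init__":
        # Package entry points become index pages with no members listed.
        target = target.with_name("index.md")
        hide_members = True
    elif parts[-1].startswith("_"):
        # Covers __main__ and private modules alike.
        return None
    return target.as_posix(), hide_members


print(route_module("src/pydvl/value/shapley/__init__.py"))
print(route_module("src/pydvl/valuation/result.py"))
```

Keeping the decision logic in a pure function like this would make it testable without running a documentation build, although the script in the commit performs the same checks inline.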

docs/api/pydvl/value/shapley/classwise/img/classwise-shapley-discounted-utility-function.svg

Lines changed: 0 additions & 3 deletions
This file was deleted.

docs/assets/pydvl.bib

Lines changed: 30 additions & 0 deletions
@@ -451,6 +451,19 @@ @inproceedings{schoch_csshapley_2022
451451
keywords = {notion}
452452
}
453453

454+
@article{semmler_re_2024,
455+
title = {[{{Re}}] {{Classwise-Shapley}} Values for Data Valuation},
456+
author = {Semmler, Markus and de Benito Delgado, Miguel},
457+
date = {2024-07},
458+
journaltitle = {Transactions on Machine Learning Research},
459+
shortjournal = {Trans. Mach. Learn. Res.},
460+
issn = {2835-8856},
461+
url = {https://openreview.net/forum?id=srFEYJkqD7&noteId=zVi6DINuXT},
462+
urldate = {2024-07-10},
463+
abstract = {We evaluate CS-Shapley, a data valuation method introduced in Schoch et al. (2022) for classification problems. We repeat the experiments in the paper, including two additional methods, the Least Core (Yan \& Procaccia, 2021) and Data Banzhaf (Wang \& Jia, 2023), a comparison not found in the literature. We include more conservative error estimates and additional metrics, like rank stability, and a variance-corrected version of Weighted Accuracy Drop, originally introduced in Schoch et al. (2022). We conclude that while CS-Shapley helps in the scenarios it was originally tested in, in particular for the detection of corrupted labels, it is outperformed by the conceptually simpler Data Banzhaf in the task of detecting highly influential points.},
464+
langid = {english}
465+
}
466+
454467
@book{trefethen_numerical_1997,
455468
title = {Numerical {{Linear Algebra}}},
456469
author = {Trefethen, Lloyd N. and Bau, Iii, David},
@@ -526,6 +539,23 @@ @inproceedings{wu_davinz_2022
526539
keywords = {notion}
527540
}
528541

542+
@article{wu_variance_2023,
543+
title = {Variance Reduced {{Shapley}} Value Estimation for Trustworthy Data Valuation},
544+
author = {Wu, Mengmeng and Jia, Ruoxi and Lin, Changle and Huang, Wei and Chang, Xiangyu},
545+
date = {2023-11-01},
546+
journaltitle = {Computers \& Operations Research},
547+
shortjournal = {Computers \& Operations Research},
548+
volume = {159},
549+
eprint = {2210.16835},
550+
eprinttype = {arXiv},
551+
pages = {106305},
552+
issn = {0305-0548},
553+
doi = {10.1016/j.cor.2023.106305},
554+
url = {https://www.sciencedirect.com/science/article/pii/S0305054823001697},
555+
urldate = {2023-09-17},
556+
abstract = {Data valuation, especially quantifying data value in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley and approximate it by means of the permutation sampling algorithm. To make up for the large estimation variance of the permutation sampling that hinders the development of the data marketplace, we propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short). We theoretically show how to stratify, how many samples are taken at each stratum, and the sample complexity analysis of VRDS. Finally, the effectiveness of VRDS is illustrated in different types of datasets and data removal applications.}
557+
}
558+
529559
@inproceedings{yan_if_2021,
530560
title = {If {{You Like Shapley Then You}}’ll {{Love}} the {{Core}}},
531561
booktitle = {Proceedings of the 35th {{AAAI Conference}} on {{Artificial Intelligence}}},

docs/deprecated/.meta.yml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
search:
2+
boost: -10

docs/deprecated/index.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
---
2+
title: New interface for data valuation
3+
alias:
4+
name: deprecation-data-valuation
5+
---
6+
7+
The module [pydvl.value][] and its submodules have been deprecated in favor of
8+
the new interface [pydvl.valuation][]. The new interface is more flexible and
9+
allows for more advanced data valuation techniques. The old interface will be
10+
removed in a future release.
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
!!! Danger "Deprecation notice"
2+
This module is deprecated since v0.10.0 in favor of [pydvl.valuation][].
3+
4+
::: pydvl.value.least_core
5+
options:
6+
members:
7+
- LeastCoreMode
8+
- compute_least_core_values
