Five correctness issues in diffprivlib (linear regression, k-means, vector mechanisms) + offer to contribute bounded Gaussian mechanism
Describe the bug
There are five places where current behavior appears inconsistent with standard DP sensitivity/accounting or with intended API semantics.
1. models/linear_regression.py → _construct_regression_obj (≈ L144)

Current problem: Feature sensitivity is computed from the lower bound twice, ignoring the upper bound entirely; on asymmetric ranges this underestimates the sensitivity.
# current sensitivity = np.max(np.abs([bounds_X[0][i], bounds_X[0][i]])) ** 2

Proposed change: Use the maximum absolute endpoint of the interval, then square:
# proposed sensitivity = np.max(np.abs([bounds_X[0][i], bounds_X[1][i]])) ** 2
This restores conservative, correct per-coordinate calibration.
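A minimal standalone sketch of the difference (the helper names here are illustrative, not library code):

```python
import numpy as np

def sensitivity_current(lower, upper):
    # Mirrors the current expression: the lower bound appears twice,
    # so the upper bound never enters the calculation.
    return np.max(np.abs([lower, lower])) ** 2

def sensitivity_proposed(lower, upper):
    # Largest absolute endpoint of [lower, upper], squared.
    return np.max(np.abs([lower, upper])) ** 2

# Asymmetric feature range [-1, 10]: the current expression yields 1.0,
# far below the worst-case magnitude 10**2 = 100.
print(sensitivity_current(-1.0, 10.0))   # 1.0
print(sensitivity_proposed(-1.0, 10.0))  # 100.0
```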
2. models/linear_regression.py → fit (≈ L295)

Current problem: DP-centering of X and of y share a single budget slice; their composition is not accounted for, which can over-spend the privacy budget.
# current epsilon_intercept_scale = 1 / (n_features + 1) if self.fit_intercept else 0

Proposed change: Split once more by 2 so the two centering operations compose correctly:
# proposed epsilon_intercept_scale = 1 / ((n_features + 1) * 2) if self.fit_intercept else 0
This aligns with basic DP composition for two independent queries.
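The accounting can be checked with a few lines of arithmetic (a hypothetical sketch, not library code; the scale functions below just reproduce the two expressions above):

```python
def intercept_scale_current(n_features):
    # Current: one slice, shared by both centering operations.
    return 1 / (n_features + 1)

def intercept_scale_proposed(n_features):
    # Proposed: the slice is halved so the two operations compose.
    return 1 / ((n_features + 1) * 2)

epsilon, n_features = 1.0, 4
# Centering X and centering y are two queries; under basic sequential
# composition their budget costs add.
spent_current = 2 * epsilon * intercept_scale_current(n_features)
spent_proposed = 2 * epsilon * intercept_scale_proposed(n_features)
print(spent_current)   # 0.4 -- double the intended 1 / (n_features + 1) slice
print(spent_proposed)  # 0.2 -- matches the intended slice
```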
3. models/k_means.py → _update_centers (≈ L238)

Current problem: epsilon_0 (init) and epsilon_i (per-iteration) appear swapped, misallocating noise between initialization and updates.
# current epsilon_0, epsilon_i = self._split_epsilon(dims, total_iters)

Proposed change: Assign in the intended order so the init and iterative budgets match their semantics:
# proposed epsilon_i, epsilon_0 = self._split_epsilon(dims, total_iters)
This fixes the privacy accounting and typically improves accuracy.
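To see why the unpacking order matters, here is a toy stand-in for the split (the stub and its split rule are hypothetical, chosen only to make the over-spend visible; it is not diffprivlib's actual formula):

```python
def split_epsilon_stub(epsilon, total_iters):
    # Hypothetical split returning (per-iteration slice, init slice),
    # i.e. the order the proposed unpacking assumes.
    eps_i = epsilon / (2 * total_iters)  # per-iteration slice
    eps_0 = epsilon / 2                  # initialisation slice
    return eps_i, eps_0

epsilon, total_iters = 1.0, 5

# Current (swapped) unpacking: init gets the small slice, every iteration
# gets the large one, so the total spend blows past epsilon.
epsilon_0, epsilon_i = split_epsilon_stub(epsilon, total_iters)
print(epsilon_0 + total_iters * epsilon_i)  # 2.6 -- over-spends the budget

# Proposed unpacking matches the return order; total spend equals epsilon.
epsilon_i, epsilon_0 = split_epsilon_stub(epsilon, total_iters)
print(epsilon_0 + total_iters * epsilon_i)  # 1.0
```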
4. mechanisms/vector.py → randomise (≈ L150)

Current problem: Privacy calibration uses a linear dependence on the data sensitivity bound; DP-ERM analyses (e.g., CMS11) imply a quadratic dependence for the relevant constant, risking under-calibrated noise.
# current epsilon_p = self.epsilon - 2 * np.log(1 + self.function_sensitivity * self.data_sensitivity / self.alpha)

Proposed change: Use quadratic dependence:
# proposed epsilon_p = self.epsilon - 2 * np.log(1 + self.function_sensitivity * self.data_sensitivity**2 / self.alpha)
This restores theoretical consistency in the privacy–utility trade-off.
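A quick numeric comparison of the two calibrations (an illustrative sketch with made-up parameter values, not the mechanism itself):

```python
import numpy as np

def epsilon_p(epsilon, function_sensitivity, data_sensitivity, alpha, quadratic):
    # Residual budget after the log term, with linear or quadratic
    # dependence on the data sensitivity bound.
    d = data_sensitivity ** 2 if quadratic else data_sensitivity
    return epsilon - 2 * np.log(1 + function_sensitivity * d / alpha)

eps, func_sens, data_sens, alpha = 10.0, 0.25, 2.0, 0.01
linear = epsilon_p(eps, func_sens, data_sens, alpha, quadratic=False)
quadratic = epsilon_p(eps, func_sens, data_sens, alpha, quadratic=True)
# With data_sensitivity > 1 the quadratic term is larger, leaving less
# residual budget -- i.e. more noise, as the CMS11-style analysis requires.
print(quadratic < linear)  # True
```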
5. mechanisms/vector.py → randomise (≈ L154)

Current problem: The same linear-vs-quadratic scaling issue propagates to delta, affecting the per-coordinate failure probability.
# current delta = (self.function_sensitivity * self.data_sensitivity / np.expm1(self.epsilon / 4) - self.alpha) / self.n

Proposed change: Quadratic scaling, for consistency with item (4):
# proposed delta = (self.function_sensitivity * self.data_sensitivity**2 / np.expm1(self.epsilon / 4) - self.alpha) / self.n
This corrects the calibration of the vector mechanism.
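The same side-by-side check for the delta term (again an illustrative sketch with made-up parameter values):

```python
import numpy as np

def delta_term(function_sensitivity, data_sensitivity, epsilon, alpha, n, quadratic):
    # Per-coordinate delta with linear or quadratic dependence on the
    # data sensitivity bound.
    d = data_sensitivity ** 2 if quadratic else data_sensitivity
    return (function_sensitivity * d / np.expm1(epsilon / 4) - alpha) / n

args = dict(function_sensitivity=0.25, data_sensitivity=2.0,
            epsilon=1.0, alpha=0.01, n=100)
# For data_sensitivity > 1 the quadratic form yields a larger delta,
# consistent with the change proposed for epsilon_p in item (4).
print(delta_term(quadratic=True, **args) > delta_term(quadratic=False, **args))  # True
```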
Note: I have implemented a bounded Gaussian mechanism that is currently missing from the library and can package it to match diffprivlib’s API/style (parameterization, clipping, and accounting consistent with the library’s conventions). If there is interest, I’m happy to contribute this as a PR and discuss tests, documentation, and theoretical notes.
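For context, the sampling step of such a mechanism could look like the sketch below. This is only an illustration of one possible bounded (truncated) Gaussian sampler via rejection sampling; the function name is hypothetical, and the substantive part of the contribution (calibrating sigma so the truncated mechanism satisfies (epsilon, delta)-DP, plus the accounting) is deliberately not shown here:

```python
import random

def sample_bounded_gaussian(mean, sigma, lower, upper, rng=None):
    # Rejection sampling from N(mean, sigma^2) restricted to [lower, upper].
    # Simple and exact, but can be slow when the bounds are narrow relative
    # to sigma.  Privacy calibration of sigma is NOT handled here.
    rng = rng or random.Random(0)
    while True:
        x = rng.gauss(mean, sigma)
        if lower <= x <= upper:
            return x

print(0.0 <= sample_bounded_gaussian(0.5, 1.0, 0.0, 1.0) <= 1.0)  # True
```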
To Reproduce
- Inspect the files at the indicated lines to see the current expressions.
- For (1): Use asymmetric feature bounds; observe that the sensitivity ignores the upper bound.
- For (2): Run LinearRegression(fit_intercept=True) with DP-centering; note that X and y share the same budget slice.
- For (3): Run KMeans; log the split from _split_epsilon and observe the swap.
- For (4)–(5): Instantiate the vector mechanism; compare its calibration with DP-ERM references showing quadratic dependence on the data-norm bound.
Expected behavior
- Per-feature sensitivity uses max(abs(lower), abs(upper))**2.
- DP-centering budgets for X and y compose (an additional split by 2).
- epsilon_0 (init) and epsilon_i (iteration) are assigned in the intended order.
- Vector mechanism calibration reflects quadratic dependence on the data sensitivity bound in both epsilon_p and delta.
Screenshots
N/A
System information (please complete):
- OS: [e.g., Ubuntu 22.04]
- Python version: [e.g., 3.10.13]
- diffprivlib version or commit: [e.g., 0.x / commit hash]
- numpy / scikit-learn version: [e.g., numpy 1.26.x / scikit-learn 1.4.x]