Five correctness issues in diffprivlib (linear regression, k-means, vector mechanisms) + offer to contribute bounded Gaussian mechanism #104

@imcjp

Description


Describe the bug
There are five places where the current behavior appears inconsistent with standard DP sensitivity/accounting or with the intended API semantics.

  1. models/linear_regression.py, _construct_regression_obj (≈ L144)
  • Current problem: The per-feature sensitivity is computed using the lower bound twice, ignoring the upper bound; this underestimates the sensitivity whenever the bounds are asymmetric.

    # current
    sensitivity = np.max(np.abs([bounds_X[0][i], bounds_X[0][i]])) ** 2
  • Improvement with proposed change: Use the maximum absolute endpoint of the interval, then square:

    # proposed
    sensitivity = np.max(np.abs([bounds_X[0][i], bounds_X[1][i]])) ** 2

    This restores conservative, correct per-coordinate calibration (a numeric check appears after this list).

  2. models/linear_regression.py, fit (≈ L295)
  • Current problem: The DP-centering of X and of y shares a single budget slice; the composition of the two queries is not accounted for, which can overspend the privacy budget.

    # current
    epsilon_intercept_scale = 1 / (n_features + 1) if self.fit_intercept else 0
  • Improvement with proposed change: Split once more by 2 so the two centering operations compose correctly:

    # proposed
    epsilon_intercept_scale = 1 / ((n_features + 1) * 2) if self.fit_intercept else 0

    This aligns with basic DP composition for two independent queries (see the accounting sketch after this list).

  3. models/k_means.py, _update_centers (≈ L238)
  • Current problem: epsilon_0 (init) and epsilon_i (per-iteration) appear swapped, misallocating noise between initialization and updates.

    # current
    epsilon_0, epsilon_i = self._split_epsilon(dims, total_iters)
  • Improvement with proposed change: Assign in the intended order so init vs. iterative budgets match semantics:

    # proposed
    epsilon_i, epsilon_0 = self._split_epsilon(dims, total_iters)

    This fixes privacy accounting and typically improves accuracy.

  4. mechanisms/vector.py, randomise (≈ L150)
  • Current problem: Privacy calibration uses a linear dependence on the data-sensitivity bound; DP-ERM analyses (e.g., CMS11: Chaudhuri, Monteleoni & Sarwate, 2011) imply a quadratic dependence for the relevant constant, risking under-calibrated noise.

    # current
    epsilon_p = self.epsilon - 2 * np.log(1 + self.function_sensitivity * self.data_sensitivity / self.alpha)
  • Improvement with proposed change: Use quadratic dependence:

    # proposed
    epsilon_p = self.epsilon - 2 * np.log(1 + self.function_sensitivity * self.data_sensitivity**2 / self.alpha)

    This restores theoretical consistency in the privacy–utility trade-off.

  5. mechanisms/vector.py, randomise (≈ L154)
  • Current problem: The same linear vs. quadratic scaling issue propagates to delta, affecting the per-coordinate failure probability.

    # current
    delta = (self.function_sensitivity * self.data_sensitivity / np.expm1(self.epsilon / 4) - self.alpha) / self.n
  • Improvement with proposed change: Quadratic scaling for consistency with item (4):

    # proposed
    delta = (self.function_sensitivity * self.data_sensitivity**2 / np.expm1(self.epsilon / 4) - self.alpha) / self.n

    This corrects the calibration of the vector mechanism.
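
As a quick numeric check for item (1), consider made-up asymmetric bounds of [-1, 5] for a single feature; the current expression underestimates the sensitivity by a factor of 25:

    import numpy as np

    # Hypothetical asymmetric bounds for one feature: lower = -1, upper = 5.
    bounds_X = (np.array([-1.0]), np.array([5.0]))
    i = 0

    # Current: uses the lower bound twice, so the upper bound is ignored.
    current = np.max(np.abs([bounds_X[0][i], bounds_X[0][i]])) ** 2   # (-1)^2 = 1

    # Proposed: uses the larger absolute endpoint of the interval.
    proposed = np.max(np.abs([bounds_X[0][i], bounds_X[1][i]])) ** 2  # 5^2 = 25

    print(current, proposed)  # 1.0 25.0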
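
Similarly for item (2), assuming the X-centering and the y-centering are two independent queries that compose additively, the current slice gets spent twice:

    # Made-up totals: epsilon = 1.0, n_features = 3, fit_intercept=True.
    epsilon, n_features = 1.0, 3

    # Current: the same slice funds BOTH centering queries, so the realized
    # spend is double the budgeted slice.
    scale_current = 1 / (n_features + 1)            # 0.25
    spend_current = 2 * epsilon * scale_current     # 0.5

    # Proposed: halve the slice so the two queries together stay on budget.
    scale_proposed = 1 / ((n_features + 1) * 2)     # 0.125
    spend_proposed = 2 * epsilon * scale_proposed   # 0.25

    print(spend_current, spend_proposed)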

Note: I have implemented a bounded Gaussian mechanism that is currently missing from the library and can package it to match diffprivlib’s API/style (parameterization, clipping, and accounting consistent with the library’s conventions). If there is interest, I’m happy to contribute this as a PR and discuss tests, documentation, and theoretical notes.
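
For concreteness, here is a minimal sketch of the sampling side. The function name and signature are placeholders, not diffprivlib API, and the sigma calibration is deliberately omitted: it must account for the truncation, since plain Gaussian calibration no longer holds once outputs are bounded.

    import numpy as np

    def bounded_gaussian_sample(value, sigma, lower, upper, rng=None):
        """Rejection-sample Gaussian noise so the output lies in [lower, upper].

        Sketch only: sigma is assumed to be pre-calibrated with the
        truncation taken into account; deriving that calibration is part
        of the theoretical notes mentioned above.
        """
        rng = np.random.default_rng() if rng is None else rng
        while True:
            candidate = value + rng.normal(0.0, sigma)
            if lower <= candidate <= upper:
                return candidate

The PR version would wrap this in a DPMechanism subclass exposing the usual randomise interface, with clipping and accounting matching the library's conventions.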


To Reproduce

  1. Inspect the files at the indicated lines to see the current expressions.
  2. For (1): Use asymmetric feature bounds; observe sensitivity ignores the upper bound.
  3. For (2): Run LinearRegression(fit_intercept=True) with DP-centering; note X and y share the same budget slice.
  4. For (3): Run KMeans; log the split from _split_epsilon and observe the swap.
  5. For (4–5): Instantiate the vector mechanism; compare its calibration with DP-ERM references showing quadratic dependence on the data-norm bound (see the comparison snippet after this list).
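
For step 5, the two calibrations can be compared directly from the expressions quoted in items (4)–(5), without touching library internals; the parameter values below are arbitrary:

    import numpy as np

    # Arbitrary illustrative parameters.
    epsilon, alpha, n = 6.0, 0.01, 1000
    function_sensitivity, data_sensitivity = 0.01, 2.0

    # Item (4): epsilon_p under linear vs. quadratic dependence.
    eps_p_linear = epsilon - 2 * np.log(
        1 + function_sensitivity * data_sensitivity / alpha)        # ~3.80
    eps_p_quadratic = epsilon - 2 * np.log(
        1 + function_sensitivity * data_sensitivity ** 2 / alpha)   # ~2.78

    # Item (5): the corresponding delta expressions.
    delta_linear = (function_sensitivity * data_sensitivity
                    / np.expm1(epsilon / 4) - alpha) / n
    delta_quadratic = (function_sensitivity * data_sensitivity ** 2
                       / np.expm1(epsilon / 4) - alpha) / n

    # Whenever data_sensitivity > 1, the quadratic form yields a smaller
    # epsilon_p (i.e., more noise); the linear form under-calibrates.
    print(eps_p_linear, eps_p_quadratic)
    print(delta_linear, delta_quadratic)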

Expected behavior

  • Sensitivity per feature uses max(abs(lower), abs(upper))**2.
  • DP-centering budgets for X and y compose (additional split by 2).
  • epsilon_0 (init) and epsilon_i (iteration) are assigned in the intended order.
  • Vector mechanism calibration reflects quadratic dependence on the data sensitivity bound in both epsilon_p and delta.

Screenshots
N/A


System information (please complete):

  • OS: [e.g., Ubuntu 22.04]
  • Python version: [e.g., 3.10.13]
  • diffprivlib version or commit: [e.g., 0.x / commit hash]
  • numpy / scikit-learn version: [e.g., numpy 1.26.x / scikit-learn 1.4.x]
