Five correctness issues in diffprivlib (linear regression, k-means, vector mechanisms) + offer to contribute bounded Gaussian mechanism #104

@imcjp

Description


Describe the bug
There are five places where the current behavior appears inconsistent with standard DP sensitivity/accounting or with the intended API semantics.

  1. models/linear_regression.py, _construct_regression_obj (≈ L144)
  • Current problem: The per-feature sensitivity is computed using the lower bound twice, ignoring the upper bound; this underestimates the sensitivity whenever the bounds are asymmetric.

    # current
    sensitivity = np.max(np.abs([bounds_X[0][i], bounds_X[0][i]])) ** 2
  • Improvement with proposed change: Use the maximum absolute endpoint of the interval, then square:

    # proposed
    sensitivity = np.max(np.abs([bounds_X[0][i], bounds_X[1][i]])) ** 2

    This restores conservative, correct per-coordinate calibration (a numeric check appears after this list).

  2. models/linear_regression.py, fit (≈ L295)
  • Current problem: The DP-centering of X and of y shares a single budget slice; the composition of the two queries is not accounted for, which can overspend the privacy budget.

    # current
    epsilon_intercept_scale = 1 / (n_features + 1) if self.fit_intercept else 0
  • Improvement with proposed change: Split once more by 2 so the two centering operations compose correctly:

    # proposed
    epsilon_intercept_scale = 1 / ((n_features + 1) * 2) if self.fit_intercept else 0

    This aligns with basic DP composition for two independent queries (see the accounting sketch after this list).

  3. models/k_means.py, _update_centers (≈ L238)
  • Current problem: epsilon_0 (init) and epsilon_i (per-iteration) appear swapped, misallocating noise between initialization and updates.

    # current
    epsilon_0, epsilon_i = self._split_epsilon(dims, total_iters)
  • Improvement with proposed change: Assign in the intended order so init vs. iterative budgets match semantics:

    # proposed
    epsilon_i, epsilon_0 = self._split_epsilon(dims, total_iters)

    This fixes privacy accounting and typically improves accuracy.

  4. mechanisms/vector.py, randomise (≈ L150)
  • Current problem: Privacy calibration uses a linear dependence on the data-sensitivity bound; DP-ERM analyses (e.g., CMS11: Chaudhuri, Monteleoni & Sarwate, 2011) imply a quadratic dependence for the relevant constant, risking under-calibrated noise.

    # current
    epsilon_p = self.epsilon - 2 * np.log(1 + self.function_sensitivity * self.data_sensitivity / self.alpha)
  • Improvement with proposed change: Use quadratic dependence:

    # proposed
    epsilon_p = self.epsilon - 2 * np.log(1 + self.function_sensitivity * self.data_sensitivity**2 / self.alpha)

    This restores theoretical consistency in the privacy–utility trade-off.

  5. mechanisms/vector.py, randomise (≈ L154)
  • Current problem: The same linear vs. quadratic scaling issue propagates to delta, affecting the per-coordinate failure probability.

    # current
    delta = (self.function_sensitivity * self.data_sensitivity / np.expm1(self.epsilon / 4) - self.alpha) / self.n
  • Improvement with proposed change: Quadratic scaling for consistency with item (4):

    # proposed
    delta = (self.function_sensitivity * self.data_sensitivity**2 / np.expm1(self.epsilon / 4) - self.alpha) / self.n

    This corrects the calibration of the vector mechanism.
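
As a quick numeric check for item (1), consider made-up asymmetric bounds of [-1, 5] for a single feature; the current expression underestimates the sensitivity by a factor of 25:

    import numpy as np

    # Hypothetical asymmetric bounds for one feature: lower = -1, upper = 5.
    bounds_X = (np.array([-1.0]), np.array([5.0]))
    i = 0

    # Current: uses the lower bound twice, so the upper bound is ignored.
    current = np.max(np.abs([bounds_X[0][i], bounds_X[0][i]])) ** 2   # (-1)^2 = 1

    # Proposed: uses the larger absolute endpoint of the interval.
    proposed = np.max(np.abs([bounds_X[0][i], bounds_X[1][i]])) ** 2  # 5^2 = 25

    print(current, proposed)  # 1.0 25.0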
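
Similarly for item (2), assuming the X-centering and the y-centering are two independent queries that compose additively, the current slice gets spent twice:

    # Made-up totals: epsilon = 1.0, n_features = 3, fit_intercept=True.
    epsilon, n_features = 1.0, 3

    # Current: the same slice funds BOTH centering queries, so the realized
    # spend is double the budgeted slice.
    scale_current = 1 / (n_features + 1)            # 0.25
    spend_current = 2 * epsilon * scale_current     # 0.5

    # Proposed: halve the slice so the two queries together stay on budget.
    scale_proposed = 1 / ((n_features + 1) * 2)     # 0.125
    spend_proposed = 2 * epsilon * scale_proposed   # 0.25

    print(spend_current, spend_proposed)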

Note: I have implemented a bounded Gaussian mechanism that is currently missing from the library and can package it to match diffprivlib’s API/style (parameterization, clipping, and accounting consistent with the library’s conventions). If there is interest, I’m happy to contribute this as a PR and discuss tests, documentation, and theoretical notes.
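
For concreteness, here is a minimal sketch of the sampling side. The function name and signature are placeholders, not diffprivlib API, and the sigma calibration is deliberately omitted: it must account for the truncation, since plain Gaussian calibration no longer holds once outputs are bounded.

    import numpy as np

    def bounded_gaussian_sample(value, sigma, lower, upper, rng=None):
        """Rejection-sample Gaussian noise so the output lies in [lower, upper].

        Sketch only: sigma is assumed to be pre-calibrated with the
        truncation taken into account; deriving that calibration is part
        of the theoretical notes mentioned above.
        """
        rng = np.random.default_rng() if rng is None else rng
        while True:
            candidate = value + rng.normal(0.0, sigma)
            if lower <= candidate <= upper:
                return candidate

The PR version would wrap this in a DPMechanism subclass exposing the usual randomise interface, with clipping and accounting matching the library's conventions.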


To Reproduce

  1. Inspect the files at the indicated lines to see the current expressions.
  2. For (1): Use asymmetric feature bounds; observe sensitivity ignores the upper bound.
  3. For (2): Run LinearRegression(fit_intercept=True) with DP-centering; note X and y share the same budget slice.
  4. For (3): Run KMeans; log the split from _split_epsilon and observe the swap.
  5. For (4–5): Instantiate the vector mechanism; compare its calibration with DP-ERM references showing quadratic dependence on the data-norm bound (see the comparison snippet after this list).
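
For step 5, the two calibrations can be compared directly from the expressions quoted in items (4)–(5), without touching library internals; the parameter values below are arbitrary:

    import numpy as np

    # Arbitrary illustrative parameters.
    epsilon, alpha, n = 6.0, 0.01, 1000
    function_sensitivity, data_sensitivity = 0.01, 2.0

    # Item (4): epsilon_p under linear vs. quadratic dependence.
    eps_p_linear = epsilon - 2 * np.log(
        1 + function_sensitivity * data_sensitivity / alpha)        # ~3.80
    eps_p_quadratic = epsilon - 2 * np.log(
        1 + function_sensitivity * data_sensitivity ** 2 / alpha)   # ~2.78

    # Item (5): the corresponding delta expressions.
    delta_linear = (function_sensitivity * data_sensitivity
                    / np.expm1(epsilon / 4) - alpha) / n
    delta_quadratic = (function_sensitivity * data_sensitivity ** 2
                       / np.expm1(epsilon / 4) - alpha) / n

    # Whenever data_sensitivity > 1, the quadratic form yields a smaller
    # epsilon_p (i.e., more noise); the linear form under-calibrates.
    print(eps_p_linear, eps_p_quadratic)
    print(delta_linear, delta_quadratic)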

Expected behavior

  • Sensitivity per feature uses max(abs(lower), abs(upper))**2.
  • DP-centering budgets for X and y compose (additional split by 2).
  • epsilon_0 (init) and epsilon_i (iteration) are assigned in the intended order.
  • Vector mechanism calibration reflects quadratic dependence on the data sensitivity bound in both epsilon_p and delta.

Screenshots
N/A


System information (please complete):

  • OS: [e.g., Ubuntu 22.04]
  • Python version: [e.g., 3.10.13]
  • diffprivlib version or commit: [e.g., 0.x / commit hash]
  • numpy / scikit-learn version: [e.g., numpy 1.26.x / scikit-learn 1.4.x]
