
Recommendations for setting data_norm and bounds in models after StandardScaler #102

@PaulineMauryL

Description

Hi,
Thank you very much for this great work!

I have been experimenting with DiffPrivLib using pipelines such as:

from sklearn.pipeline import Pipeline
from diffprivlib import models

dpl_pipeline_with_scaler = Pipeline([
    ('scaler', models.StandardScaler(epsilon=0.5, bounds=dpl_bounds)),
    ('classifier', models.LogisticRegression(epsilon=1.0, data_norm=log_reg_data_norm))
])

and

kmeans_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon=0.5, bounds=dpl_bounds)),
    ('kmeans', models.KMeans(n_clusters=N_CLUSTERS, epsilon=5.0,
                             bounds=([-3, -3, -3, -3, -3], [3, 3, 3, 3, 3])))
])
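
Here dpl_bounds is meant to be the per-feature (min, max) bounds of the original, unscaled data, and N_CLUSTERS is just a constant. For example (all values hypothetical):

# Hypothetical placeholders so the snippets above run; in practice these
# come from domain knowledge about the raw data, not from the data itself
dpl_bounds = ([0, 0, 0, 0, 0], [100, 50, 10, 1, 200])
N_CLUSTERS = 3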

My question is about how best to set the parameters of the model that follows the StandardScaler:

  • data_norm (for LogisticRegression for instance)
  • bounds (for KMeans for instance)

For now, I’ve been reasoning as follows:

  • After applying StandardScaler, most standardized values should empirically fall within [-3, 3] per feature (by the 68–95–99.7 rule), assuming each feature is roughly normally distributed.
  • Based on this, I set the KMeans bounds to [-3, 3] per feature.
  • For LogisticRegression, I estimate log_reg_data_norm = np.sqrt(n_features * 3**2), i.e. the L2 norm of a row whose every scaled feature sits at ±3 (see the sketch below).
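
Concretely, the calculation looks like this (n_features = 5 and the variable names are just illustrative, matching the five-feature KMeans bounds above):

import numpy as np

n_features = 5  # example: five features, as in the KMeans pipeline

# Per-feature bounds after scaling, assuming standardized values
# mostly fall in [-3, 3]
kmeans_bounds = ([-3] * n_features, [3] * n_features)

# Worst-case L2 norm of a row whose every scaled feature sits at +/-3
log_reg_data_norm = np.sqrt(n_features * 3**2)  # = 3 * sqrt(5) ≈ 6.71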

However, this approach may not be ideal. Do you have any recommendations for a better way to set these parameters after scaling? Or did I misunderstand, and should I instead use the original input data bounds (the ones passed to StandardScaler)?

Thanks again for your work on this library!
