Hi,
Thank you very much for this great work!
I have been experimenting with DiffPrivLib using pipelines such as:
```python
from sklearn.pipeline import Pipeline
from diffprivlib import models

dpl_pipeline_with_scaler = Pipeline([
    ('scaler', models.StandardScaler(epsilon=0.5, bounds=dpl_bounds)),
    ('classifier', models.LogisticRegression(epsilon=1.0, data_norm=log_reg_data_norm))
])
```
and
```python
kmeans_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon=0.5, bounds=dpl_bounds)),
    ('kmeans', models.KMeans(n_clusters=N_CLUSTERS, epsilon=5.0,
                             bounds=([-3, -3, -3, -3, -3], [3, 3, 3, 3, 3]))),
])
```
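For context, both pipelines assume `dpl_bounds`, `log_reg_data_norm` and `N_CLUSTERS` are defined earlier in my script. A minimal sketch of that setup plus the fit calls (the values and the synthetic data below are placeholders, not my real dataset) would be:

```python
import numpy as np

# Placeholder definitions, assuming five raw features in [0, 10];
# these would come before the pipeline definitions above.
dpl_bounds = (np.zeros(5), np.full(5, 10.0))
log_reg_data_norm = 3 * np.sqrt(5)
N_CLUSTERS = 3

# Synthetic stand-in data, just to exercise the pipelines end to end.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 5))
y_train = rng.integers(0, 2, size=100)

dpl_pipeline_with_scaler.fit(X_train, y_train)
kmeans_pipeline.fit(X_train)
```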
My question is about how best to set the parameters of the model that follows the standard scaler:

- `data_norm` (for LogisticRegression, for instance)
- `bounds` (for KMeans, for instance)
For now, I’ve been reasoning as follows:
- After applying StandardScaler, most values should empirically fall in [-3, 3] per feature, by the 68–95–99.7 rule (assuming the data is roughly normally distributed).
- Based on this, I set the KMeans `bounds` to `[-3, 3]` per feature.
- For the LogisticRegression, I estimate `log_reg_data_norm = np.sqrt(n_features * (3**2))`, as worked out below.
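Concretely, for the five features above, my reasoning works out as follows (`per_feature_bound` is just a name I use here for the assumed ±3 range; diffprivlib does not prescribe it):

```python
import numpy as np

n_features = 5
per_feature_bound = 3  # assumed range of standardized features (68-95-99.7 rule)

# Worst-case L2 norm of a scaled row whose every entry hits the bound:
log_reg_data_norm = np.sqrt(n_features * per_feature_bound**2)  # 3 * sqrt(5) ~= 6.71

# Matching per-feature bounds for KMeans, in diffprivlib's (lower, upper) form:
kmeans_bounds = (np.full(n_features, -per_feature_bound),
                 np.full(n_features, per_feature_bound))
```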
However, this approach may not be ideal. Do you have any recommendations for better ways to set these parameters after scaling? Or did I misunderstand, and should I instead use the original input data bounds (as used in the StandardScaler)?
Thanks again for your work on this library!