
Recommendations for setting data_norm and bounds in models after StandardScaler #102

@PaulineMauryL

Description

Hi,
Thank you very much for this great work!

I have been experimenting with DiffPrivLib using pipelines such as:

from sklearn.pipeline import Pipeline
from diffprivlib import models

dpl_pipeline_with_scaler = Pipeline([
    ('scaler', models.StandardScaler(epsilon=0.5, bounds=dpl_bounds)),
    ('classifier', models.LogisticRegression(epsilon=1.0, data_norm=log_reg_data_norm))
])

and

kmeans_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon=0.5, bounds=dpl_bounds)),
    ('kmeans', models.KMeans(n_clusters=N_CLUSTERS, epsilon=5.0,
                             bounds=([-3, -3, -3, -3, -3], [3, 3, 3, 3, 3])))
])
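
Here dpl_bounds is meant to be the per-feature (min, max) bounds of the original, unscaled data, and N_CLUSTERS is just a constant. For example (all values hypothetical):

# Hypothetical placeholders so the snippets above run; in practice these
# come from domain knowledge about the raw data, not from the data itself
dpl_bounds = ([0, 0, 0, 0, 0], [100, 50, 10, 1, 200])
N_CLUSTERS = 3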

My question is about how best to set the parameters of the model that follows the StandardScaler:

  • data_norm (for LogisticRegression for instance)
  • bounds (for KMeans for instance)

For now, I’ve been reasoning as follows:

  • After applying StandardScaler, most standardized values should empirically fall within [-3, 3] per feature (by the 68–95–99.7 rule), assuming each feature is roughly normally distributed.
  • Based on this, I set the KMeans bounds to [-3, 3] per feature.
  • For LogisticRegression, I estimate log_reg_data_norm = np.sqrt(n_features * 3**2), i.e. the L2 norm of a row whose every scaled feature sits at ±3 (see the sketch below).
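
Concretely, the calculation looks like this (n_features = 5 and the variable names are just illustrative, matching the five-feature KMeans bounds above):

import numpy as np

n_features = 5  # example: five features, as in the KMeans pipeline

# Per-feature bounds after scaling, assuming standardized values
# mostly fall in [-3, 3]
kmeans_bounds = ([-3] * n_features, [3] * n_features)

# Worst-case L2 norm of a row whose every scaled feature sits at +/-3
log_reg_data_norm = np.sqrt(n_features * 3**2)  # = 3 * sqrt(5) ≈ 6.71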

However, this approach may not be ideal. Do you have any recommendations for a better way to set these parameters after scaling? Or did I misunderstand, and should I instead use the original input data bounds (the ones passed to StandardScaler)?

Thanks again for your work on this library!
