Skip to content

Commit ebe28d3

Browse files
committed
add datasets for sensitive data session
1 parent 2aad5c2 commit ebe28d3

File tree

5 files changed

+1035
-0
lines changed

5 files changed

+1035
-0
lines changed
Lines changed: 253 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,253 @@
1+
---
2+
title: "Data Anonymization with R's sdcMicro Package"
3+
author: "Renata Goncalves Curty - UCSB Library, Research Data Services"
4+
date: "2025-05-29"
5+
output:
6+
html_document: default
7+
---
8+
9+
```{r setup, include=FALSE}
10+
knitr::opts_chunk$set(echo = TRUE)
11+
```
12+
13+
## South Park Elementary School Data
14+
15+
Mayor McDaniels and Peter Charles (aka PC Principal) are concerned that even after removing direct identifiers such as names, SSNs, and IDs, students may still be easily re-identified in the yearly assessment dataset and have their math and reading scores revealed. For example, everyone in school knows that Tolkien Williams is the wealthiest kid in the whole town, whereas Kenny and his sister Karen are from a very poor family.
16+
17+
They have requested our assistance to compute this risk of disclosure, implement strategies to minimize it, and determine information loss for the anonymized dataset they would like to make public to other school board members\*. They asked for our help, and we will be using the sdcMicro package for this purpose.
18+
19+
In summary, our client has three main questions to for us (and none of them involve finding out who keeps killing Keny and how come he keeps coming back to life):
20+
21+
*Q1. What is the level of disclosure risk associated with this dataset?*
22+
23+
*Q2. How can the risk of re-identification be significantly reduced?*
24+
25+
*Q3. What would be the utility and information loss after implementing the anonymization strategies?*
26+
27+
\*Caveat: We have a relative small dataset for this exercise (rows and columns, so we can't strive for some of the tresholds recommended in the literature.
28+
29+
#### Package & Data
30+
31+
```{r}
32+
#Load package
33+
library(sdcMicro)
34+
```
35+
36+
#### Read the dataset
37+
38+
```{r}
39+
# Read the CSV dataset into a data frame
40+
data <- read.csv("southpark-sdc.csv")
41+
```
42+
43+
#### Taking a closer look at the variables included in this dataset
44+
45+
```{r}
46+
# Show the list of variable names and first rows
47+
head(df)
48+
# Check the structure of the data frame
49+
str(df)
50+
```
51+
#### Planning
52+
53+
To develop a disclosure scenario, you must go beyond understanding the contents of your datasets. Consider the potential motivations of malicious actors, identify the data they might access, and explore how that data—combined with publicly available information—could be linked to your dataset to reveal sensitive information. This involves making assumptions about what external data others might possess. If you're uncertain, it's best to create multiple scenarios based on different assumptions and assess the disclosure risk for each one.
54+
55+
First, let's identify:
56+
What are the direct identifiers present on this dataset?
57+
A: *stu_id*, *SSN*, *name* and, *dob*.
58+
59+
These should be removed or replaced with tokens before we share the dataset openly.
60+
61+
#### Data Prep - Converting variables
62+
63+
Based on the structure of the data frame, we will need to convert some of the variables first.
64+
65+
```{r}
66+
fname = "southpark-sdc.csv"
67+
file <- read.csv(fname)
68+
file <- varToFactor(obj=file, var=c("zip","age", "sex","race","ethn", "snap", "income", "learn_dis","phys_dis"))
69+
#Convert to numeric math_sc and read_sc
70+
file <- varToNumeric(obj=file, var=c("math_sc", "read_sc"))
71+
```
72+
73+
#### Q1. What is the level of disclosure risk associated with this dataset?
74+
75+
To answer this question we have to set up an SDC problem. In other words we must select variables and create an object of class *sdcMicroObj* for the SDC process in *R.*
76+
77+
```{r}
78+
# Select variables for creating sdcMicro object
79+
# All variable names should correspond to the names in the data file
80+
# Select key variables, which in our case are all the categorical variables listed above
81+
82+
sdcInitial <- createSdcObj(dat=file,
83+
keyVars=c("zip","age", "sex","race","ethn", "snap", "income", "learn_dis","phys_dis"),
84+
numVars=c("math_sc", "read_sc"),
85+
weightVar=NULL,
86+
hhId=NULL,
87+
strataVar=NULL,
88+
pramVars=NULL,
89+
excludeVars=c("ssn", "name", "dob"), #For now, we won't include stu_id; we'll revisit this decision shortly.
90+
seed=0,
91+
randomizeRecords=FALSE,
92+
alpha=c(1))
93+
# Summary of object
94+
sdcInitial
95+
```
96+
97+
What about the stu_id? Why we are keeping it?
98+
99+
Check the results below, and the number of observations that violate 2-5 anonymity. What does that mean?
100+
101+
##### Time to calculate the risk of re-identification for the entire dataset
102+
103+
```{r}
104+
# The threshold depends on the size of the dataset and the access control (conservative number for large surveys are 0.04)
105+
sdcInitial@risk$global$risk
106+
```
107+
108+
Was it good?
109+
110+
Let's see if we can get that lowered to less than 15% and a k=5.
111+
112+
We have to get some work done to reduce that. But that would be the first answer to our clients.
113+
114+
We can inspect this issue a little further before moving to the second question.
115+
116+
##### Which observations/subjects have a higher risk to be re-identified?
117+
118+
```{r}
119+
sdcInitial@risk$individual
120+
```
121+
122+
##### How many combinations of key variables each record have?
123+
124+
```{r}
125+
#Categorical variable risk
126+
#Frequency of the particular combination of key variables (quasi-identifiers) for each record in the sample
127+
freq(sdcInitial, type = 'fk')
128+
```
129+
130+
#### Q2. How can the risk of re-identification be significantly reduced?
131+
132+
We learned that there are different techniques to de-identify and anonymize datasets.
133+
134+
First, let's use some non-perturbative methods such as global recoding and top and bottom coding techniques.
135+
136+
*Income*
137+
138+
As mentioned before, the household income of some students may pose a risk to their privacy in this dataset. So let's see if using top and bottom recoding could help reducing that risk.
139+
140+
```{r}
141+
# Frequencies of income before recoding
142+
table(sdcInitial@manipKeyVars$income)
143+
```
144+
145+
```{r}
146+
## Recode variable income (top coding)
147+
sdcInitial <- groupAndRename(obj= sdcInitial, var= c("income"), before=c("200,000-249,999","500,000+"), after=c("200,000+"))
148+
149+
## Recode variable income (bottom coding)
150+
sdcInitial <- groupAndRename(obj= sdcInitial, var= c("income"), before=c("10,000-24,999","75,000-99,999"), after=c("10,000-99,999"))
151+
```
152+
153+
*Age*
154+
155+
```{r}
156+
# Frequencies of age before recoding
157+
table(sdcInitial@manipKeyVars$age)
158+
```
159+
160+
```{r}
161+
#Recode Age (top and bottom)
162+
sdcInitial <- groupAndRename(obj= sdcInitial, var= c("age"), before=c("8", "9", "10"), after=c("8-10"))
163+
sdcInitial <- groupAndRename(obj= sdcInitial, var= c("age"), before=c("11", "12", "13"), after=c("11-13"))
164+
```
165+
166+
##### **Note: Undoing things**
167+
168+
```{r}
169+
# Important note: If the results are reassigned to the same sdcMicro object, it is possible to undo the last step in the SDC process. Using:
170+
# sdcInitial <- undolast(sdcInitial)
171+
# It might be helpful to tune some parameters. The results of the last step, however, will be lost after undoing that step.
172+
# We can also choose to assign results to a new sdcMicro object this time, using:
173+
# sdc1 <- functionName(sdcInitial) specially if you anticipate creating multiple sdc problems to test out.Otherwise, you can delete the object and re-run the code when needed
174+
```
175+
176+
Let's see if those steps lowered the risk of re-identification of subjects.
177+
178+
```{r}
179+
sdcInitial@risk$global$risk
180+
181+
# We could also check for risk for each individual case that exceeds 5%
182+
# sum(sdcInitial@risk$individual[,1] > 0.05)
183+
184+
# Let's print to check it
185+
186+
print(sdcInitial, 'kAnon')
187+
```
188+
189+
Only a tiny improvement compared to the original dataset. Let's try something else.
190+
191+
##### Time for a more powerful technique. Let's use the k-anonymization function!
192+
193+
```{r}
194+
#Local suppression to obtain k-anonymity
195+
sdcInitial <- kAnon(sdcInitial, k=c(5))
196+
197+
# Setting the parameters that we are aiming for at least 5 observations sharing the same attributes in the dataset.
198+
#Alternatively, we could have set the order of importance for each keyvariables
199+
#sdcInitial <- kAnon(sdcInitial, importance=c(9,5,6,7,8,4,3,1,2), k=c(5))
200+
```
201+
202+
More on importance (pg. 50): <https://cran.r-project.org/web/packages/sdcMicro/sdcMicro.pdf>
203+
204+
Time to check it again:
205+
206+
```{r}
207+
sdcInitial@risk$global$risk
208+
```
209+
210+
Alright! We managed lower the risk of identification from 81% to about 10% and now we have 0 observations violating 5-anonymity! We can tell our clients we used some recoding, but supression via k-anonymity was necessary to improve the privacy level of this dataset.
211+
212+
#### Q3. What would be the utility and information loss after implementing anonymization strategies?
213+
214+
##### Time to measure the utility and information loss for the anonymized dataset.
215+
216+
```{r}
217+
#First we retrieve the total suppression actions performed for each key variable
218+
print(sdcInitial, 'ls')
219+
```
220+
221+
```{r}
222+
#We can also compare the number of NAs before and after our interventions
223+
# Store the names of all categorical key variables in a vector
224+
namesKeyVars <- names(sdcInitial@manipKeyVars)
225+
226+
# Matrix to store the number of missing values (NA) before and after anonymization
227+
NAcount <- matrix(NA, nrow = 2, ncol = length(namesKeyVars))
228+
colnames(NAcount) <- c(paste0('NA', namesKeyVars)) # column names
229+
rownames(NAcount) <- c('initial', 'treated') # row names
230+
231+
# NA count in all key variables (NOTE: only those coded NA are counted)
232+
for(i in 1:length(namesKeyVars)) {
233+
NAcount[1, i] <- sum(is.na(sdcInitial@origData[,namesKeyVars[i]]))
234+
NAcount[2, i] <- sum(is.na(sdcInitial@manipKeyVars[,i]))}
235+
236+
# Show results
237+
NAcount
238+
```
239+
240+
Based on the results we can tell PC Principal and the Mayor that the supression greatly reduced the level of detail about the income and the race of the students. We could continue exploring removing other less relevant variables and explore other functions in this package or even considering different ways of recoding that variable. But let's call the day for today, and export the anonymized dataset we produced.
241+
242+
##### Creating a new random number to replace the student ID
243+
244+
```{r}
245+
## Adding a new randomized ID-variable
246+
sdcInitial <- createNewID(sdcInitial, newID="ID", withinVar="stu_id")
247+
```
248+
249+
##### Exporting the anonymized dataset
250+
251+
```{r}
252+
write.csv(extractManipData(sdcInitial), "southpark-anon2.csv", row.names = FALSE)
253+
```

0 commit comments

Comments
 (0)