|
| 1 | +--- |
| 2 | +title: "Data Anonymization with R's sdcMicro Package" |
| 3 | +author: "Renata Goncalves Curty - UCSB Library, Research Data Services" |
| 4 | +date: "2025-05-29" |
| 5 | +output: |
| 6 | + html_document: default |
| 7 | +--- |
| 8 | + |
| 9 | +```{r setup, include=FALSE} |
| 10 | +knitr::opts_chunk$set(echo = TRUE) |
| 11 | +``` |
| 12 | + |
| 13 | +## South Park Elementary School Data |
| 14 | + |
| 15 | +Mayor McDaniels and Peter Charles (aka PC Principal) are concerned that even after removing direct identifiers such as names, SSNs, and IDs, students may still be easily re-identified in the yearly assessment dataset and have their math and reading scores revealed. For example, everyone in school knows that Tolkien Williams is the wealthiest kid in the whole town, whereas Kenny and his sister Karen are from a very poor family. |
| 16 | + |
| 17 | +They have requested our assistance to compute this risk of disclosure, implement strategies to minimize it, and determine information loss for the anonymized dataset they would like to make public to other school board members\*. They asked for our help, and we will be using the sdcMicro package for this purpose. |
| 18 | + |
| 19 | +In summary, our client has three main questions to for us (and none of them involve finding out who keeps killing Keny and how come he keeps coming back to life): |
| 20 | + |
| 21 | +*Q1. What is the level of disclosure risk associated with this dataset?* |
| 22 | + |
| 23 | +*Q2. How can the risk of re-identification be significantly reduced?* |
| 24 | + |
| 25 | +*Q3. What would be the utility and information loss after implementing the anonymization strategies?* |
| 26 | + |
| 27 | +\*Caveat: We have a relative small dataset for this exercise (rows and columns, so we can't strive for some of the tresholds recommended in the literature. |
| 28 | + |
| 29 | +#### Package & Data |
| 30 | + |
| 31 | +```{r} |
| 32 | +#Load package |
| 33 | +library(sdcMicro) |
| 34 | +``` |
| 35 | + |
| 36 | +#### Read the dataset |
| 37 | + |
| 38 | +```{r} |
| 39 | +# Read the CSV dataset into a data frame |
| 40 | +data <- read.csv("southpark-sdc.csv") |
| 41 | +``` |
| 42 | + |
| 43 | +#### Taking a closer look at the variables included in this dataset |
| 44 | + |
| 45 | +```{r} |
| 46 | +# Show the list of variable names and first rows |
| 47 | +head(df) |
| 48 | +# Check the structure of the data frame |
| 49 | +str(df) |
| 50 | +``` |
| 51 | +#### Planning |
| 52 | + |
| 53 | +To develop a disclosure scenario, you must go beyond understanding the contents of your datasets. Consider the potential motivations of malicious actors, identify the data they might access, and explore how that data—combined with publicly available information—could be linked to your dataset to reveal sensitive information. This involves making assumptions about what external data others might possess. If you're uncertain, it's best to create multiple scenarios based on different assumptions and assess the disclosure risk for each one. |
| 54 | + |
| 55 | +First, let's identify: |
| 56 | +What are the direct identifiers present on this dataset? |
| 57 | +A: *stu_id*, *SSN*, *name* and, *dob*. |
| 58 | + |
| 59 | +These should be removed or replaced with tokens before we share the dataset openly. |
| 60 | + |
| 61 | +#### Data Prep - Converting variables |
| 62 | + |
| 63 | +Based on the structure of the data frame, we will need to convert some of the variables first. |
| 64 | + |
| 65 | +```{r} |
| 66 | +fname = "southpark-sdc.csv" |
| 67 | +file <- read.csv(fname) |
| 68 | +file <- varToFactor(obj=file, var=c("zip","age", "sex","race","ethn", "snap", "income", "learn_dis","phys_dis")) |
| 69 | +#Convert to numeric math_sc and read_sc |
| 70 | +file <- varToNumeric(obj=file, var=c("math_sc", "read_sc")) |
| 71 | +``` |
| 72 | + |
| 73 | +#### Q1. What is the level of disclosure risk associated with this dataset? |
| 74 | + |
| 75 | +To answer this question we have to set up an SDC problem. In other words we must select variables and create an object of class *sdcMicroObj* for the SDC process in *R.* |
| 76 | + |
| 77 | +```{r} |
| 78 | +# Select variables for creating sdcMicro object |
| 79 | +# All variable names should correspond to the names in the data file |
| 80 | +# Select key variables, which in our case are all the categorical variables listed above |
| 81 | +
|
| 82 | +sdcInitial <- createSdcObj(dat=file, |
| 83 | + keyVars=c("zip","age", "sex","race","ethn", "snap", "income", "learn_dis","phys_dis"), |
| 84 | + numVars=c("math_sc", "read_sc"), |
| 85 | + weightVar=NULL, |
| 86 | + hhId=NULL, |
| 87 | + strataVar=NULL, |
| 88 | + pramVars=NULL, |
| 89 | + excludeVars=c("ssn", "name", "dob"), #For now, we won't include stu_id; we'll revisit this decision shortly. |
| 90 | + seed=0, |
| 91 | + randomizeRecords=FALSE, |
| 92 | + alpha=c(1)) |
| 93 | +# Summary of object |
| 94 | +sdcInitial |
| 95 | +``` |
| 96 | + |
| 97 | +What about the stu_id? Why we are keeping it? |
| 98 | + |
| 99 | +Check the results below, and the number of observations that violate 2-5 anonymity. What does that mean? |
| 100 | + |
| 101 | +##### Time to calculate the risk of re-identification for the entire dataset |
| 102 | + |
| 103 | +```{r} |
| 104 | +# The threshold depends on the size of the dataset and the access control (conservative number for large surveys are 0.04) |
| 105 | +sdcInitial@risk$global$risk |
| 106 | +``` |
| 107 | + |
| 108 | +Was it good? |
| 109 | + |
| 110 | +Let's see if we can get that lowered to less than 15% and a k=5. |
| 111 | + |
| 112 | +We have to get some work done to reduce that. But that would be the first answer to our clients. |
| 113 | + |
| 114 | +We can inspect this issue a little further before moving to the second question. |
| 115 | + |
| 116 | +##### Which observations/subjects have a higher risk to be re-identified? |
| 117 | + |
| 118 | +```{r} |
| 119 | +sdcInitial@risk$individual |
| 120 | +``` |
| 121 | + |
| 122 | +##### How many combinations of key variables each record have? |
| 123 | + |
| 124 | +```{r} |
| 125 | +#Categorical variable risk |
| 126 | +#Frequency of the particular combination of key variables (quasi-identifiers) for each record in the sample |
| 127 | +freq(sdcInitial, type = 'fk') |
| 128 | +``` |
| 129 | + |
| 130 | +#### Q2. How can the risk of re-identification be significantly reduced? |
| 131 | + |
| 132 | +We learned that there are different techniques to de-identify and anonymize datasets. |
| 133 | + |
| 134 | +First, let's use some non-perturbative methods such as global recoding and top and bottom coding techniques. |
| 135 | + |
| 136 | +*Income* |
| 137 | + |
| 138 | +As mentioned before, the household income of some students may pose a risk to their privacy in this dataset. So let's see if using top and bottom recoding could help reducing that risk. |
| 139 | + |
| 140 | +```{r} |
| 141 | +# Frequencies of income before recoding |
| 142 | +table(sdcInitial@manipKeyVars$income) |
| 143 | +``` |
| 144 | + |
| 145 | +```{r} |
| 146 | +## Recode variable income (top coding) |
| 147 | +sdcInitial <- groupAndRename(obj= sdcInitial, var= c("income"), before=c("200,000-249,999","500,000+"), after=c("200,000+")) |
| 148 | +
|
| 149 | +## Recode variable income (bottom coding) |
| 150 | +sdcInitial <- groupAndRename(obj= sdcInitial, var= c("income"), before=c("10,000-24,999","75,000-99,999"), after=c("10,000-99,999")) |
| 151 | +``` |
| 152 | + |
| 153 | +*Age* |
| 154 | + |
| 155 | +```{r} |
| 156 | +# Frequencies of age before recoding |
| 157 | +table(sdcInitial@manipKeyVars$age) |
| 158 | +``` |
| 159 | + |
| 160 | +```{r} |
| 161 | +#Recode Age (top and bottom) |
| 162 | +sdcInitial <- groupAndRename(obj= sdcInitial, var= c("age"), before=c("8", "9", "10"), after=c("8-10")) |
| 163 | +sdcInitial <- groupAndRename(obj= sdcInitial, var= c("age"), before=c("11", "12", "13"), after=c("11-13")) |
| 164 | +``` |
| 165 | + |
| 166 | +##### **Note: Undoing things** |
| 167 | + |
| 168 | +```{r} |
| 169 | +# Important note: If the results are reassigned to the same sdcMicro object, it is possible to undo the last step in the SDC process. Using: |
| 170 | +# sdcInitial <- undolast(sdcInitial) |
| 171 | +# It might be helpful to tune some parameters. The results of the last step, however, will be lost after undoing that step. |
| 172 | +# We can also choose to assign results to a new sdcMicro object this time, using: |
| 173 | +# sdc1 <- functionName(sdcInitial) specially if you anticipate creating multiple sdc problems to test out.Otherwise, you can delete the object and re-run the code when needed |
| 174 | +``` |
| 175 | + |
| 176 | +Let's see if those steps lowered the risk of re-identification of subjects. |
| 177 | + |
| 178 | +```{r} |
| 179 | +sdcInitial@risk$global$risk |
| 180 | +
|
| 181 | +# We could also check for risk for each individual case that exceeds 5% |
| 182 | +# sum(sdcInitial@risk$individual[,1] > 0.05) |
| 183 | +
|
| 184 | +# Let's print to check it |
| 185 | +
|
| 186 | +print(sdcInitial, 'kAnon') |
| 187 | +``` |
| 188 | + |
| 189 | +Only a tiny improvement compared to the original dataset. Let's try something else. |
| 190 | + |
| 191 | +##### Time for a more powerful technique. Let's use the k-anonymization function! |
| 192 | + |
| 193 | +```{r} |
| 194 | +#Local suppression to obtain k-anonymity |
| 195 | +sdcInitial <- kAnon(sdcInitial, k=c(5)) |
| 196 | + |
| 197 | +# Setting the parameters that we are aiming for at least 5 observations sharing the same attributes in the dataset. |
| 198 | +#Alternatively, we could have set the order of importance for each keyvariables |
| 199 | +#sdcInitial <- kAnon(sdcInitial, importance=c(9,5,6,7,8,4,3,1,2), k=c(5)) |
| 200 | +``` |
| 201 | + |
| 202 | +More on importance (pg. 50): <https://cran.r-project.org/web/packages/sdcMicro/sdcMicro.pdf> |
| 203 | + |
| 204 | +Time to check it again: |
| 205 | + |
| 206 | +```{r} |
| 207 | +sdcInitial@risk$global$risk |
| 208 | +``` |
| 209 | + |
| 210 | +Alright! We managed lower the risk of identification from 81% to about 10% and now we have 0 observations violating 5-anonymity! We can tell our clients we used some recoding, but supression via k-anonymity was necessary to improve the privacy level of this dataset. |
| 211 | + |
| 212 | +#### Q3. What would be the utility and information loss after implementing anonymization strategies? |
| 213 | + |
| 214 | +##### Time to measure the utility and information loss for the anonymized dataset. |
| 215 | + |
| 216 | +```{r} |
| 217 | +#First we retrieve the total suppression actions performed for each key variable |
| 218 | +print(sdcInitial, 'ls') |
| 219 | +``` |
| 220 | + |
| 221 | +```{r} |
| 222 | +#We can also compare the number of NAs before and after our interventions |
| 223 | +# Store the names of all categorical key variables in a vector |
| 224 | +namesKeyVars <- names(sdcInitial@manipKeyVars) |
| 225 | +
|
| 226 | +# Matrix to store the number of missing values (NA) before and after anonymization |
| 227 | +NAcount <- matrix(NA, nrow = 2, ncol = length(namesKeyVars)) |
| 228 | +colnames(NAcount) <- c(paste0('NA', namesKeyVars)) # column names |
| 229 | +rownames(NAcount) <- c('initial', 'treated') # row names |
| 230 | +
|
| 231 | +# NA count in all key variables (NOTE: only those coded NA are counted) |
| 232 | +for(i in 1:length(namesKeyVars)) { |
| 233 | + NAcount[1, i] <- sum(is.na(sdcInitial@origData[,namesKeyVars[i]])) |
| 234 | + NAcount[2, i] <- sum(is.na(sdcInitial@manipKeyVars[,i]))} |
| 235 | +
|
| 236 | +# Show results |
| 237 | +NAcount |
| 238 | +``` |
| 239 | + |
| 240 | +Based on the results we can tell PC Principal and the Mayor that the supression greatly reduced the level of detail about the income and the race of the students. We could continue exploring removing other less relevant variables and explore other functions in this package or even considering different ways of recoding that variable. But let's call the day for today, and export the anonymized dataset we produced. |
| 241 | + |
| 242 | +##### Creating a new random number to replace the student ID |
| 243 | + |
| 244 | +```{r} |
| 245 | +## Adding a new randomized ID-variable |
| 246 | +sdcInitial <- createNewID(sdcInitial, newID="ID", withinVar="stu_id") |
| 247 | +``` |
| 248 | + |
| 249 | +##### Exporting the anonymized dataset |
| 250 | + |
| 251 | +```{r} |
| 252 | +write.csv(extractManipData(sdcInitial), "southpark-anon2.csv", row.names = FALSE) |
| 253 | +``` |
0 commit comments