Mitigating Bias in AI Algorithms: A Healthcare Guide to Threshold Adjustment

2025-04-11

Introduction

The annotated code presented here accompanies our detailed Playbook, which addresses the gap in accessible, practical tools for mitigating algorithmic bias in healthcare by illustrating a simple method of subgroup threshold adjustment that can improve the fairness of both commercial and in-house classification models in the EHR.

Note: in-line comments are provided within code chunks wherever you need to edit parameters specific to your data/use-case.

This resource is the fruit of collaboration across NYC Health + Hospitals and New York University’s Center for Health Data Science and Grossman School of Medicine’s Department of Population Health, with funding from Schmidt Futures and Rockefeller Philanthropy Advisors.

Step 1: Prepare
Step 2: Assess Bias
Step 3: Mitigate Bias
Step 4: Assess Mitigation Success
________________________________________________________________________________

Step 1: Prepare

Setup Your Environment

# If the following libraries are not already installed, do so before loading them:
library(tidyverse) 
library(ROCR) 
library(janitor)
library(dplyr)
library(tidyr)
library(glue)
library(ggplot2)
library(ggpubr)
library(purrr)
library(DBI)
library(odbc)
library(patchwork)
library(knitr)
library(readr)
library(binom)
library(kableExtra)
library(yardstick)
library(pROC)

Get Your Data

There are two ways to pull your dataset.

  1. Connect to your system’s database. We suggest using an ODBC connection and DBI interface with a SQL query. This is the preferred method, as it does not require saving any PHI locally, but since we are using a synthetic dataset for the purposes of this playbook, we are using Option 2:

  2. Connect to a csv flat file. We’re calling our dataset ‘BIA’ for Bias In Algorithms. You can call yours whatever you like, but if you do change it, be sure to replace all other occurrences of ‘BIA’ in the code with your new name.

BIA <- read_csv("BIA_Synthetic_Data.csv") # Replace with your file name
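
If you go with Option 1 instead, a minimal sketch of the ODBC/DBI pull could look like the following; the driver, server, database, and query names here are placeholders for your system's values, not part of our setup:

# con <- DBI::dbConnect(odbc::odbc(),
#                       Driver = "SQL Server",      # your ODBC driver
#                       Server = "your_server",     # your database server
#                       Database = "your_database") # your database name
# BIA <- DBI::dbGetQuery(con, "SELECT * FROM your_score_table") # your SQL query
# DBI::dbDisconnect(con)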

Clean Your Data

Your data needs to contain the following columns:
* ‘id’ (unique ID)
* ‘proba’ (probability (risk score) between ‘0’ and ‘1’)
* ‘outcome’ (binary outcome, where ‘1’ is the outcome we are trying to predict and ‘0’ is the absence of that outcome)
* and the sensitive variables you want to assess for bias. We’re using ‘race’, ‘sex’, ‘age’, ‘language’, and ‘insurance’.

If you need to rename any variables, do so now.

# Clean your data so that your probability variable is 'proba' 
# Rename any of your sensitive variables how you'd like. For us, that's updating our insurance and age names: 
BIA = janitor::clean_names(BIA) %>% # We're making all variable names lowercase,
  rename(proba = score, # Renaming our score variable as 'proba',
         insurance = financial_class, # Renaming our financial_class variable as 'insurance',
         age = age_category) # Renaming our age_category variable as 'age'

Check to make sure your outcome variable is binary. If it’s not, combine outcome categories so that it is.

categories <- unique(BIA$outcome) 
knitr::kable(categories, col.names = "Outcomes")
Outcomes
0
1

Great, we have two outcome values. We can proceed!
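
If your outcome had more than two values, this is where you would collapse it; a hypothetical example (these outcome levels are invented, not from our dataset):

# BIA <- BIA %>%
#   mutate(outcome = if_else(outcome %in% c("event", "severe event"), 1, 0))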

Define Your 'Sensitive' (Sociodemographic) Variables

List categorical variables of interest here by their column names in the above data frame.

sensitive = list('race', 'sex', 'age', 'language', 'insurance') # Replace variables as needed

Review ‘n’ counts to identify subgroups comprising less than 1% of the population

knitr::kable(lapply(sensitive, function(s){
  BIA %>% count(.data[[s]]) %>% mutate(fraction = n/sum(n)) %>% arrange(fraction)
}), caption = "Subgroup counts")
Subgroup counts
race n fraction
Something Else 2826 0.05652
Asian 5619 0.11238
Native American 5631 0.11262
Unknown 5688 0.11376
Two or more races 5781 0.11562
African American 5963 0.11926
White 8739 0.17478
Hispanic 9753 0.19506
sex n fraction
Male 24003 0.48006
Female 25997 0.51994
age n fraction
>45 16530 0.33060
25-45 16718 0.33436
<25 16752 0.33504
language n fraction
Other 6886 0.13772
Spanish 19273 0.38546
English 23841 0.47682
insurance n fraction
No Fault 6454 0.12908
Medicare 6524 0.13048
Self-Pay 6538 0.13076
Other Government 6588 0.13176
Medicaid 6619 0.13238
Workers Comp 6659 0.13318
Commercial 10618 0.21236

All our subgroups comprise >1% so we can proceed, but if they did not, this would be the step where we logically combine groups or drop those too small for independent analysis.
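
For illustration only (these subgroup labels are hypothetical and not in our synthetic data), combining or dropping small groups could look like:

# BIA <- BIA %>%
#   mutate(race = if_else(race %in% c("Middle Eastern", "Pacific Islander"),
#                         "Something Else", race)) %>% # combine into an existing bucket
#   filter(race != "Declined") # or drop a group too small to analyze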

Establish Thresholds

After a review of model documentation and interviews with clinicians, set your probability score cutoffs (thresholds) indicating levels of risk. For us, the following thresholds were identified:

  • Low risk: 0%-7.9%
  • Med risk: 8%-14.9%
  • High risk: 15%-100%

You only need one threshold, but if you are considering your options or evaluating low, high, and medium risk thresholds, you can list multiple cutoffs at this stage.

thresholds = c(0.08, 0.15) # Specify your threshold(s) as values between 0 and 1

Step 2: Assess Bias

Probability Distributions

To orient ourselves to the data, we first look at distributions of the probability of the outcome for each subgroup within each class (sensitive variable/sociodemographic group). The thresholds we set above are output as vertical lines, so we can get a visual sense of how different thresholds flag different subgroups.

lapply(sensitive, function(s){
 p = BIA %>% ggplot(aes(x = proba, color = .data[[s]])) + 
    geom_vline(xintercept = thresholds) + 
    geom_density(bounds = c(0, 1), linewidth = 2) +
    ggtitle(paste("Probability distribution for", s))
 p
})
(One probability density plot is output here for each of the five sensitive variables.)

Assess Overall Performance

First, we define some helper functions to output: (1) the Receiver Operating Characteristic (ROC) curve with thresholds highlighted as dots, (2) the Precision-Recall Curve (PRC) with thresholds highlighted as dots, and (3) the calibration curve with thresholds highlighted as vertical lines.

# You don't need to edit the parameters in any of the below, just run this chunk to define the functions.

## Area under curve with trapezoid 
helper.trapezoid <- function(df) {
  if(!all(c("x", "y") %in% colnames(df))) {
    stop('df must have x and y columns')
  }
  df.trap <- df %>% 
    arrange(x) %>% 
    filter(!is.na(x), !is.na(y)) %>% 
    mutate(x.diff = x - lag(x, 1),
           y.mean = (y + lag(y, 1))/2,
           xy = x.diff * y.mean) 
  auc <- df.trap %>% 
    summarize(auc = sum(xy, na.rm = T)) %>% .$auc
  return(auc)
}
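
# A quick sanity check of the trapezoid helper on a toy curve (illustrative only, not part of the analysis):
# helper.trapezoid(tibble(x = c(0, 0.5, 1), y = c(1, 0.8, 0.4))) # 0.5*(1+0.8)/2 + 0.5*(0.8+0.4)/2 = 0.75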

## ROC curve and area under 
analytic.calc_roc = function(predictions, labels, group_name = NA_character_){
  if(length(unique(labels)) == 1){
    ## shortcut to return nothing.
    return(tibble(group = group_name,
                  auroc = NA, 
                  roc = list(tibble(group = group_name)) ) )
  }
  pred = ROCR::prediction(predictions, labels)
  perf = ROCR::performance(pred, 'auc')
  res.auc = perf@y.values[[1]]
  
  perf = ROCR::performance(pred, 'tpr', 'fpr')
  plot.roc = tibble(cutoff = perf@alpha.values[[1]], 
                    tpr = perf@y.values[[1]],
                    fpr = perf@x.values[[1]] )
  if(!is.na(group_name)){
    plot.roc = plot.roc %>% mutate(group = group_name)
  }else{plot.roc = plot.roc %>% mutate(group = 'total')}
  return(tibble(group = group_name,
                auroc = res.auc, 
                roc = list(plot.roc)))
}

## Precision-recall curve and area under
analytic.calc_prc = function(predictions, labels, group_name = NA_character_){
  if(length(unique(labels)) == 1){
    ## shortcut to return nothing.
    return(tibble(group = group_name,
                  auprc = NA, 
                  prc = list(tibble(group = group_name)) ) )
  }
  pred = ROCR::prediction(predictions, labels)
  perf = ROCR::performance(pred, "prec", "rec")
  perf_npv = ROCR::performance(pred, "npv", 'rpp') 
  
  plot.prc = tibble(cutoff = perf@alpha.values[[1]],
                    preCIsion.raw = perf@y.values[[1]],
                    recall = perf@x.values[[1]],
                    npv = perf_npv@y.values[[1]],
                    alert_rpp = perf_npv@x.values[[1]]) %>% 
    arrange(cutoff) %>% 
    mutate(preCIsion = cummax(preCIsion.raw) ) %>% 
    select(cutoff, preCIsion, recall, preCIsion.raw, npv, alert_rpp) %>% 
    arrange(desc(cutoff))
  if(!is.na(group_name)){
    plot.prc = plot.prc %>% mutate(group = group_name)
  }else{plot.prc = plot.prc %>% mutate(group = 'total')}
  res.auprc <- helper.trapezoid(plot.prc %>% select(x = recall, y = preCIsion))
  return(tibble(group = group_name,
                auprc = res.auprc, 
                prc = list(plot.prc)) )
}

## Calibration curves 
analytic.form_ntiles = function(df, group_var = NA, groups = 10, z = 1.96, percentile_range = c(0, 1)){
  if(is.na(group_var)){
    df = df %>% mutate(group = 'total')
  }else{
    df = df %>% rename(group = !!group_var)
  }
  df %>% 
    group_by(group) %>% 
    mutate(chunk = ntile(proba, n = groups), center = (chunk * (100/groups) - ((100/groups)/2)) ) %>% 
    filter(center >= first(percentile_range)*100, center <= last(percentile_range)*100) %>% 
    group_by(group, center) %>% 
    summarize(label_mean = mean(outcome),
              model_prediction_mean = mean(proba),
              n = n()) %>% ungroup() %>% 
    mutate(se = sqrt((label_mean*(1-label_mean))/n),
           lower = label_mean - z*se, 
           upper = label_mean + z*se) %>%
    mutate(lower = pmax(lower, 0), upper = pmin(upper, 1))
}

Now, we use the above functions to get the AUC, PR-AUC, and Calibration Curve for our data.

# Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.
# Area under
total.roc = analytic.calc_roc(BIA$proba, BIA$outcome)
writeLines(glue::glue('Area under the ROC curve is: {round(100*first(total.roc$auroc), 2)}%'))
## Area under the ROC curve is: 86.14%
total.prc = analytic.calc_prc(BIA$proba, BIA$outcome)
writeLines(glue::glue('Area under the Precision-Recall curve is: {round(100*first(total.prc$auprc), 2)}%'))
## Area under the Precision-Recall curve is: 44.7%
# ROC
temp.a = total.roc %>% pull(roc) %>% bind_rows()
a = temp.a %>% 
  ggplot(aes(x = fpr, y = tpr, color = group)) + 
  geom_vline(xintercept = c(0, 1)) + geom_hline(yintercept = c(0, 1)) + 
  geom_abline(color = 'grey50', linetype = 'dashed') + 
  geom_line(size = 1.5) + 
  geom_point(data = lapply(thresholds, function(t){temp.a %>% arrange(desc(cutoff)) %>% filter(cutoff <= t) %>% slice(1)}) %>% bind_rows(), mapping = aes(fill = group), size = 3, stroke = 0.8, pch = 21, color = 'grey20') + theme_minimal() + coord_fixed() + ggtitle('AUROC') 

#PRC
temp.b = total.prc %>% pull(prc) %>% bind_rows() 
b = temp.b %>%
  ggplot(aes(x = recall, y = preCIsion, color = group)) + 
  geom_vline(xintercept = c(0, 1)) + 
  geom_hline(yintercept = c(0, 1)) + 
  geom_hline(data = temp.b %>% arrange(cutoff) %>% group_by(group) %>% slice(1) %>% ungroup(),
             mapping = aes(yintercept = preCIsion),
             color = 'grey50', linetype = 'dashed') + 
  geom_point(inherit.aes = T, mapping = aes(y = preCIsion.raw), size = 1, alpha = 0.1) + 
  geom_line(size = 1.5) + 
  geom_point(data = lapply(thresholds, function(t){temp.b %>% arrange(desc(cutoff)) %>% filter(cutoff <= t) %>% slice(1)}) %>% bind_rows(), 
             mapping = aes(fill = group), size = 3, stroke = 0.8, pch = 21, color = 'grey20') + 
  theme_minimal() + coord_fixed() + ggtitle('PR-AUC')

#Calibration
c = analytic.form_ntiles(BIA, groups =10) %>% 
  ggplot(aes(x = model_prediction_mean, y = label_mean, color = group)) + 
  #facet_grid(. ~ group) + 
  geom_vline(xintercept = c(0, 1)) + 
  geom_hline(yintercept = c(0, 1)) + # bounding box
  geom_abline(slope = 1, intercept = 0, color = 'grey50', linetype = 'dashed') + 
  geom_vline(xintercept = thresholds, color = 'grey20') + 
  geom_point() + 
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.02) + 
  geom_smooth(method= 'lm', color = 'grey20', se = FALSE) + 
  theme_minimal() + 
  coord_fixed() + 
  ylim(0, 1) + 
  xlim(0, 1) + 
  xlab('Mean Estimated Risk') + ylab('Mean Observed Positives') + ggtitle('Calibration')
plot = ggarrange(a, b, c, ncol = 3, common.legend = TRUE, legend = 'none')
plot

Assess Overall Model Performance Within Each Class

Now, we look at AUC and calibration for each subgroup within each class. Keep in mind, disparities in AUROC suggest the model functions (“discriminates”, in the good sense) better or worse for some subgroups (e.g., due to representation within the data, availability of data, etc.). Because PR-AUC is swayed by outcome prevalence, and we thus expect groups with higher outcome rates to have higher PR-AUCs, we calculate them but don’t use them in our assessment of subgroup performance here.

# Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.
performances = lapply(sensitive, function(s){
  print(s)
    g = BIA %>% group_by(.data[[s]])
    df.list = g %>% group_split() 
    s.values = group_keys(g) %>% unlist() 
    if(length(df.list) == 1){
      warning(glue::glue("only one subgroup found for variable={s}, possibly an error"))
    }
    # For each value within the sensitive variable, e.g. for male within sex, calculate ROC and PRC
    temp.roc = mapply(function(temp.df, value){
      analytic.calc_roc(temp.df$proba, temp.df$outcome, group_name = value ) %>% 
        mutate(n_group = nrow(temp.df)) # subgroup sample size for clarity
    }, df.list, s.values ) %>% t() %>% as_tibble()
    temp.prc = mapply(function(temp.df, value){
      analytic.calc_prc(temp.df$proba, temp.df$outcome, group_name = value)
    }, df.list, s.values ) %>% t() %>% as_tibble()
    # Combine globals into one output of AUROCs and PR-AUCs
    global = left_join(temp.roc %>% select(group, n_group, auroc) %>% tidyr::unnest(cols = c(group, n_group, auroc)), temp.prc %>% select(group, auprc) %>% tidyr::unnest(cols = c(group, auprc)), by = join_by(group)) %>% mutate(group = glue::glue('{s}={group}'))
    # For plotting, unpack long list of each point along ROC and PR curves
    temp.a = temp.roc %>% pull(roc) %>% bind_rows() 
    temp.b = temp.prc %>% pull(prc) %>% bind_rows() 
    local = left_join(lapply(thresholds, function(t){temp.a %>% group_by(group) %>% arrange(desc(cutoff)) %>% 
        mutate(threshold_desired = t) %>% filter(cutoff <= threshold_desired) %>% slice(1)}) %>% bind_rows() , 
        lapply(thresholds, function(t){temp.b %>% group_by(group) %>% arrange(desc(cutoff)) %>% 
            mutate(threshold_desired = t) %>% filter(cutoff <= threshold_desired) %>% slice(1)}) %>% bind_rows() , #%>% mutate(curve = 'prc')
        by = c('group', 'threshold_desired'), suffix = c(".roc",  ".prc") ) %>% 
      mutate(variable = s) %>% 
      select(variable, group, threshold_desired, everything())
    # Generate graphs of a=ROC, b = PRC, c = calibration as above
    a = temp.a %>% 
      ggplot(aes(x = fpr, y = tpr, color = group)) + 
      geom_vline(xintercept = c(0, 1)) + geom_hline(yintercept = c(0, 1)) + 
      geom_abline(color = 'grey50', linetype = 'dashed') + 
      geom_line(size = 1.5) + 
      geom_point(data = local, 
                 mapping = aes(fill = group), size = 2, stroke = 0.8, pch = 21, color = 'grey20') + 
      theme_minimal() + coord_fixed() + 
      labs(color = s) +
      ggtitle(glue::glue("ROC Curve")) 
    b = temp.b %>%
      ggplot(aes(x = recall, y = preCIsion, color = group)) + 
      geom_vline(xintercept = c(0, 1)) + geom_hline(yintercept = c(0, 1)) + 
      geom_hline(data = . %>% arrange(cutoff) %>% group_by(group) %>% slice(1) %>% ungroup(),
                 mapping = aes(yintercept = preCIsion, color = group), #color = 'grey50'
                 linetype = 'dashed') + 
      geom_point(inherit.aes = T, mapping = aes(y = preCIsion.raw), size = 1, alpha = 0.1) + 
      geom_line(size = 1.5) + 
      geom_point(data = local, 
                 mapping = aes(fill = group), size = 2, stroke = 0.8, pch = 21, color = 'grey20') + 
      theme_minimal() + coord_fixed() + 
      guides(color = "none") +
      ggtitle(glue::glue("PRC Curve"))
    # Combine ROC and PR curves into one side-by-side
    ab = ggpubr::ggarrange(a, b, legend = 'none')
    # Calibration curves
    c = analytic.form_ntiles(BIA, groups = 10, group_var = s) %>% 
      ggplot(aes(x = model_prediction_mean, y = label_mean, color = group)) + 
      facet_grid(. ~ group) + 
      geom_vline(xintercept = c(0, 1)) + geom_hline(yintercept = c(0, 1)) + 
      geom_abline(slope = 1, intercept = 0, color = 'grey50', linetype = 'dashed') + 
      geom_vline(xintercept = thresholds, color = 'grey20') + 
      geom_point() + geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.02) + 
      geom_smooth(method= 'lm', color = 'grey20', se = FALSE) + 
      theme_minimal() + coord_fixed() + 
      ylim(0, 1) + xlim(0, 1) + 
      xlab('Mean Estimated Risk') + ylab('Mean Observed Positives') +
      ggtitle(glue::glue("Calibration Curve"))
    # Combine calibration below ROC+PR into one giant figure.
    fig = ggpubr::ggarrange(ab, c, ncol = 1, nrow = 2, common.legend = TRUE, legend = 'bottom', heights = c(1, 1.2) ) 
    print(fig)
    return(tibble(global=list(global), local=list(local)) ) 
  }) %>% bind_rows()
## [1] "race"

## [1] "sex"

## [1] "age"

## [1] "language"

## [1] "insurance"

# Unpack the listed outputs from above into dataframe
global_performances = performances %>% pull(global) %>% bind_rows() # output a table presenting the total AUROC and PR-AUC for each group within each sensitive variable. 
knitr::kable(global_performances)
group n_group auroc auprc
race=African American 5963 0.7683917 0.4266475
race=Asian 5619 0.9401181 0.5486946
race=Hispanic 9753 0.8290706 0.4097390
race=Native American 5631 0.9178439 0.4846723
race=Something Else 2826 0.8505488 0.3920923
race=Two or more races 5781 0.9227523 0.4387840
race=Unknown 5688 0.9259316 0.4638295
race=White 8739 0.9308959 0.5440543
sex=Female 25997 0.8576664 0.4612216
sex=Male 24003 0.8635712 0.4334274
age=25-45 16718 0.8615517 0.4521086
age=<25 16752 0.8719933 0.4709538
age=>45 16530 0.8494339 0.4195285
language=English 23841 0.8648657 0.4672416
language=Other 6886 0.8644197 0.4380724
language=Spanish 19273 0.8547416 0.4234835
insurance=Commercial 10618 0.8633560 0.4471157
insurance=Medicaid 6619 0.8030681 0.3897112
insurance=Medicare 6524 0.9063984 0.4815773
insurance=No Fault 6454 0.7857614 0.4086432
insurance=Other Government 6588 0.9317875 0.5100215
insurance=Self-Pay 6538 0.8781859 0.4648471
insurance=Workers Comp 6659 0.9003593 0.4868387

Assess Fairness Metrics by Class

Fairness metrics allow us to measure model bias at specific thresholds. This quantifies the real-world impact of our model on our patients at that threshold. Our fairness metric here is Equal Opportunity (subgroup FNR - referent FNR). Equal Opportunity measures whether people who will experience the outcome have a fair chance of being flagged as high risk at our chosen threshold.
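
As a toy illustration of the metric (numbers invented for this example, not drawn from our data):

fnr_referent = 0.27 # the referent subgroup misses 27% of its truly high-risk patients
fnr_subgroup = 0.53 # another subgroup misses 53% of its truly high-risk patients
abs(fnr_subgroup - fnr_referent) # equal opportunity difference = 0.26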

Create a dataframe of performance metrics and their confidence intervals (CIs) using the Agresti-Coull method for use in our bias check below.

# Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.
CIs = lapply(sensitive, function(s){
    lapply(thresholds, function(t){
      loc_ci = BIA %>% mutate(prediction = as.integer(proba >= t) ) %>% group_by(.data[[s]]) %>% 
        summarize(threshold = t, total = n(), 
                  pos = sum(outcome), neg = total - pos, # actual positive and actual negative
                  pp = sum(prediction), pn = total - pp, # predicted P and N where total = P+N
                  tp = sum((prediction + outcome) == 2), fp = pp - tp, # Predicted P = True + False P
                  tn = sum((prediction + outcome) == 0), fn = pn - tn,
                  neg_check = fp + tn, pos_check = tp + fn
        ) %>% 
        mutate(prev_percent = (round((pos/total) * 100, digits=4)), 
               alert_rate = pp/total, 
               ppv_precision = tp/pp, ppv_ci = binom.confint(tp, pp, method = 'ac')[4:6], 
               npv = tn/pn, 
               tpr_recall_sensitivity = tp/(pos), tpr_ci = binom.confint(tp, pos, method = 'ac')[4:6],
               tnr_specificity = tn/(neg), 
               fpr_type1 = fp/(neg),
               fnr_type2 = fn/(pos), fnr_ci = binom.confint(fn, pos, method = 'ac')[4:6] 
  )
    }) %>% bind_rows()
  }) %>% bind_rows()

Create a function called ‘bias_check’ to help us check for bias.

#  Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.
bias_check = function(l_perfs, variable_colname, group_colname, reference_group, p_threshold = 0.05, fairness_metric="not FNR"){
  if(!(variable_colname %in% colnames(l_perfs)) ){warning(glue::glue("Could not find the supplied variable column named={variable_colname}")); return(FALSE)}
  if(!(group_colname %in% colnames(l_perfs)) ){warning(glue::glue("Could not find the supplied group column named={group_colname}")); return(FALSE)}
  if(!(reference_group %in% unlist(l_perfs[group_colname])) ){
    t_output = paste0(unlist(l_perfs[group_colname]), collapse = ", ")
    warning(glue::glue("Could not find the reference group={reference_group} specified in the set of values: [{t_output}]"))
    return(FALSE)}
  if(!('pos' %in% colnames(l_perfs))){warning("Could not find the # of positive cases in the expected column named=pos"); return(FALSE)}
  if(!('total' %in% colnames(l_perfs))){warning("Could not find the # of total cases in the expected column named=total"); return(FALSE)}
  
  ref = l_perfs[which(unlist(l_perfs[group_colname]) == reference_group), variable_colname][[1]] %>% 
    bind_cols(l_perfs[which(unlist(l_perfs[group_colname]) == reference_group), c('pos', 'total')] %>% 
                mutate(prev = pos/total)) # unpack pos and total and calculate prev
  output = l_perfs[variable_colname][[1]] %>% 
    bind_cols(l_perfs %>% select(pos, total) %>% mutate(prev = pos/total)) %>% # row order assumed, # unpack pos and total and calculate prev
    mutate(delta_sign = sign(mean - ref$mean),
           eod = abs(mean - ref$mean)) %>% 
    mutate(lower_outside = !between(lower, first(ref$lower), first(ref$upper)),
           upper_outside = !between(upper, first(ref$lower), first(ref$upper)),
           outside_ci = lower_outside & upper_outside) %>% 
    rowwise() %>% # perform following mutate() one row at a time, CRUCIAL to not vectorize the prop.test logic
    mutate(p_value = prop.test(x=c(pos, ref$pos[1]), n=c(total, ref$total[1]))$p.value,
           prev_delta_sign = sign(prev - ref$prev[1])) %>% 
    ungroup() %>% # the ungroup stops the rowwise
    mutate(difference_check = eod > 0.05) %>%
    mutate(five_pp_diff = if_else(eod > 0.05, "BIAS", "NO BIAS")) %>% 
    mutate(ci_check = if_else(eod > 0.05 & outside_ci, "BIAS", "NO BIAS")) %>% 
    mutate(
      prev_delta_sign = if_else(p_value > p_threshold, 0, prev_delta_sign), # if prevs similar, sign = 0, a fuzzy similar
      composite_check = case_when( 
             ci_check == "NO BIAS"| is.na(ci_check) ~ "NO BIAS", # if first two checks are not TRUE -> output FALSE (aka no evidence of bias).
             # if p > threshold, the prevalences are similar, all we need is a non-overlapping CI from the ci_check 
             ci_check == "BIAS" & p_value > p_threshold ~ "BIAS", # note the ci_check == TRUE is guaranteed from the above line
             # if p <= threshold, the prevalences are different, either a contrasting trend or equivalency 
             p_value <= p_threshold & abs(prev_delta_sign - delta_sign) == if_else(fairness_metric == 'FNR',0,2) ~ "BIAS", 
             p_value <= p_threshold ~ "NO BIAS", # catch where abs() == 0 or 1
             TRUE ~ NA # if the logic has a flaw, output NA
           ),
           directionality = case_when(
             p_value <= p_threshold & abs(prev_delta_sign - delta_sign) == if_else(fairness_metric == 'FNR',0,2) ~ TRUE, 
             p_value <= p_threshold ~ FALSE,
             TRUE ~ NA))
  output$group = unlist(l_perfs[group_colname])
  return(output %>% select(group, everything()) )
}

Using the dataframe and function from above, we’ll check the equal opportunity differences of subgroups for each class. You do need to edit this chunk; refer to the in-line comments to do so.

# Pay attention to the in-line comments in this chunk, as edits are required for your data/use-case: 

EOD_race = CIs %>% filter(!is.na(race), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'race', 'Hispanic', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(
    five_pp_diff = if_else(group == 'Hispanic', "REF", five_pp_diff) # Replace the value of 'group' with your reference group
  ) %>%  
  select('group','mean','eod','five_pp_diff','total','pos','prev') %>%
  rename('fnr' = mean) 
knitr::kable(EOD_race, caption = "Equal Opportunity Differences, Race", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Equal Opportunity Differences, Race
group fnr eod five_pp_diff total pos prev
African American 0.5261 0.2564 BIAS 5963 937 0.1571
Asian 0.0719 0.1979 BIAS 5619 473 0.0842
Hispanic 0.2697 0.0000 REF 9753 975 0.1000
Native American 0.1217 0.1480 BIAS 5631 460 0.0817
Something Else 0.2876 0.0179 NO BIAS 2826 226 0.0800
Two or more races 0.1259 0.1438 BIAS 5781 413 0.0714
Unknown 0.1247 0.1451 BIAS 5688 393 0.0691
White 0.1640 0.1057 BIAS 8739 817 0.0935
CIs %>% filter(!is.na(race), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'race', 'Hispanic', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("FNRs for race") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

EOD_sex = CIs %>% filter(!is.na(sex), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'sex', 'Female', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
    mutate(
    five_pp_diff = if_else(group == 'Female', "REF", five_pp_diff)) %>%  # Replace the value of 'group' with your reference group
    select('group','mean','eod','five_pp_diff','total','pos','prev') %>%
    rename('fnr' = mean) 
knitr::kable(EOD_sex, caption = "Equal Opportunity Differences, Sex", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Equal Opportunity Differences, Sex
group fnr eod five_pp_diff total pos prev
Female 0.2346 0.0000 REF 25997 2647 0.1018
Male 0.2565 0.0219 NO BIAS 24003 2047 0.0853
CIs %>% filter(!is.na(sex), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'sex', 'Female', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("FNRs for sex") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

EOD_language = CIs %>% filter(!is.na(language), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'language', 'English', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(
    five_pp_diff = if_else(group == 'English', "REF", five_pp_diff) # Replace the value of 'group' with your reference group
  ) %>%  
  select('group','mean','eod','five_pp_diff','total','pos','prev') 
knitr::kable(EOD_language, caption = "Equal Opportunity Differences, Language", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Equal Opportunity Differences, Language
group mean eod five_pp_diff total pos prev
English 0.2415 0.0000 REF 23841 2381 0.0999
Other 0.2264 0.0151 NO BIAS 6886 614 0.0892
Spanish 0.2543 0.0128 NO BIAS 19273 1699 0.0882
CIs %>% filter(!is.na(language), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'language', 'English', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("FNRs for language") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

EOD_insurance = CIs %>% filter(!is.na(insurance), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'insurance', 'Medicaid', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(
    five_pp_diff = if_else(group == 'Medicaid', "REF", five_pp_diff) # Replace the value of 'group' with your reference group
  ) %>%  
  select('group','mean','eod','five_pp_diff','total','pos','prev') 
knitr::kable(EOD_insurance, caption = "Equal Opportunity Differences, Insurance", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Equal Opportunity Differences, Insurance
group mean eod five_pp_diff total pos prev
Commercial 0.1969 0.1606 BIAS 10618 838 0.0789
Medicaid 0.3575 0.0000 REF 6619 716 0.1082
Medicare 0.1694 0.1881 BIAS 6524 549 0.0842
No Fault 0.3653 0.0078 NO BIAS 6454 761 0.1179
Other Government 0.1026 0.2549 BIAS 6588 575 0.0873
Self-Pay 0.3077 0.0498 NO BIAS 6538 676 0.1034
Workers Comp 0.1503 0.2073 BIAS 6659 579 0.0869
CIs %>% filter(!is.na(insurance), threshold == 0.15) %>% # If you defined multiple thresholds in Step 1, list the one you want to use for analysis here.
  bias_check('fnr_ci', 'insurance', 'Medicaid', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("FNRs for insurance") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

Assess Fairness across Classes to Identify “Most Biased” Class

Often, models will show evidence of bias within multiple classes. To start adjusting thresholds, we need to prioritize the class showing the most bias. A panel of summary measures can highlight overarching patterns to guide that choice. In the following table, we see measures quantifying the equal opportunity difference (“disparity”) between subgroups and their associated reference group across our classes of interest. The worst-performing class is highlighted in red for each statistic; the statistics are defined as follows:

  • “Biggest Abs EOD” = absolute value of the biggest difference between a subgroup’s FNR and the reference group’s FNR
  • “Abs Avg EOD” = absolute value of the average of the differences between each subgroup’s FNR and the reference group’s FNR
  • “Weighted Abs Avg EOD” = absolute value of the population-weighted average of the differences between each subgroup’s FNR and the reference group’s FNR
  • “Subgroups with >0.05 Abs EOD (%)” = percent of subgroups showing bias, i.e., percent of subgroups whose FNRs differ from the referent’s FNR by 0.05 or more
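
The chunk below builds the table by hand for each class. As an optional shortcut, a small helper along these lines (our own sketch, not part of the Playbook code; it assumes the EOD tables created above, with their 'group', 'eod', 'total', and 'five_pp_diff' columns) computes the same four statistics for one class:

eod_summary <- function(eod_df, ref_group) {
  non_ref = eod_df %>% filter(group != ref_group) # drop the referent row
  tibble(
    big = max(non_ref$eod), # Biggest Abs EOD ('eod' is already an absolute difference)
    avg = sum(non_ref$eod) / nrow(non_ref), # Abs Avg EOD
    avg_pop_adj = sum(non_ref$eod * non_ref$total) / sum(non_ref$total), # Weighted Abs Avg EOD
    five_pp = 100 * mean(non_ref$five_pp_diff == "BIAS") # Subgroups with >0.05 Abs EOD (%)
  )
}
# For example: eod_summary(EOD_race, 'Hispanic')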

You need to edit the following code chunk per your variables and referents, according to in-line comments:

# Pay attention to the in-line comments in this chunk, as edits are required for your data/use-case: 

disparity_table = tibble(
  class = c("race", "sex", "language", "insurance"), # Edit according to your classes
  big = c(
    (max(abs(EOD_race$eod))), # Edit according to your classes
    (max(abs(EOD_sex$eod))), # Edit according to your classes
    (max(abs(EOD_language$eod))), # Edit according to your classes
    (max(abs(EOD_insurance$eod))) # Edit according to your classes
    ),
  avg = c(
    sum(EOD_race$eod) / (n_distinct(EOD_race$group) - 1),  # Edit according to your classes
    sum(EOD_sex$eod) / (n_distinct(EOD_sex$group) - 1),      # Edit according to your classes
    sum(EOD_language$eod) / (n_distinct(EOD_language$group) - 1),  # Edit according to your classes
    sum(EOD_insurance$eod) / (n_distinct(EOD_insurance$group) - 1)), # Edit according to your classes
  avg_pop_adj = 
    c(
    (EOD_race %>% filter(group != 'Hispanic') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)), # Replace the value of group != to your reference group for this class
    (EOD_sex %>% filter(group != 'Female') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)), # Replace the value of group != to your reference group for this class
    (EOD_language %>% filter(group != 'English') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)), # Replace the value of group != to your reference group for this class
    (EOD_insurance %>% filter(group != 'Medicaid') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)) # Replace the value of group != to your reference group for this class
    ),
  five_pp = c(
    (nrow(EOD_race %>% filter(five_pp_diff == "BIAS")))/(nrow(EOD_race %>% filter(eod > 0)))*100, # Edit according to your classes
    (nrow(EOD_sex %>% filter(five_pp_diff == "BIAS")))/(nrow(EOD_sex%>% filter(eod > 0)))*100, # Edit according to your classes
    (nrow(EOD_language %>% filter(five_pp_diff == "BIAS")))/(nrow(EOD_language%>% filter(eod > 0)))*100, # Edit according to your classes
    (nrow(EOD_insurance %>% filter(five_pp_diff == "BIAS")))/(nrow(EOD_insurance%>% filter(eod > 0)))*100 # Edit according to your classes
    ) 
)

disparity_table = disparity_table %>% 
  mutate(across(where(is.numeric),~ round(.,3))) %>%
  mutate(across(where(is.numeric), ~ ifelse(. == max(.), paste0("<span style='color: red;'>", ., "</span>"), .)))

knitr::kable(disparity_table, caption = "Equal Opportunity Disparity Table", col.names = c(
 "Class",
 "Biggest Abs EOD",
 "Abs Avg EOD",
 "Weighted Abs Avg EOD",
 "Subgroups with >0.05 Abs EOD (%)"
)) 
Class Biggest Abs EOD Abs Avg EOD Weighted Abs Avg EOD Subgroups with >0.05 Abs EOD (%)
race 0.256 0.145 0.152 85.714
sex 0.022 0.022 0.022 0
language 0.015 0.014 0.013 0
insurance 0.255 0.145 0.147 66.667

Equal Opportunity Disparity Table

The class with the biggest burden of bias according to the table should be prioritized for mitigation in Step 3. For us, that’s race/ethnicity.

________________________________________________________________________________

Step 3: Mitigate Bias

Identify Group-Specific Thresholds That Improve Equal Opportunity

Threshold adjustment sets a custom risk score threshold for each group within one class. Our goal is to select the group-specific thresholds which maximize fairness without drastically changing our alert rate or accuracy. To do this, we set the referent group’s FNR as the goal. The code finds the threshold that yields an FNR closest to the referent’s for each other subgroup, to minimize total equal opportunity difference.

You will need to set your class and reference group values where indicated in-line:

# Pay attention to the in-line comments in this chunk, as edits are required for your data/use-case: 

# Function to calculate FNR
calculate_fnr <- function(predicted, actual, threshold) {
  predictions <- ifelse(predicted >= threshold, 1, 0)
  fn <- sum((predictions == 0) & (actual == 1))
  tp_fn <- sum(actual == 1)
  fnr <- fn / tp_fn
  return(fnr)
}

# Function to find the optimal threshold 
find_optimal_threshold <- function(data, reference_fnr, protected_class_value, protected_class_col, whichScore) {
  thresholds <- seq(0, 1, by = 0.01)
  fnr_diffs <- sapply(thresholds, function(threshold) {
    fnr <- calculate_fnr(data$proba[data[[protected_class_col]] == protected_class_value],
                         data$outcome[data[[protected_class_col]] == protected_class_value],
                         threshold)
    
    diff <- if (whichScore == 'closest') {
      abs(fnr - reference_fnr)
    } else if (whichScore == 'lowest' && fnr <= reference_fnr) {
      fnr - reference_fnr
    } else if (whichScore == 'highest' && fnr >= reference_fnr) {
      fnr - reference_fnr
    } else {
      NA_real_
    }
  })
  optimal_threshold <- if (whichScore == 'closest' || whichScore == 'highest') {
    thresholds[which.min(fnr_diffs)]
  } else if (whichScore == 'lowest') {
    thresholds[which.max(fnr_diffs)]
  } else {
    NA_real_
  } 
  return(optimal_threshold)
}

# Set your data
predictive_model_data <- BIA

# Set the protected class column name to the class identified as having the most bias in Step 2. 
protected_class_col <- 'race' 

# Set the baseline threshold value as the threshold for the reference class.
reference_threshold <- .15

# Set the reference group to calculate its FNR
reference_class <- 'Hispanic'

reference_fnr <- calculate_fnr(
  predictive_model_data$proba[predictive_model_data[[protected_class_col]] == reference_class],
  predictive_model_data$outcome[predictive_model_data[[protected_class_col]] == reference_class],
  reference_threshold
)

# Find optimal thresholds for each protected class value
protected_classes <- unique(predictive_model_data[[protected_class_col]])
optimal_thresholds <- sapply(protected_classes, function(class) {
  find_optimal_threshold(predictive_model_data, reference_fnr, class, protected_class_col,'closest')
})
# Print optimal thresholds
knitr::kable(optimal_thresholds, caption = "Subgroup Thresholds")
x
African American 0.04
Asian 0.35
Hispanic 0.15
Native American 0.27
Something Else 0.13
Two or more races 0.28
Unknown 0.28
White 0.35

Subgroup Thresholds

# Convert optimal_thresholds to a named vector
optimal_thresholds <- setNames(optimal_thresholds, protected_classes)

Update Predictions Using Your Adjusted Threshold to Prepare for Mitigation Assessment

We create an updated binary prediction variable to reflect our updated thresholds. We can think of this as “label flipping”: some scores previously categorized as low risk (0) are now categorized as high risk (1), and vice versa.

#  Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.

BIA <- BIA %>%
  mutate(
    new_label = ifelse(proba >= optimal_thresholds[!!sym(protected_class_col)], 1, 0), # group-specific thresholds
    old_label = ifelse(proba >= reference_threshold, 1, 0) # original single threshold
  )

We count our label flips to get a sense of the changes happening.

#  Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.

flips <- BIA %>%
  group_by(race) %>%
  summarise(
    count_changes = sum(old_label != new_label),
    one_to_zero = sum(old_label > new_label),
    zero_to_one = sum(old_label < new_label)
  ) %>%
  add_row(
    race = "Total",  # Label for the total row
    count_changes = sum(BIA$old_label != BIA$new_label),
    one_to_zero = sum(BIA$old_label > BIA$new_label),
    zero_to_one = sum(BIA$old_label < BIA$new_label)
  )
knitr::kable(flips)
race count_changes one_to_zero zero_to_one
African American 1169 0 1169
Asian 222 222 0
Hispanic 0 0 0
Native American 301 301 0
Something Else 97 0 97
Two or more races 302 302 0
Unknown 283 283 0
White 502 502 0
Total 2876 1610 1266

Step 4: Assess Mitigation Success

Assess Model Accuracy and Alert Rate with Updated Thresholds

We want to know if our updated thresholds have impacted model accuracy and alert rate. If we reduced bias but no longer offer valuable predictions, we should not implement the adjusted model. Although there are no hard rules, we are hoping to see that our threshold adjustment has not markedly decreased our overall accuracy by more than 5% or changed our alert rate by more than 20%. In some cases, more dramatic changes in accuracy or alert rate may be tolerable to improve fairness.

First, we’ll calculate accuracy.

#  Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.

# Function to calculate an accuracy score for a set of predicted labels
accuracy_score <- function(outcome, label, normalize = TRUE, sample_weight = NULL) {
  # Check that outcome and label are the same length
  if (length(outcome) != length(label)) {
    stop("outcome and label must have the same length")
  }
  if (is.matrix(outcome) || is.matrix(label)) {
    differing_labels <- rowSums(outcome != label)
    score <- as.numeric(differing_labels == 0)
  } else {
    score <- outcome == label
  }
  if (!is.null(sample_weight)) {
    score <- score * sample_weight
    total_weight <- sum(sample_weight)
    return(if (normalize) sum(score) / total_weight else sum(score))
  } else {
    return(if (normalize) mean(score) else sum(score))
  }
}

# Use the function to get baseline and adjusted accuracy scores
baseline_accuracy = accuracy_score(BIA$outcome, BIA$old_label)
adjusted_accuracy = accuracy_score(BIA$outcome, BIA$new_label)

# Print the accuracy scores
print(paste("Baseline Accuracy Score:", baseline_accuracy))
## [1] "Baseline Accuracy Score: 0.87758"
print(paste("Adjusted Accuracy Score:", adjusted_accuracy))
## [1] "Adjusted Accuracy Score: 0.8791"

Second, we’ll put our accuracy scores in a table and also calculate alert rate changes. You don’t need to edit the following code chunk if your nomenclature has been the same as ours up to this point.

#  Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.

alert_baseline = BIA %>% 
        summarize(total = n(), 
                  pp = sum(old_label)) %>% 
        mutate(alert_rate = pp/total)
               
alert_adjusted = BIA %>% 
        summarize(total = n(), 
                  pp = sum(new_label)) %>% 
        mutate(alert_rate = pp/total)

accuracy_alert = tibble(
  model = c("Baseline", "Adjusted", "Percent Change"),
  accuracy = c(baseline_accuracy, 
               adjusted_accuracy, 
               ((adjusted_accuracy - baseline_accuracy)/baseline_accuracy)*100),
  alert_rate = c(alert_baseline$alert_rate, 
                 alert_adjusted$alert_rate,
                (((alert_adjusted$alert_rate - alert_baseline$alert_rate)/alert_baseline$alert_rate))*100))

comparison = accuracy_alert %>%
  mutate(accuracy = round(accuracy,2),
         alert_rate = round(alert_rate, 2))

knitr::kable(comparison, caption = "Accuracy and Alert Rate Changes", col.names = c(
 "",
 "Accuracy",
 "Alert Rate"))  %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Accuracy and Alert Rate Changes
Accuracy Alert Rate
Baseline 0.88 0.17
Adjusted 0.88 0.16
Percent Change 0.17 -4.04

Check to see if both changes fall within our acceptable ranges (<5% change in accuracy and <20% change in alert rate). If yes, we can proceed to our second check.
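
The same check, expressed programmatically against the comparison table above (the tolerances here mirror this step's guidance; adjust them to your organization's):

acc_change = abs(accuracy_alert$accuracy[3]) # the "Percent Change" row
alert_change = abs(accuracy_alert$alert_rate[3]) # the "Percent Change" row
if (acc_change < 5 && alert_change < 20) {
  message("Within tolerance: proceed to the fairness re-assessment below.")
} else {
  message("Outside tolerance: revisit threshold selection before proceeding.")
}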

Assess Fairness Improvement within Chosen Class

We want to understand whether our updated thresholds reduced bias in the class that we identified. We re-generate the equal opportunity difference table we ran in Step 2 with our updated prediction labels.

# Again, you don't need to edit the parameters in this chunk as long as you've maintained our nomenclature, just run it.
adjusted_CIs = lapply(sensitive, function(s){
      loc_ci = BIA %>% 
      group_by(.data[[s]]) %>% 
      summarize(
                  total = n(), 
                  pos = sum(outcome), neg = total - pos, # actual positive and actual negative
                  pp = sum(new_label), pn = total - pp, # predicted P and N where total = P+N
                  tp = sum((new_label + outcome) == 2), fp = pp - tp, # Predicted P = True + False P
                  tn = sum((new_label + outcome) == 0), fn = pn - tn,
                  neg_check = fp + tn, pos_check = tp + fn
        ) %>% 
        mutate(prev_percent = (round((pos/total) * 100, digits=4)), 
               alert_rate = pp/total, 
               ppv_precision = tp/pp, ppv_ci = binom.confint(tp, pp, method = 'ac')[4:6], 
               npv = tn/pn, 
               tpr_recall_sensitivity = tp/(pos), tpr_ci = binom.confint(tp, pos, method = 'ac')[4:6],
               tnr_specificity = tn/(neg), 
               fpr_type1 = fp/(neg),
               fnr_type2 = fn/(pos), fnr_ci = binom.confint(fn, pos, method = 'ac')[4:6] 
  )
    }) %>% bind_rows()
# Pay attention to the in-line comments in this chunk, as edits are required for your data/use-case: 
adj_EOD_race = adjusted_CIs %>% filter(!is.na(race)) %>%
  bias_check('fnr_ci', 'race', 'Hispanic', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(
    five_pp_diff = if_else(group == 'Hispanic', "REF", five_pp_diff) # Replace the value of 'group' with your reference group
  ) %>%  
  select('group','mean','eod','five_pp_diff','total','pos','prev') %>%
  rename('fnr' = mean) 
knitr::kable(adj_EOD_race, caption = "Adjusted Equal Opportunity Differences, Race", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Adjusted Equal Opportunity Differences, Race
group fnr eod five_pp_diff total pos prev
African American 0.2882 0.0184 NO BIAS 5963 937 0.1571
Asian 0.2600 0.0097 NO BIAS 5619 473 0.0842
Hispanic 0.2697 0.0000 REF 9753 975 0.1000
Native American 0.2652 0.0045 NO BIAS 5631 460 0.0817
Something Else 0.2788 0.0090 NO BIAS 2826 226 0.0800
Two or more races 0.2663 0.0034 NO BIAS 5781 413 0.0714
Unknown 0.2646 0.0051 NO BIAS 5688 393 0.0691
White 0.2754 0.0057 NO BIAS 8739 817 0.0935
adjusted_CIs %>% filter(!is.na(race)) %>% 
  bias_check('fnr_ci', 'race', 'Hispanic', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("Adj FNRs for race") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

adj_EOD_sex = adjusted_CIs %>% filter(!is.na(sex)) %>% 
  bias_check('fnr_ci', 'sex', 'Female', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
    mutate(
    five_pp_diff = if_else(group == 'Female', "REF", five_pp_diff)) %>%  # Replace the value of 'group' with your reference group
    select('group','mean','eod','five_pp_diff','total','pos','prev') %>%
    rename('fnr' = mean) 
knitr::kable(adj_EOD_sex, caption = "Adjusted Equal Opportunity Differences, Sex", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Adjusted Equal Opportunity Differences, Sex
group fnr eod five_pp_diff total pos prev
Female 0.2032 0.0000 REF 25997 2647 0.1018
Male 0.3625 0.1592 BIAS 24003 2047 0.0853
adjusted_CIs %>% filter(!is.na(sex)) %>% 
  bias_check('fnr_ci', 'sex', 'Female', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("Adj FNRs for sex") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

adj_EOD_language = adjusted_CIs %>% filter(!is.na(language)) %>% 
  bias_check('fnr_ci', 'language', 'English', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(
    five_pp_diff = if_else(group == 'English', "REF", five_pp_diff) # Replace the value of 'group' with your reference group
  ) %>%  
  select('group','mean','eod','five_pp_diff','total','pos','prev') 
knitr::kable(adj_EOD_language, caption = "Adjusted Equal Opportunity Differences, Language", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Adjusted Equal Opportunity Differences, Language
group mean eod five_pp_diff total pos prev
English 0.2549 0.0000 REF 23841 2381 0.0999
Other 0.2752 0.0203 NO BIAS 6886 614 0.0892
Spanish 0.2966 0.0417 NO BIAS 19273 1699 0.0882
adjusted_CIs %>% filter(!is.na(language)) %>% 
  bias_check('fnr_ci', 'language', 'English', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("Adj FNRs for language") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

adj_EOD_insurance = adjusted_CIs %>% filter(!is.na(insurance)) %>% 
  bias_check('fnr_ci', 'insurance', 'Medicaid', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(
    five_pp_diff = if_else(group == 'Medicaid', "REF", five_pp_diff) # Replace the value of 'group' with your reference group
  ) %>%  
  select('group','mean','eod','five_pp_diff','total','pos','prev') 
knitr::kable(adj_EOD_insurance, caption = "Adjusted Equal Opportunity Differences, Insurance", digits=4) %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(2:4, background = "lightgray") %>%
  kableExtra::column_spec(5:7, background = "darkgray")
Adjusted Equal Opportunity Differences, Insurance
group mean eod five_pp_diff total pos prev
Commercial 0.2768 0.0500 NO BIAS 10618 838 0.0789
Medicaid 0.3268 0.0000 REF 6619 716 0.1082
Medicare 0.2441 0.0827 BIAS 6524 549 0.0842
No Fault 0.3968 0.0700 BIAS 6454 761 0.1179
Other Government 0.2052 0.1216 BIAS 6588 575 0.0873
Self-Pay 0.1967 0.1301 BIAS 6538 676 0.1034
Workers Comp 0.2193 0.1075 BIAS 6659 579 0.0869
adjusted_CIs %>% filter(!is.na(insurance)) %>% 
  bias_check('fnr_ci', 'insurance', 'Medicaid', fairness_metric='FNR') %>% # If you are using different sensitive variables than we are, change the second value to your first variable of interest; change the third value to its reference group. 
  mutate(label = paste0(as.character(round(prev*100, 1)), '%')) %>%
  ggplot(aes(x=group, y = mean, ymin = lower, ymax = upper, color = group)) + 
  geom_linerange() + geom_point() + 
  theme(axis.line=element_blank(), axis.text.x=element_blank()) + 
  ggtitle("Adj FNRs for insurance") +
  geom_text(aes(x=group, y=0, label=label), show.legend=FALSE)

Assess Impact of Threshold Adjustment on Fairness Across Classes

Comparing original and updated Disparity Tables shows how threshold adjustment impacted fairness for other classes. Every organization should define success contextually. To catalyze those conversations, we offer one possible set of guidelines in Step 4.3 of the Playbook.

# Pay attention to the in-line comments in this chunk, as edits are required for your data/use-case: 

adj_disparity_table = tibble(
  class = c("race", "sex", "language", "insurance"), # Edit according to your classes
  big = c(
    (max(abs(adj_EOD_race$eod))), # Edit according to your classes
    (max(abs(adj_EOD_sex$eod))), # Edit according to your classes
    (max(abs(adj_EOD_language$eod))), # Edit according to your classes
    (max(abs(adj_EOD_insurance$eod))) # Edit according to your classes
    ),
  avg = c(
    sum(adj_EOD_race$eod) / (n_distinct(adj_EOD_race$group) - 1),  # Edit according to your classes
    sum(adj_EOD_sex$eod) / (n_distinct(adj_EOD_sex$group) - 1),      # Edit according to your classes
    sum(adj_EOD_language$eod) / (n_distinct(adj_EOD_language$group) - 1),  # Edit according to your classes
    sum(adj_EOD_insurance$eod) / (n_distinct(adj_EOD_insurance$group) - 1)), # Edit according to your classes
  avg_pop_adj = 
    c(
    (adj_EOD_race %>% filter(group != 'Hispanic') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)), # Replace the value of group != to your reference group for this class
    (adj_EOD_sex %>% filter(group != 'Female') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)), # Replace the value of group != to your reference group for this class
    (adj_EOD_language %>% filter(group != 'English') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)), # Replace the value of group != to your reference group for this class
    (adj_EOD_insurance %>% filter(group != 'Medicaid') %>% summarize(weighted_sum = sum(eod * total), total_sum = sum(total)) %>% mutate(final_result = weighted_sum / total_sum) %>% pull(final_result)) # Replace the value of group != to your reference group for this class
    ),
  five_pp = c(
    (nrow(adj_EOD_race %>% filter(five_pp_diff == "BIAS")))/(nrow(adj_EOD_race %>% filter(eod > 0)))*100, # Edit according to your classes
    (nrow(adj_EOD_sex %>% filter(five_pp_diff == "BIAS")))/(nrow(adj_EOD_sex%>% filter(eod > 0)))*100, # Edit according to your classes
    (nrow(adj_EOD_language %>% filter(five_pp_diff == "BIAS")))/(nrow(adj_EOD_language%>% filter(eod > 0)))*100, # Edit according to your classes
    (nrow(adj_EOD_insurance %>% filter(five_pp_diff == "BIAS")))/(nrow(adj_EOD_insurance%>% filter(eod > 0)))*100 # Edit according to your classes
    ) 
)

adj_disparity_table = adj_disparity_table %>% 
  mutate(across(where(is.numeric),~ round(.,3))) %>%
  mutate(across(where(is.numeric), ~ ifelse(. == max(.), paste0("<span style='color: red;'>", ., "</span>"), .)))

knitr::kable(adj_disparity_table, caption = "Adjusted Equal Opportunity Disparity Table", col.names = c(
 "Class",
 "Biggest Abs EOD",
 "Abs Avg EOD",
 "Weighted Abs Avg EOD",
 "Subgroups with >0.05 Abs EOD (%)"
)) 
Class Biggest Abs EOD Abs Avg EOD Weighted Abs Avg EOD Subgroups with >0.05 Abs EOD (%)
race 0.018 0.008 0.008 0
sex 0.159 0.159 0.159 100
language 0.042 0.031 0.036 0
insurance 0.13 0.094 0.09 83.333

Adjusted Equal Opportunity Disparity Table

While we reduced bias in every column for our target class, we see some increases across other classes.

To interpret these differences and determine the success of mitigation using this table, refer to Section 4.3 of the Playbook.

To implement your new subgroup thresholds in your EHR, refer to Section 5 of the Playbook.

The Playbook can be found as a PDF under "Supplementary information" in our npj Digital Medicine paper: Mackin, S., Major, V.J., Chunara, R. et al. Identifying and mitigating algorithmic bias in the safety net. npj Digit. Med. 8, 335 (2025). https://doi.org/10.1038/s41746-025-01732-w

_____________________________________________________________________________

Suggested Citation: Mackin S, Major VJ, Chunara R, Dickenson A, Lee S, Bocour A, Eisenstein L, Davis N, Newton-Dame R. Mitigating bias in AI algorithms: A healthcare guide to threshold adjustment. New York City: Office of Population Health, NYC Health + Hospitals; 2025.
