ggsurvplot: dollar in level of one variable changes the order in which another variable appears in legend #680

@VincentvdN

Situation:

Suppose I compare a new treatment against a standard treatment in patients for whom we also know the value of some biomarker. In the fake data below we live in an ideal world where the marker gives a very clear answer about which patient to treat with which treatment: in patients with low marker values the standard treatment works better than the experimental treatment, and in patients with high marker values the experimental treatment works better than the standard one. Also, in this fake data, all patients with high marker values live longer than all patients with low marker values, just so we can easily tell the groups apart:

library(survival)   # Surv(), survfit()
library(survminer)  # ggsurvplot()

fakedata <- data.frame(
  patients = seq(1, 16),
  marker = factor(rep(c("Marker $< 50\\%$", "Marker $\\geq 50\\%$"), each = 8), levels = c("Marker $< 50\\%$", "Marker $\\geq 50\\%$")),
  arm = factor(rep(c("Experimental", "Standard"), times = 8), levels = c("Standard", "Experimental")),
  time = c(1, 2, 2, 4, 3, 6, 4, 8, 10, 5, 12, 6, 14, 7, 16, 8),
  event = rep(1, 16)
)

fit <- survfit(Surv(time, event) ~ marker + arm, data = fakedata)

The dollars in the levels of the marker variable are there to make them look nice when I put them in a LaTeX table. Of course I don't want the dollars to show up in the legend of the plot, so I will use legend.labs = to get better labels. We'll get there, but first a quick check that fakedata does indeed do what I say: typing fit produces the following table.

Call: survfit(formula = Surv(time, event) ~ marker + arm, data = fakedata)

                                              n events median 0.95LCL 0.95UCL
marker=Marker $< 50\\%$, arm=Standard         4      4    5.0       2      NA
marker=Marker $< 50\\%$, arm=Experimental     4      4    2.5       1      NA
marker=Marker $\\geq 50\\%$, arm=Standard     4      4    6.5       5      NA
marker=Marker $\\geq 50\\%$, arm=Experimental 4      4   13.0      10      NA

We see that people with high marker values indeed live longer than people with low marker values, and that in the low-marker group patients on standard treatment live longer, while in the high-marker group patients on experimental treatment do. We also note that the order in which the four groups appear in the table is exactly what we expect.

ggsurvplot call:

Now I want to make a plot of the situation. As said above, I use legend.labs to get level names without dollars, and also without equals signs and variable names:

ggsurvplot(fit, data = fakedata, legend.labs = c("Marker < 50%, Standard arm", "Marker < 50%, Experimental arm", "Marker \u2265 50%, Standard arm", "Marker \u2265 50%, Experimental arm"))

Expected behaviour:

Image

A nice plot showing survivals from best to worst in the right order: high marker, experimental treatment best; high marker standard treatment second; low marker standard treatment third; low marker experimental treatment worst.

Actual behaviour:

Image

A highly misleading plot claiming that people with low marker values in the standard arm are doing best and people with high marker values in the standard arm are doing worst. Here I know something is wrong because I created the data myself. But with real data, where I use the plot to understand which patient should get which treatment and what their prognosis is, this could cause serious trouble.

What is going on here?

We get some clue to what is going on if we leave out the legend.labs= so that we get a chance to see what legend labels ggsurvplot would create itself in this situation:

plotnolabels <- ggsurvplot(fit, data = fakedata) 

attr(plotnolabels$plot, "parameters")$legend.labs

The results are interesting:

[1] "\\geq 50\\%$, arm=Experimental" "\\geq 50\\%$, arm=Standard    "
[3] "< 50\\%$, arm=Experimental"     "< 50\\%$, arm=Standard    "    

We see two things:

  1. the "marker=" part is missing, as well as the beginning of the level name up to the first dollar.

This is what makes me believe the dollars are the cause of the problem. But what is more worrisome, and the reason I consider this a real bug, is

  2. the order of the levels of the other variable, arm, is switched.

This is very weird and causes the error in the plot. I guess it is related to this issue that under "normal" circumstances was resolved some time ago: #74

Workaround, or: what code did I use to create the 'expected-behavior'-plot

We can just make sure that there are no dollars that can cause trouble before using ggsurvplot:

fakedata$markerclean <- factor(fakedata$marker, levels = levels(fakedata$marker), labels = c("low", "high"))
table(fakedata[, c("marker", "markerclean")])

fit2 <- survfit(Surv(time, event) ~ markerclean + arm, data = fakedata)

correctplot <- ggsurvplot(fit2, data = fakedata, legend.labs = c("Marker < 50%, Standard arm", "Marker < 50%, Experimental arm", "Marker \u2265 50%, Standard arm", "Marker \u2265 50%, Experimental arm"))
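
The legend labels can also be built from two short vectors instead of being typed by hand, which makes it harder to get the label order wrong. This is only a minimal sketch: it assumes the strata come in the same order as in the table printed by fit above (first variable varies slowest, second fastest), and the names marker_labs, arm_labs and legend_labs are made up for this illustration.

```r
# Readable labels, given in the same order as the factor levels
# (assumption: low/high for the marker, Standard/Experimental for arm).
marker_labs <- c("Marker < 50%", "Marker \u2265 50%")
arm_labs    <- c("Standard arm", "Experimental arm")

# expand.grid() varies its FIRST argument fastest, so arm is passed first
# to make it the fastest-varying variable, as in the survfit table.
grid <- expand.grid(arm = arm_labs, marker = marker_labs,
                    stringsAsFactors = FALSE)

# One label per stratum: "<marker label>, <arm label>".
legend_labs <- paste(grid$marker, grid$arm, sep = ", ")
```

The resulting vector could then be passed as legend.labs = legend_labs in the ggsurvplot call above.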

However, doing this requires that I know beforehand that dollars are a problem. It would be safer if this bug were fixed, or if ggsurvplot issued a warning when it encounters dollars or other forbidden characters.
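
Until then, a check like the following could be run before fitting. It is a rough sketch in base R, not part of survminer: check_levels is a hypothetical helper, and the list of "suspicious" characters is my own guess, since I do not know which characters ggsurvplot actually has trouble with.

```r
# Hypothetical helper (not part of survminer): warn when the levels of the
# given variables contain characters that might be misinterpreted.
# The character list below is an assumption, not a documented list.
check_levels <- function(data, vars,
                         special = c("$", "\\", "^", "|", "(", ")",
                                     "[", "]", "{", "}", "*", "+", "?")) {
  flagged <- character(0)
  for (v in vars) {
    lv <- levels(factor(data[[v]]))
    # fixed = TRUE compares literal characters, so the check itself
    # cannot be confused by regular-expression metacharacters.
    has_special <- vapply(lv, function(s) {
      any(vapply(special, function(ch) grepl(ch, s, fixed = TRUE),
                 logical(1)))
    }, logical(1))
    if (any(has_special)) {
      warning(sprintf("Variable '%s' has levels with special characters: %s",
                      v, paste(lv[has_special], collapse = "; ")),
              call. = FALSE)
      flagged <- c(flagged, v)
    }
  }
  # Return the names of the flagged variables without printing them.
  invisible(flagged)
}
```

With the data from this report, check_levels(fakedata, c("marker", "arm")) would warn about marker and stay silent about arm.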
