geom_flow() Fails with "Data is not in a recognized alluvial form" Despite Valid Data Structure #140

@abiyug

Description:
I’ve encountered a persistent error when using ggalluvial’s geom_flow() to create an alluvial plot: "Data is not in a recognized alluvial form". The error occurs even with a minimal data.frame whose structure resembles the Titanic dataset (which works with geom_flow() in the package examples). The data passes is_alluvia_form(), and the backtrace shows the error being raised in the stat’s setup_data() method, which is called while the geom_flow() layer is built. This appears to be a bug in ggalluvial, as the same error occurs with both the CRAN version and the development version of the package.

Steps to Reproduce:

library(ggalluvial)
library(ggplot2)

# Minimal data: 2 countries, 2 years, 2 energy types
test_data <- data.frame(
  country = rep(c("Egypt", "Kenya"), each = 4),
  year = factor(rep(c("2000", "2001"), each = 2, times = 2)),  # both energy types per year, matching the str() output
  energy_type = factor(rep(c("non_pollutant", "pollutant"), times = 4)),
  demand = c(13.7, 60.77, 15.2, 64.17, 1.31, 2.04, 2.38, 1.53)
)

# Verify structure
str(test_data)
class(test_data)  # Confirms "data.frame"
is_alluvia_form(test_data, axes = "year", id = "country", weight = "demand")  # Should return TRUE

Attempt to create an alluvial plot with geom_flow():

ggplot(test_data,
       aes(x = year,
           y = demand,
           stratum = energy_type,
           alluvium = country,
           fill = energy_type)) +
  geom_flow(stat = "flow") +
  geom_stratum(width = 0.2) +
  scale_x_discrete(expand = c(0.1, 0.1)) +
  theme_minimal()

Expected Behavior:
The plot should render an alluvial diagram, with flows showing how each country’s demand splits between non_pollutant and pollutant energy types across the years 2000 and 2001. The Titanic dataset from the ggalluvial examples works with this setup, and test_data is structured similarly (data.frame with factors for axes and numeric weights).

Actual Behavior:
The plot fails with the following error:

Error in `geom_flow()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_data()`:
! Data is not in a recognized alluvial form (see `help('alluvial-data')` for details).


Traceback:

rlang::last_trace()
<error/rlang_error>
Error in `geom_flow()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_data()`:
! Data is not in a recognized alluvial form (see `help('alluvial-data')` for details).
---
Backtrace:
     ▆
  1. ├─base (local) `<fn>`(x)
  2. └─ggplot2:::print.ggplot(x)
  3.   ├─ggplot2::ggplot_build(x)
  4.   └─ggplot2:::ggplot_build.ggplot(x)
  5.     └─ggplot2:::by_layer(...)
  6.       ├─rlang::try_fetch(...)
  7.       │ ├─base::tryCatch(...)
  8.       │ │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
  9.       │ │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 10.       │ │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
 11.       │ └─base::withCallingHandlers(...)
 12.       └─ggplot2 (local) f(l = layers[[i]], d = data[[i]])
 13.         └─l$compute_statistic(d, layout)
 14.           └─ggplot2 (local) compute_statistic(..., self = self)
 15.             └─self$stat$setup_data(data, self$computed_stat_params)
 16.               └─ggalluvial (local) setup_data(...)
 17.                 └─base::stop("Data is not in a recognized alluvial form ", "(see `help('alluvial-data')` for details).")
Run rlang::last_trace(drop = FALSE) to see 5 hidden frames.

Environment:

  • R version: [Run R.version.string to get this, e.g., "R version 4.3.1 (2023-06-16)"]
  • ggalluvial version: [Run packageVersion("ggalluvial"), e.g., 0.12.5 or dev version]
  • Other packages loaded: [Run sessionInfo() and include relevant details, e.g., ggplot2 version]
  • Operating System: [Specify your OS, e.g., Windows 11, macOS Ventura 13.5, etc.]
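
The placeholders above can be filled in from a fresh R session with something like the following sketch. It uses only base R and utils calls; `installed_version()` is a hypothetical helper defined here for illustration, not part of any package, and it returns NA instead of failing when a package is missing.

```r
# Collect the environment details requested above.
# installed_version() is a hypothetical helper, defined only for this sketch.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

env_info <- list(
  r_version  = R.version.string,
  ggalluvial = installed_version("ggalluvial"),
  ggplot2    = installed_version("ggplot2"),
  os         = paste(Sys.info()[["sysname"]], Sys.info()[["release"]])
)
str(env_info)

# sessionInfo() lists the platform and every attached package in one go.
print(utils::sessionInfo())
```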

Additional Context:
The data structure matches the requirements for geom_flow():

  • country (character): Alluvium identifier (like individuals in the Titanic dataset).
  • year (factor): X-axis for time steps.
  • energy_type (factor): Stratum for each time step.
  • demand (numeric): Weight for the flows.
  • is_alluvia_form(test_data, axes = "year", id = "country", weight = "demand") returns TRUE, indicating the data is in a valid alluvial form.
  • The same error occurs with a larger dataset (4 countries, 5 years) and persists even after converting from tibble to data.frame, subsetting, and testing with the development version of ggalluvial.
  • The Titanic dataset from the ggalluvial examples works, but custom data with a similar structure fails.
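
A check that may help narrow this down: the plot maps x, stratum, and alluvium, which is the long ("lodes") layout rather than the wide ("alluvia") layout that is_alluvia_form() tests. As far as help('alluvial-data') describes it, the lodes layout expects each alluvium to appear at most once per axis value. The base-R sketch below counts rows per country and year. The data are built to match the str() output shown in the reprex, which is an assumption about the intended layout, and `flow_id` is a name introduced here for illustration.

```r
# Data laid out as in the str() output of the reprex (assumed layout):
# within each country, both energy types are recorded for each year.
test_data <- data.frame(
  country = rep(c("Egypt", "Kenya"), each = 4),
  year = factor(rep(c("2000", "2001"), each = 2, times = 2)),
  energy_type = factor(rep(c("non_pollutant", "pollutant"), times = 4)),
  demand = c(13.7, 60.77, 15.2, 64.17, 1.31, 2.04, 2.38, 1.53)
)

# Rows per (alluvium, x) pair when alluvium = country.
rows_per_pair <- table(test_data$country, test_data$year)
print(rows_per_pair)

# TRUE means some country appears more than once within the same year.
has_repeats <- any(rows_per_pair > 1)
print(has_repeats)

# The same check when the alluvium is country combined with energy type.
flow_id <- interaction(test_data$country, test_data$energy_type)
unique_with_flow_id <- !any(duplicated(data.frame(flow_id, year = test_data$year)))
print(unique_with_flow_id)
```

If ggalluvial is installed, is_lodes_form(test_data, key = "year", value = "energy_type", id = "country") should give the package's own view of the same question.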

Minimal Reproducible Example (Reprex):
Here’s a self-contained reprex to reproduce the issue:

# Load packages
library(ggalluvial)
library(ggplot2)

# Create minimal data
test_data <- data.frame(
  country = rep(c("Egypt", "Kenya"), each = 4),
  year = factor(rep(c("2000", "2001"), each = 2, times = 2)),  # both energy types per year, matching the str() output below
  energy_type = factor(rep(c("non_pollutant", "pollutant"), times = 4)),
  demand = c(13.7, 60.77, 15.2, 64.17, 1.31, 2.04, 2.38, 1.53)
)

# Verify data
str(test_data)
# 'data.frame':	8 obs. of  4 variables:
#  $ country    : chr  "Egypt" "Egypt" "Egypt" "Egypt" ...
#  $ year       : Factor w/ 2 levels "2000","2001": 1 1 2 2 1 1 2 2
#  $ energy_type: Factor w/ 2 levels "non_pollutant",..: 1 2 1 2 1 2 1 2
#  $ demand     : num  13.7 60.77 15.2 64.17 1.31 2.04 2.38 1.53

class(test_data)  # "data.frame"
is_alluvia_form(test_data, axes = "year", id = "country", weight = "demand")  # TRUE

# Attempt to plot
ggplot(test_data,
       aes(x = year,
           y = demand,
           stratum = energy_type,
           alluvium = country,
           fill = energy_type)) +
  geom_flow(stat = "flow") +
  geom_stratum(width = 0.2) +
  scale_x_discrete(expand = c(0.1, 0.1)) +
  theme_minimal()

Workaround:
As a workaround, I switched to the networkD3 package, which successfully created an interactive Sankey diagram with the same data. However, geom_alluvium() from ggalluvial might work as an alternative within the package (I haven’t tested this yet due to time constraints).
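
A possible alternative that stays within ggalluvial, offered as a minimal sketch rather than a confirmed fix: give each flow its own identifier by combining country and energy type, so that every alluvium has exactly one row per year. The `flow_id` column is a name introduced here for illustration, and the data layout follows the str() output in the reprex (an assumption about the intended data). Whether this is the intended usage is for the maintainers to confirm.

```r
# Same values as the reprex, laid out as in its str() output (assumed layout).
test_data <- data.frame(
  country = rep(c("Egypt", "Kenya"), each = 4),
  year = factor(rep(c("2000", "2001"), each = 2, times = 2)),
  energy_type = factor(rep(c("non_pollutant", "pollutant"), times = 4)),
  demand = c(13.7, 60.77, 15.2, 64.17, 1.31, 2.04, 2.38, 1.53)
)

# One alluvium per country and energy type (flow_id is a hypothetical name).
test_data$flow_id <- interaction(test_data$country, test_data$energy_type)

# Each flow_id should now occur exactly once per year.
stopifnot(!any(duplicated(test_data[c("flow_id", "year")])))

# Only attempt the plot when ggalluvial is available.
if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)

  p <- ggplot(test_data,
              aes(x = year,
                  y = demand,
                  stratum = energy_type,
                  alluvium = flow_id,
                  fill = energy_type)) +
    geom_flow(stat = "flow") +
    geom_stratum(width = 0.2) +
    scale_x_discrete(expand = c(0.1, 0.1)) +
    theme_minimal()
  print(p)
}
```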

Suggested Fix:

  • Investigate setup_data() in ggalluvial to identify why it rejects valid data (passing is_alluvia_form()).
  • Check for potential issues with factor levels, numeric weights, or internal assumptions about data structure.
  • Ensure compatibility with plain data.frame inputs, as the Titanic dataset works but custom data fails.

Additional Notes:
I also encountered a similar issue with ggsankey’s geom_sankey(), suggesting a broader problem with alluvial/Sankey implementations in R. I’ll file a separate bug report for ggsankey later. For now, focusing on ggalluvial, this bug significantly impacts usability for custom datasets, and I’d appreciate any guidance or fixes.
