Function to generate synthetic data with similar distributional properties to a real dataset #380

adamkucharski · 2025-02-04T12:20:45Z

adamkucharski
Feb 4, 2025
Maintainer

A topic that has come up in discussions with applied partners is the value of being able to generate synthetic data with similar properties to a real dataset that is sensitive and therefore cannot be shared, so that external groups can develop and test methods.

{simulist} already has the ability to generate simulated data from defined distributions, so addressing the above need would require us to define a new function that could:

Estimate key distributions in a line list (e.g. key delays, perhaps with a simple omission of recent data points to avoid truncation/censoring issues; proportion with key outcomes; demographic distribution; secondary case distribution and contact distribution)
Output an object that could then be used as an input to the existing simulist pipeline.

As an example, this contact tracing analysis uses synthetic-but-realistic marginal distributions for contacts in different settings, rather than publishing the full (and hence more sensitive) joint dataset: https://github.com/adamkucharski/2020-cov-tracing
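One way to picture the "estimate key delays, omitting recent data points" step is the base R sketch below. It is only an illustration under stated assumptions: `fit_delay()`, `extract_date` and `omit_days` are hypothetical names that are not part of {simulist}, the toy data are made up, and a lognormal is fitted purely because its maximum likelihood estimates have a closed form. Dropping recent onsets is a crude guard against right truncation, not a proper correction.

```r
# Toy line list standing in for the sensitive data (all values are made up)
set.seed(1)
n <- 200
real <- data.frame(
  date_onset = as.Date("2024-01-01") + sample(0:59, n, replace = TRUE)
)
real$date_admission <- real$date_onset +
  round(rlnorm(n, meanlog = 1, sdlog = 0.5))

# Mimic a data extract: admissions after the extract date are not yet recorded
extract_date <- as.Date("2024-03-01")
real$date_admission[real$date_admission > extract_date] <- NA

# Hypothetical helper: fit a lognormal to the delay between two date columns,
# ignoring cases whose onset falls within `omit_days` of the extract date
fit_delay <- function(data, from, to, extract_date, omit_days = 14) {
  keep <- data[[from]] <= extract_date - omit_days
  delay <- as.numeric(data[[to]][keep] - data[[from]][keep])
  # Simplification: missing and zero-day delays are dropped (log(0) is -Inf)
  delay <- delay[!is.na(delay) & delay > 0]
  log_delay <- log(delay)
  list(
    distribution = "lnorm",
    meanlog = mean(log_delay),
    sdlog = sqrt(mean((log_delay - mean(log_delay))^2)),
    n_used = length(delay)
  )
}

fit_onset_admission <- fit_delay(
  real, "date_onset", "date_admission", extract_date
)
str(fit_onset_admission)
```

The returned list is just a plain container for the fitted parameters; it would still need converting into whatever input format the simulation step expects.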

joshwlambert · 2025-02-04T14:03:23Z

joshwlambert
Feb 4, 2025
Collaborator

Thanks @adamkucharski, I like the idea, but I think the workflow as a whole is outside the scope of {simulist}.

Could you provide an overview of how you see the workflow, with pseudo-code if possible?

Then we can sketch out where {simulist} can be enhanced and where other functions/packages are needed. If you also have any datasets that could be used to test this workflow please share links to them if available.


adamkucharski · 2025-02-20T08:43:41Z

adamkucharski
Feb 20, 2025
Maintainer Author

Some example pseudo-code below, taking a 'true dataset', fitting marginal distributions, then resimulating with these. This would be easier if limited to distributions, as other columns could obviously take many forms.

# Simulate 'real data'
linelist <- sim_linelist()

# Define columns to match
match_cols <- list(c("age","integer"),
                   c("case_type","category")
)

# Extract relevant distributions and store...

# Add set up and storage code

for (ii in seq_along(match_cols)) {
  col_ii <- match_cols[[ii]][1] # Column name
  type_ii <- match_cols[[ii]][2] # Column type

  distn <- linelist[[col_ii]] # Get values (base R, so dplyr is not needed)
  
  if(type_ii == "category"){
    # ...
  }
  
  if(type_ii == "integer"){
    # ...
  }
  
}

# Define delays to match
match_delays <- list(c("date_onset","date_admission"),
                     c("date_onset","date_outcome")
)

# Fit relevant distributions and store...
fit_onset_admission <- NULL
fit_onset_outcome <- NULL

# Simulate synthetic data with matched properties

synthetic_linelist <- sim_linelist(
  population_age = fit_age,
  case_type = fit_case_type,
  onset_to_hosp = fit_onset_admission,
  onset_to_death = fit_onset_outcome
)
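The two empty branches in the loop above could be filled in along the following lines. This is a minimal base R sketch, not a {simulist} feature: `fit_marginal()` and `sample_marginal()` are hypothetical helpers, the toy data frame stands in for the real line list, and both column types are summarised by their empirical frequencies. Columns are resampled independently, so only the marginal distributions are matched and any joint structure is lost.

```r
# Toy data standing in for the real line list (values are made up)
set.seed(42)
toy <- data.frame(
  age = sample(1:90, 100, replace = TRUE),
  case_type = sample(
    c("confirmed", "probable", "suspected"), 100,
    replace = TRUE, prob = c(0.5, 0.3, 0.2)
  )
)

# Hypothetical helper: empirical marginal distribution of one column
fit_marginal <- function(x, type = c("category", "integer")) {
  type <- match.arg(type)
  probs <- prop.table(table(x)) # table() drops NA values by default
  values <- names(probs)
  if (type == "integer") values <- as.integer(values)
  list(type = type, values = values, probs = as.numeric(probs))
}

# Hypothetical helper: draw n synthetic values from a fitted marginal
sample_marginal <- function(fit, n) {
  idx <- sample.int(length(fit$values), size = n, replace = TRUE,
                    prob = fit$probs)
  fit$values[idx]
}

match_cols <- list(c("age", "integer"), c("case_type", "category"))

# Fit and store one marginal per requested column
fits <- list()
for (col in match_cols) {
  fits[[col[1]]] <- fit_marginal(toy[[col[1]]], col[2])
}

# Resample each column independently to build the synthetic columns
synthetic <- data.frame(
  age = sample_marginal(fits$age, 50),
  case_type = sample_marginal(fits$case_type, 50)
)
head(synthetic)
```

Because the integer branch resamples observed values directly, rare values (for example a single very old case) would reappear in the synthetic data; binning ages before fitting may be preferable when that is a disclosure concern.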


joshwlambert · 2025-02-20T15:55:20Z

joshwlambert
Feb 20, 2025
Collaborator

Thanks for the overview, it would definitely be really neat to be able to seamlessly go from real line list data to synthetic line list data in these steps, with the final step calling sim_linelist().

However, I don't think this pipeline fits into the scope of the {simulist} package. It would be good to post this onto the Epiverse-TRACE discussion board to get others' thoughts and see what the right format is for such a pipeline (e.g. how-to script, blog post, R package, etc.). Let me know if you're happy for me to transfer this issue there.

There are some added complications which would need working out in whatever form the pipeline takes. sim_linelist() produces a fixed set of line list columns, and therefore the columns of the output would be independent of the real line list being mocked. If we wanted the synthetic line list to match the columns of the original line list, this would require extra steps. Additionally, depending on how closely we would want the synthetic line list to match the original, we'd need to add some conditioning to the simulation (either internally to {simulist} or externally).


jamesmbaazam · 2025-03-12T11:19:38Z

jamesmbaazam
Mar 12, 2025
Collaborator

@adamkucharski I think the workflow you're describing can potentially be achieved with EpiNow2's forecast_infections function. It allows you to generate a posterior from given data and use that to generate a new dataset by altering the Rt trajectory. It generates a time series that we can then expand to a linelist.

What would be needed there is a function that takes a time series and expands it into individual cases with more linelist-like data.
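The expansion step described here could look roughly like the base R sketch below. It is an assumption-laden illustration rather than an existing function: `expand_counts()` is a hypothetical name, the input is assumed to be a data frame with a `date` column and a `cases` column of non-negative whole numbers, and each date is treated as the onset date of its cases. It does not call EpiNow2; any daily count series would do as input.

```r
# Hypothetical helper: turn a daily count series into one row per case
expand_counts <- function(counts) {
  stopifnot(
    all(c("date", "cases") %in% names(counts)),
    all(counts$cases >= 0)
  )
  data.frame(
    id = seq_len(sum(counts$cases)),
    # Repeat each date once per case reported on that date
    date_onset = rep(counts$date, times = counts$cases)
  )
}

# Made-up example series with a zero-count day
counts <- data.frame(
  date = as.Date("2025-01-01") + 0:3,
  cases = c(2, 0, 3, 1)
)

cases <- expand_counts(counts)
cases
```

Further line list columns (age, case type, delays to admission or outcome) would then have to be drawn per row from separately fitted distributions.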


joshwlambert · 2025-03-25T11:45:45Z

joshwlambert
Mar 25, 2025
Collaborator

This has been transferred from a {simulist} issue to open up discussion to the wider Epiverse community, as it is beyond the scope of {simulist}.

