Function to generate synthetic data with similar distributional properties to a real dataset #380
Replies: 5 comments
-
Thanks @adamkucharski, I like the idea, but I think the workflow as a whole is outside the scope of {simulist}. Could you provide an overview of how you see the workflow, with pseudo-code if possible? Then we can sketch out where {simulist} can be enhanced and where other functions/packages are needed. If you also have any datasets that could be used to test this workflow, please share links to them if available.
-
Some example pseudo-code below, taking a 'true dataset', fitting marginal distributions, then resimulating with these. It would be easier if limited to distributions, as other columns could obviously take lots of forms.

library(simulist)

# Simulate 'real data'
linelist <- sim_linelist()

# Define columns to match
match_cols <- list(
  c("age", "integer"),
  c("case_type", "category")
)

# Extract relevant distributions and store...
# Add set up and storage code (placeholders until the fitting code is written)
fit_age <- NULL
fit_case_type <- NULL
for (ii in seq_along(match_cols)) {
  col_ii <- match_cols[[ii]][1] # Column name
  type_ii <- match_cols[[ii]][2] # Column type
  distn <- linelist[[col_ii]] # Get values
  if (type_ii == "category") {
    # ...
  }
  if (type_ii == "integer") {
    # ...
  }
}

# Define delays to match
match_delays <- list(
  c("date_onset", "date_admission"),
  c("date_onset", "date_outcome")
)

# Fit relevant distributions and store...
fit_onset_admission <- NULL
fit_onset_outcome <- NULL

# Simulate synthetic data with matched properties
# (pseudo-code: the argument names indicate intent only)
synthetic_linelist <- sim_linelist(
  population_age = fit_age,
  case_type = fit_case_type,
  onset_to_hosp = fit_onset_admission,
  onset_to_death = fit_onset_outcome
)
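To make the elided fitting steps in the pseudo-code more concrete, here is a minimal base R sketch under stated assumptions. The data frame `real_linelist` is a toy stand-in with made-up values, the helpers `fit_category()` and `fit_delay_gamma()` are hypothetical names, and the gamma fit uses the method of moments rather than whatever fitting approach a final pipeline might choose. It does not call {simulist}.

```r
# Toy stand-in for a sensitive line list (made-up values, not real data)
set.seed(1)
n <- 500
real_linelist <- data.frame(
  case_type = sample(c("confirmed", "probable", "suspected"), n,
                     replace = TRUE, prob = c(0.6, 0.3, 0.1)),
  date_onset = as.Date("2023-01-01") + sample(0:60, n, replace = TRUE)
)
real_linelist$date_admission <- real_linelist$date_onset +
  round(rgamma(n, shape = 2, rate = 0.5))

# Hypothetical helper: marginal distribution of a categorical column
# as empirical proportions
fit_category <- function(x) {
  prop.table(table(x))
}

# Hypothetical helper: gamma distribution for a delay between two date
# columns, fitted by the method of moments (an assumption of this sketch)
fit_delay_gamma <- function(from, to) {
  delay <- as.numeric(to - from)
  delay <- delay[!is.na(delay) & delay >= 0]
  m <- mean(delay)
  v <- var(delay)
  list(shape = m^2 / v, rate = m / v)
}

fit_case_type <- fit_category(real_linelist$case_type)
fit_onset_admission <- fit_delay_gamma(
  real_linelist$date_onset, real_linelist$date_admission
)

# Draw synthetic values from the fitted marginals only
synthetic <- data.frame(
  case_type = sample(names(fit_case_type), n, replace = TRUE,
                     prob = as.numeric(fit_case_type)),
  onset_to_admission = rgamma(n, shape = fit_onset_admission$shape,
                              rate = fit_onset_admission$rate)
)
```

Because each column is sampled independently, the synthetic data reproduce the marginal distributions but not any correlation between columns.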
-
Thanks for the overview, it would definitely be really neat to be able to seamlessly go from real line list data to synthetic line list data in these steps, with the final step calling However, I don't think this pipeline fits into the scope of the {simulist} package. It would be good to post this on the Epiverse-TRACE discussion board to get others' thoughts and see what the right format is for such a pipeline (e.g. how-to script, blog post, R package, etc.). Let me know if you're happy for me to transfer this issue there. There are some added complications which would need working out in whatever form the pipeline takes.
-
@adamkucharski I think the workflow you're describing can potentially be achieved with EpiNow2's forecast_infections function. It allows you to generate a posterior from given data and use that to generate a new dataset by altering the Rt trajectory. It generates a time series that we can then expand to a linelist. What would be needed there is a function that takes a time series and expands it into individual cases with more linelist-like data.
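As a rough illustration of that last step, the base R sketch below expands a daily count series into one row per case. Everything here is an assumption made for illustration: `incidence` is a made-up data frame rather than actual EpiNow2 output, `expand_to_linelist()` is a hypothetical function name, and ages are drawn uniformly only to show where linelist-like columns would be attached.

```r
# Made-up daily case counts, standing in for a forecast time series
incidence <- data.frame(
  date = as.Date("2023-01-01") + 0:4,
  cases = c(2, 0, 3, 1, 4)
)

# Hypothetical helper: expand a time series of counts into one row per case
expand_to_linelist <- function(incidence, age_range = c(1, 90)) {
  # Repeat each date once per case reported on that date
  onset <- rep(incidence$date, times = incidence$cases)
  n <- length(onset)
  data.frame(
    id = seq_len(n),
    date_onset = onset,
    # Placeholder individual-level column: uniform ages (an assumption)
    age = sample(seq(age_range[1], age_range[2]), n, replace = TRUE)
  )
}

set.seed(1)
cases <- expand_to_linelist(incidence)
```

Further columns, such as delays from onset to admission or outcome, could be added in the same way by sampling from fitted distributions.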
-
This has been transferred from a {simulist} issue to open up discussion to the wider Epiverse community, as it is beyond the scope of {simulist}.
-
A topic that has come up in discussions with applied partners is the value of being able to generate synthetic data with similar properties to a real dataset that is sensitive and therefore cannot be shared, so that external groups can develop and test methods.
{simulist} already has the ability to generate simulated data from defined distributions, so addressing the above need would require us to define a new function that could:
As an example, this contact tracing analysis uses synthetic-but-realistic marginal distributions for contacts in different settings, rather than publishing the full (and hence more sensitive) joint dataset: https://github.com/adamkucharski/2020-cov-tracing