Performance on trees with internal samples #153

@hyanwong

Description

Chatting to @molpopgen, he says:

When there are sufficient numbers of ancient samples, doing anything with trees is terribly inefficient, and you can recover literally orders of magnitude by simplifying to each time point for which there are ancient samples.
An example that I run into a lot, and I'm sure @petrelharp has too, is simulations where you remember everyone for some period of time.
In those cases, performance regresses from logarithmic to linear, and there's a tremendous amount of time spent updating information about nodes that have nothing to do with your current time slice.
In a simulation, most ancient samples will tend to be internal. And many are not ancestral to the final generation.
Here's a figure I made yesterday based on a massively polygenic simulation. There are millions of internal nodes making up the time series. The plot takes over an hour to make if you don't simplify to each time point separately.

[Figure: time series from the simulation; 20,000 nodes per time point by 100 time points.]
The D statistic is calculated from a random sample of 50 diploid individuals. You basically have to simplify to that sample in order for the figure to be possible.
If you have few samples, performance is closer to logarithmic; if you have lots, it's quite poor, like I said. This is an extremely common case. I'm pretty sure that Peter does this routinely, and I certainly do.

This seems like prime material for one of the "High performance" tutorials (see #151). There's an open issue on it in molpopgen/fwdpy11#394, but I guess this is a general tree sequence issue and so might well be a candidate for incorporation here.
