Pedigree data in sgkit #786

timothymillar · 2021-12-23T11:14:57Z

timothymillar
Dec 23, 2021
Maintainer

Hi all, I'd like to start some discussion on the possibility of supporting pedigree data in sgkit.

Background

A pedigree is essentially a DAG recording the familial relationships between individuals. A key aspect of the pedigree DAG structure is that each node typically only descends from two parent nodes in a sexually reproductive species, or from a single node in a clonal species.

There may be a mixture of clonal and sexual reproduction in some breeding systems. For example, some plant varieties have originated as mutant bud-sports (i.e. a single branch/cane exhibiting a novel phenotype due to somatic mutation).

Selfing is a special case of sexual reproduction that may occur in hermaphroditic species in which a sperm and egg cell from a single parent give rise to a new genotype. This differs from asexual reproduction because the sperm and egg cells both represent a random subset of the parent genome. Hence each gamete may contain identical alleles resulting in a potent form of inbreeding.

File formats

Pedigree data is often stored in simple tabular formats where each row represents a single individual/genotype and the parents of each individual are indicated in "Father" and "Mother" columns (commonly "Sire" and "Dam" in animal breeding).
The rows of these formats are often sorted by topological order (i.e. no individual appears before its ancestors) and this is expected by some software packages.

The ped format is a loose standard which is primarily used for sexually reproducing dioecious diploid species. This is a tabular (white space separated) format used by gatk, plink and many other packages. There are typically six required columns including:

Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype

Asexual reproduction is not explicitly supported by the ped format. In some cases a clone is indicated by a single parent (mother or father), however, this may be ambiguous as it can also indicate an unknown parent. Selfing is often represented in ped format by encoding both the father and mother as the parent plant. This is biologically accurate but may be misinterpreted as asexual reproduction.
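As a rough illustration of the six columns and the ambiguity described above, here is a minimal sketch of a ped reader that flags single-parent and identical-parent rows. The function name `read_ped`, the example rows and the use of "0" as the missing-parent code are assumptions for illustration only, not an sgkit API.

```python
import io

# Hypothetical six-column ped content; "0" is assumed to mark an unknown parent.
PED_TEXT = """\
fam1 A 0 0 1 -9
fam1 B 0 0 2 -9
fam1 C A B 2 -9
fam1 D C 0 0 -9
fam1 E C C 0 -9
"""

COLUMNS = (
    "family_id",
    "individual_id",
    "paternal_id",
    "maternal_id",
    "sex",
    "phenotype",
)


def read_ped(handle, missing="0"):
    """Parse whitespace-separated ped rows into a list of dicts."""
    records = []
    for line in handle:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, fields[:6]))
        known = [
            p
            for p in (record["paternal_id"], record["maternal_id"])
            if p != missing
        ]
        if len(known) == 1:
            # The format alone cannot tell these two cases apart.
            record["note"] = "single parent: clone or missing parent?"
        elif len(known) == 2 and known[0] == known[1]:
            record["note"] = "identical parents: presumably selfing"
        else:
            record["note"] = ""
        records.append(record)
    return records


records = read_ped(io.StringIO(PED_TEXT))
for r in records:
    print(r["individual_id"], r["paternal_id"], r["maternal_id"], r["note"])
```

The notes are only hints: resolving a single-parent row still needs information from outside the file.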

The VCF standard supports pedigree descriptions in the header metadata using the ##PEDIGREE tag. This is a very flexible format which allows for explicit labelling of each relationship type and has semi-standard conventions for sexual reproduction ("Father", "Mother") and asexual reproduction ("Original").

Basic array representation

The simplest way to store pedigree data in an array structure is a matrix of shape (samples * parental edges) i.e. a row for each sample and a "father" and "mother" column. The values of this matrix may be the parent identifiers or the positional index of each parent based on the sample order.

Pedigree using parent identifiers:

sample_id = ["A", "B", "C", "D", "E"]
sample_edges = [
	[".", "."],
	[".", "."],
	[".", "."],
	["A", "B"],
	["A", "C"],
]

Pedigree using parent indices (-1 indicating unknown):

sample_id = ["A", "B", "C", "D", "E"]
sample_edges = [
	[-1, -1],
	[-1, -1],
	[-1, -1],
	[ 0,  1],
	[ 0,  2],
]

Using the positional indices of parents is more efficient but the indices will be invalidated if the sample order changes. I imagine it would be best to store both forms and have a method to validate/update the indices based upon the labels.

A source of ambiguity in this pedigree structure is when a sample has a single parent. Depending on the species or context, this may indicate a missing parent or that the sample is a clone. A workaround would be to have "father", "mother" and "clone" columns though this may complicate IO of ped files. Another issue with missing parents is that analyses have to deal with "partial" founders instead of having a clear distinction between founders and non-founders.

Pedigree kinship calculations

One of the main use cases for working with pedigree data is to estimate (expected) kinship and inbreeding coefficients based on the pedigree structure. Pedigree based relatedness estimates were pioneered by Pearl (1913) and Wright (1922). Gustave Malécot (1948) developed the coefficient of consanguinity (AKA co-ancestry or kinship) which redefined relatedness in terms of probabilities of alleles being identical by descent (IBD).

Kinship is normally calculated by sorting the pedigree topologically and then recursively calculating coefficients between each new individual based on the relationships of its parents and all preceding nodes. There has been quite a bit of research speeding up kinship estimation in large pedigrees, e.g. Kirkpatrick et al (2019).
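To make the recursion concrete, here is a minimal sketch of the simple diploid case. It assumes the pedigree is already topologically sorted, uses an (n, 2) array of parent indices with -1 for unknown parents, and treats founders as unrelated and non-inbred. The function name `pedigree_kinship` is hypothetical, and this is the naive quadratic approach rather than any of the faster published methods.

```python
import numpy as np


def pedigree_kinship(parent):
    """Expected kinship matrix for a diploid pedigree.

    parent: (n_samples, 2) integer array of parent indices, -1 = unknown.
    Rows must be in topological order (parents before offspring).
    """
    parent = np.asarray(parent)
    n = len(parent)
    kinship = np.zeros((n, n))
    for i in range(n):
        p, q = parent[i]
        if max(p, q) >= i:
            raise ValueError("pedigree is not topologically sorted")
        # Kinship between i and an earlier individual j is the mean of the
        # kinships between j and each parent of i (0 for unknown parents).
        for j in range(i):
            k_pj = kinship[p, j] if p >= 0 else 0.0
            k_qj = kinship[q, j] if q >= 0 else 0.0
            kinship[i, j] = kinship[j, i] = 0.5 * (k_pj + k_qj)
        # Self-kinship is (1 + F) / 2, where the inbreeding coefficient F
        # equals the kinship between the two parents.
        k_pq = kinship[p, q] if (p >= 0 and q >= 0) else 0.0
        kinship[i, i] = 0.5 * (1.0 + k_pq)
    return kinship


# Three founders (0, 1, 2) and two half-sibs (3, 4) sharing parent 0.
parent = [[-1, -1], [-1, -1], [-1, -1], [0, 1], [0, 2]]
K = pedigree_kinship(parent)
print(K[3, 4])  # kinship between the half-sibs
```

Selfing falls out of the same recursion: a row whose two parents are identical gets a self-kinship above 0.5 because F is the parent's own self-kinship.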

One of the big issues with pedigree based kinship estimation methods is that they typically assume that founders are unrelated to one another (i.e. kinship = 0 among all founders). This often leads to underestimates of kinship and inbreeding throughout the pedigree. A means to mitigate this issue is to initialise pedigree based estimation by using marker based estimates among the founding individuals. This results in the expected marker kinship among the descendants. I've read a couple of papers that refer to this approach (e.g. Weir and Goudet (2017)) though Meng et al (2019) is the only paper I know of that goes into detail (they handle the x-chromosome in addition to the autosome).

Polyploid and mixed-ploidy data

Kerr et al (2012) and Hamilton and Kerr (2017) generalise pedigree based kinship estimation to auto-polyploids in a very general way which can also handle ploidy manipulation. They introduce the parameters 𝜏 and 𝜆 which respectively indicate the number of genome copies inherited from each parent, and the probability that a pair of alleles in a given gamete descend from a single parental allele (due to a complex phenomenon in auto-polyploids called double-reduction, or due to gamete manipulations).

The 𝜏 parameter is particularly useful for removing ambiguity in pedigree specifications because the 𝜏 values associated with each parent of a given sample should sum to its ploidy. For example, if a single parent is recorded then 𝜏 makes it clear whether the progeny is a clone (𝜏 = ploidy of sample) or if there is a missing parent (𝜏 < ploidy of sample). In theory this could be generalised to a value per contig (or variant) which would distinguish between autosome and sex chromosomes.
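The disambiguation rule above can be sketched in a few lines. The function name `classify_parentage` and its return labels are hypothetical, and the sketch only covers the whole-genome case, not a per-contig generalisation.

```python
def classify_parentage(ploidy, tau):
    """Interpret the recorded parents of a sample from their tau values.

    ploidy: ploidy of the sample.
    tau: number of genome copies inherited from each *recorded* parent.
    """
    total = sum(tau)
    if total > ploidy:
        raise ValueError("tau values exceed the ploidy of the sample")
    if total < ploidy:
        # Some genome copies are unaccounted for by the recorded parents.
        return "missing parent(s)"
    if len(tau) == 1:
        # A single parent contributed every genome copy.
        return "clone"
    return "complete"


print(classify_parentage(4, [4]))     # one parent supplying all four copies
print(classify_parentage(4, [2]))     # one parent supplying only two copies
print(classify_parentage(4, [2, 2]))  # two parents supplying two copies each
```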

The 𝜆 parameter should theoretically vary across the genome in auto-polyploids. It should increase with distance from the centromeres due to double-reduction.

Issues when joining pedigree and marker data

In my experience there are a few common difficulties that arise when combining pedigree and genotype data into a single dataset. The first is simply joining identifiers which often denote a specific sample for genotype data as opposed to an organism in pedigree data.

A more significant issue is dealing with biological replicates of genotypes as there is no perfect way to represent a replicate in a pedigree graph. The best option is to treat replicates as clones, however, this requires deciding on an "original" replicate which the others are clones of. Generally I try to pick a "best" replicate and remove the others, but comparing pedigree and genetic relatedness is a useful method for picking the "best" replicate ... so it's often a chicken and egg situation.

Another issue is dealing with intermediate pedigree nodes for which there is no genetic sample available. Fortunately, Xarray provides some powerful join operations which enable you to insert a genotype with missing allele data (though this may cause issues downstream).

Basic visualisation

I realise that sgkit is not intended to be a visualisation toolkit; however, I think it is necessary to provide some basic visualisation of pedigree data (if pedigree data is supported). This is essential to check that the pedigree structure is correct (especially after a join or subset). It is quite easy to encode a pedigree as a DAG using graphviz which is already a dependency of sgkit via dask. A simple display_pedigree method would go a long way towards usability.
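As a rough idea of how little is needed, the sketch below builds graphviz DOT source from sample IDs and an array of parent indices (-1 for unknown). The function name `pedigree_to_dot` is hypothetical and it only produces a string; rendering is left to graphviz, for example via `graphviz.Source(dot)` from the Python graphviz package or the `dot` command line tool.

```python
def pedigree_to_dot(sample_id, parent):
    """Return DOT source with one edge from each known parent to its offspring."""
    lines = ["digraph pedigree {"]
    # Declare every sample so that unconnected individuals are still drawn.
    for name in sample_id:
        lines.append(f'    "{name}";')
    for child, parents in zip(sample_id, parent):
        for p in parents:
            if p >= 0:  # -1 marks an unknown parent
                lines.append(f'    "{sample_id[p]}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


sample_id = ["A", "B", "C", "D", "E"]
parent = [[-1, -1], [-1, -1], [-1, -1], [0, 1], [0, 2]]
dot = pedigree_to_dot(sample_id, parent)
print(dot)
```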

General graph wrangling

There are some general graph algorithms that are necessary/useful for working with pedigrees. The primary one is topological ordering of pedigree nodes. It's also useful to be able to subset pedigrees based on relationships such as ancestors/descendants. Most of this functionality could be outsourced to networkx.
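For the topological ordering step specifically, here is a minimal sketch that uses the standard library `graphlib` module (Python 3.9+) in place of networkx, purely to keep the example dependency-free. The function name `topological_order` is hypothetical and the input is assumed to be an array of parent indices with -1 for unknown parents.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def topological_order(parent):
    """Return sample indices ordered so that parents precede offspring."""
    # Map each sample to the set of its known parents (its predecessors).
    graph = {
        child: {int(p) for p in parents if p >= 0}
        for child, parents in enumerate(parent)
    }
    # static_order() raises graphlib.CycleError if the pedigree has a cycle.
    return list(TopologicalSorter(graph).static_order())


# Sample 0 is the child of samples 1 and 2, so it is listed out of order.
parent = [[1, 2], [-1, -1], [-1, -1]]
order = topological_order(parent)
print(order)
```

The resulting list of indices could then be used to reorder the samples before running an algorithm that expects ancestors first.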

hammer · 2021-12-23T13:34:34Z

hammer
Dec 23, 2021
Maintainer

Thanks for starting this discussion @timothymillar!

Somewhat related to pedigree is the notion of whether an allele is ancestral vs. derived, discussed at https://github.com/pystatgen/sgkit/discussions/580. I'm linking here just to remind myself of sgkit users who might care about the history of genomes.

0 replies

hammer · 2021-12-23T13:39:30Z

hammer
Dec 23, 2021
Maintainer

For general graph wrangling we've had good luck with NetworkX (cf. https://github.com/related-sciences/nxontology). I think that's a reasonable dependency for us to pick up, as it's just Python. The alternatives, igraph and graph-tool, use C/C++ to improve performance but also complicate build and release.

0 replies

jeromekelleher · 2022-01-05T14:41:09Z

jeromekelleher
Jan 5, 2022
Maintainer

I'm very positive about including pedigree support in sgkit @timothymillar . We've recently added support for pedigrees in tskit, and are currently working on methods to run popgen simulations conditioned on a given pedigree in msprime. Having support for a sensible pedigree representation and conversion utilities from things like (the various forms of) PED would be very helpful when processing pedigrees.

The approach we've taken is to encode pedigree data using the "individual table", where each row corresponds to the information for a particular individual and individuals are referred to by their zero-based integer IDs (very much in the spirit of sgkit). For example, here's a simple simulated Wright Fisher pedigree in an individual table, and the corresponding graph (produced using networkx):

╔══╤═════╤════════╤═══════╤════════╗
║id│flags│location│parents│metadata║
╠══╪═════╪════════╪═══════╪════════╣
║0 │    0│        │   5, 6│     b''║
║1 │    0│        │   7, 8│     b''║
║2 │    0│        │   7, 9│     b''║
║3 │    0│        │   9, 8│     b''║
║4 │    0│        │   5, 9│     b''║
║5 │    0│        │ 13, 11│     b''║
║6 │    0│        │ 12, 13│     b''║
║7 │    0│        │ 10, 11│     b''║
║8 │    0│        │ 12, 12│     b''║
║9 │    0│        │ 11, 11│     b''║
║10│    0│        │ 16, 17│     b''║
║11│    0│        │ 16, 16│     b''║
║12│    0│        │ 17, 18│     b''║
║13│    0│        │ 14, 15│     b''║
║14│    0│        │ -1, -1│     b''║
║15│    0│        │ -1, -1│     b''║
║16│    0│        │ -1, -1│     b''║
║17│    0│        │ -1, -1│     b''║
║18│    0│        │ -1, -1│     b''║
╚══╧═════╧════════╧═══════╧════════╝

(Ignore the flags and location columns, these aren't important).

The way we've encoded the pedigree information is to add a "parents" column to the table. The parents column/array can have any dimension, making things very flexible (actually the column is ragged, so that one row can have 0 parents and others have any number). We found that fixing on a (N, 2) array was more convenient for most purposes. The ID -1 here is used as NULL or "unknown".

What I'd suggest for sgkit pedigree data would be to have something similar. We could have a dataset which at a minimum expects there to be a (N, k) parents array (where k is usually 2, but doesn't have to be), and optionally has fields like sex, phenotype, family_id, individual_id etc.

I think parents is better than mother/father because it's more general, and mother/father is redundant anyway when we specify the sex of the individual.

0 replies

hammer · 2022-01-05T14:57:52Z

hammer
Jan 5, 2022
Maintainer

(for reference tskit-dev/tskit#852 is the tskit pedigree issue)

It seems like this representation would work for us if we don't restrict the parents column to have length 2. That way we can distinguish asexual reproduction from selfing by using 1 parent for the former and 2 identical parents for the latter.

0 replies

timothymillar · 2022-01-06T09:30:46Z

timothymillar
Jan 6, 2022
Maintainer Author

Thanks for the insight @jeromekelleher, I've been looking forward to pedigree based simulations in msprime. Having an optional 'k' is a great idea. I was also imagining a single array (parents is better than sample_edges!) and then mother/father could be used as coordinates along the 'k' dimension if someone wishes.

In your example the sample identifiers are ascending integers and therefore the elements of the parents table can be thought of as IDs or as indices. Did you imagine the parents table in sgkit to contain integer indices or the sample identifiers? My concern with (only) using indices is that dropping a single sample from the dataset can invalidate those indices.
In my current use-case I'm reading genotypes from a VCF, then pedigree data from a database, and joining the two into a single dataset. I typically then subset the dataset to a section of the pedigree which I'm interested in and drop any low quality progeny samples (i.e. leaf nodes). Data-munging is where Xarray really shines so I'd want to be sure our datasets are robust to it.

mother/father is redundant anyway when we specify the sex of the individual.

So long as your species is dioecious!

0 replies

jeromekelleher · 2022-01-06T16:00:09Z

jeromekelleher
Jan 6, 2022
Maintainer

In your example the sample identifiers are ascending integers and therefore the elements of the parents table can be thought of as IDs or as indices. Did you imagine the parents table in sgkit to contain integer indices or the sample identifiers?

Yes, I think it's consistent with our general design philosophy to refer to objects by their integer IDs. For example, we do this for alleles in the genotypes arrays. Generally, integer indexes are much simpler and more useful than (say) string identifiers when using numpy-based APIs. We can also imagine writing algorithms that work with the pedigree information using numba, where having integer indexes as the primary identifier is much simpler and more efficient than reasoning about string identifiers.

Having said that, I totally agree that we would need some way of translating from "external" IDs (like the combined Family ID and Individual ID from plink) to these indexes. I'm imagining that the dataset would contain a string array which contains these "names" which allow us to merge on the string IDs. Ideally we'd have some standard array that would correspond to the external ID, but I don't really know what it would be called.

What do you think?

0 replies

timothymillar · 2022-01-06T19:30:50Z

timothymillar
Jan 6, 2022
Maintainer Author

What do you think?

Perfect! Having the indices is a must, but we also need to store the source of those indices.

We also need a name for the 'k' dimension in xarray. So far sgkit uses singular for array names and plural for dimension names so I'd suggest:

Dimension parents whose length is the (maximum) number of parents (typically 2).
Array parent_id with shape (samples, parents) containing string IDs of parent samples matching those in sample_id.
Array parent with shape (samples, parents) containing the integer indices of parents in the samples dimension using -1 to indicate missing parents.
Method index_parents to generate array parent from parent_id and sample_id using the rule that any ID not found in sample_id is treated as missing (i.e. results in an index of -1).

Or perhaps array parent_index would be less ambiguous than parent?
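A minimal sketch of what the proposed index_parents could look like, written against plain numpy arrays rather than an xarray dataset. The signature is an assumption, and "." is used as the missing-parent label only because it cannot match any sample ID in this example.

```python
import numpy as np


def index_parents(sample_id, parent_id):
    """Map parent IDs to positions in sample_id; IDs not found become -1."""
    lookup = {str(name): i for i, name in enumerate(sample_id)}
    parent_id = np.asarray(parent_id)
    parent = np.full(parent_id.shape, -1, dtype=np.int64)
    for idx, name in np.ndenumerate(parent_id):
        parent[idx] = lookup.get(str(name), -1)
    return parent


sample_id = np.array(["A", "B", "C", "D", "E"])
parent_id = np.array(
    [[".", "."], [".", "."], [".", "."], ["A", "B"], ["A", "C"]]
)
parent = index_parents(sample_id, parent_id)

# Drop sample "B" and regenerate the indices from the surviving labels.
keep = np.array([True, False, True, True, True])
parent_after_drop = index_parents(sample_id[keep], parent_id[keep])
print(parent.tolist())
print(parent_after_drop.tolist())
```

After the drop, "D" keeps parent "A" at index 0 and its second parent becomes -1 because "B" is no longer in the dataset, while "E" is re-pointed at the new positions of "A" and "C".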

Edit:
I think I misunderstood your suggestion for the parent string IDs, is your preference for a 1-dimensional parent_id array?

0 replies

jeromekelleher · 2022-01-07T08:23:57Z

jeromekelleher
Jan 7, 2022
Maintainer

Here's a rough outline of what I'm imagining for a simple trio:

sample_id = ["mom", "dad", "child"]
parents = [
   [-1, -1],
   [-1, -1],
   [0, 1]
]

I've used sample_id here as the string identifier, so that we are consistent with the variant dataset. We want to be able to merge and munge easily with these, so I think it makes sense to follow the terminology as closely as possible.

Maybe we should see this as a "sample specification" dataset, more generally, where the pedigree information is just one of the columns we can imagine defining? I.e., filling out the details of this nice picture:

I don't see much point in storing the parent string IDs - can't we just do sample[parents] and get the same thing?
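For illustration, here is a minimal numpy sketch of deriving the parent string IDs from the indices for the trio above. One detail to watch is that plain fancy indexing treats -1 as "last element", so unknown parents need to be masked; the "." placeholder for a missing parent is an assumption.

```python
import numpy as np

sample_id = np.array(["mom", "dad", "child"])
parents = np.array([[-1, -1], [-1, -1], [0, 1]])

# sample_id[parents] alone would map -1 to the last sample ("child"),
# so unknown parents are replaced with a placeholder explicitly.
parent_id = np.where(parents >= 0, sample_id[parents], ".")
print(parent_id.tolist())
```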

Re singular/plural, I think you're probably right that it should be parent here, following the example of call_genotype etc.

0 replies

timothymillar · 2022-01-07T20:20:33Z

timothymillar
Jan 7, 2022
Maintainer Author

The issue I see with this is that the sample_id array shares the samples dimension with other arrays in a dataset. So dropping a sample or sorting samples for any reason will invalidate the parent indices. If I add the dimensions to your example:

>>> ds["sample_id"] = ["samples"], ["mom", "dad", "child"]
>>> ds["parents"] = ["samples", "parents"], [
...    [-1, -1],
...    [-1, -1],
...    [0,   1]
...]

then drop the second sample

>>> ds = ds.sel(dict(samples=[True, False, True]))

resulting in

>>> ds["sample_id"]
["mom",  "child"]

>>> ds["parents"]
[
   [-1, -1],
   [0,   1]
]

so the child is now recorded as its own parent.

If we repeated that process with a parent_id array we get

>>> ds["parent_id"]
[
   [".", "."],
   ["mom",   "dad"]
]

which can be used with sample_id to regenerate the parent indices as

>>> ds["parents"]
[
   [-1, -1],
   [0,  -1]
]

which reflects that one of the parents is no longer present in the dataset.

Maybe we should see this as a "sample specification" dataset

Agreed, all columns in a ped file map to the samples dimension.

3 replies

jeromekelleher Jan 9, 2022
Maintainer

So dropping a sample or sorting samples for any reason will invalidate the parent indices. If I add the dimensions to your example:

Yes, that's a good point. Isn't this also true for variant data though? We don't use string sample IDs to index into the call_genotype array.

Thinking the workflow through as a combined pedigree+variants dataset is a good idea - what would the subset-of-samples operation look like in both cases?

timothymillar Jan 9, 2022
Maintainer Author

Isn't this also true for variant data

This is technically an issue for the relationship between variant_allele and call_data arrays because the latter contains indices into the former (indices for the alleles dimension). However I think we have been running with the assumption that the alleles dimension is immutable as this could cause numerous issues (e.g. allele frequencies not summing to 1).

It's actually more likely to be an issue for the window_start and window_stop arrays which contain indices into the variants dimension.
We should probably document that windowing should only be applied after variant filtering to avoid this issue.

what would the subset-of-samples look like

I imagine that any sub-setting would make use of the xarray methods ds.sel or ds.drop_sel. So sub-setting would involve generating an array of indices to be kept/dropped and then letting xarray apply that selection to the dataset. It's trivial to generate an index of samples for things like genotype quality e.g.

ds2 = ds.sel(dict(
    samples=(ds.call_GQ.mean(dim="variants") > 30).values
))

More complex operations like sub-setting to the ancestors of an individual within a pedigree would require a function that can generate the indices of ancestor nodes. The simplest way to do this is to generate a networkx DiGraph from the parent or parent_id arrays (which are essentially arrays of edges) and then use networkx to generate a list of preceding node indices which can be fed into ds.sel().
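As a rough sketch of the ancestor lookup, the function below walks the parent array directly with a stack instead of building a networkx DiGraph, simply to keep the example free of dependencies. The name `ancestor_indices` is hypothetical and parents are assumed to be stored as integer indices with -1 for unknown.

```python
def ancestor_indices(parent, target):
    """Return the sorted indices of `target` and all of its known ancestors."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited via another path (e.g. inbreeding)
        seen.add(node)
        # Queue the known parents of this node; -1 marks an unknown parent.
        stack.extend(int(p) for p in parent[node] if p >= 0)
    return sorted(seen)


# Samples 0-2 are founders, 3 is the child of (0, 1), 4 is the child of (0, 2).
parent = [[-1, -1], [-1, -1], [-1, -1], [0, 1], [0, 2]]
keep = ancestor_indices(parent, 4)
print(keep)
```

The resulting list of positions could then be passed to an xarray selection such as ds.isel(samples=keep), after which the parent indices would need regenerating from the string IDs.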

I think the key detail here is storing sufficient data to correct indices after their dimension has been altered. More generally I think we need to clearly document methods that generate arrays of indices and encourage users to run them just prior to their use. This is a fundamental issue with storing indices of dimensions that may change, so it requires some guidelines for best practice.

jeromekelleher Jan 10, 2022
Maintainer

SGTM @timothymillar, I think we've got a good sketch here of what the requirements are and the API should look like.

hammer · 2022-01-09T21:16:53Z

hammer
Jan 9, 2022
Maintainer

It's actually more likely to be an issue for the window_start and window_stop arrays which contain indices into the variants dimension.
We should probably document that windowing should only be applied after variant filtering to avoid this issue.

Ah, that's a great point. Could you file an issue?

1 reply

timothymillar Jan 9, 2022
Maintainer Author

Done: #795
