generalise `simulateAlleleFreq()` function to any ploidy

In #5 I  asked about the `bulkSize` option in `runQTLseqAnalysis()` if:

>is it the case that this parameter should be the number of genome copies pooled, which for a diploid it would be 2x number of individuals pooled?

The answer is _no_, `bulkSize` should be the number of diploid individuals. I see this is right, because of the way the null expectation is being simulated.

I guess there are two levels to the simulation, because there are two levels of sampling:

* First we sample individuals from the population and calculate what the alternative allele frequency is (`simulateAlleleFreq()` function).
* Then we sample some reads in the locus (`simulateSNPindex()` function).

I hand't noticed but indeed the first level of the simulation is assuming individuals are diploid, because it samples diploid individual genotypes (`c(0, 0.5, 1)` with probabilities relating to the expected segregation ratios in an F2 (`c(1, 2, 1)/4`).

But what if one is working with higher ploidy? Then the above simulation would not work. 

However, the way it is implement at the moment, I think is equivalent to sampling from a binomial with probability of the event (picking an alternative allele) being 0.5 and number of trials being equal to the number of alleles sampled (2 x number of individuals).

To illustrate with code:

```r
# Define parameters of simulation
nind <- 100  # number of sampled individuals
ploidy <- 2  # assuming diploid
replications <- 1e6  # make a lot of replications also to show how second method is faster

# Current implementation - uses number of sampled individuals
sim1 <- replicate(replications, 
                  mean(sample(c(0, 0.5, 1), size = nind, 
                              prob = c(1, 2, 1)/4, replace = TRUE)))

# Other way - sample alternative alleles with probability 0.5 from n trials given by 
## the number of alleles, which is ploidy * nind
sim2 <- rbinom(replications, ploidy*nind, 0.5)/(ploidy*nind)

# Check they are equivalent
ggplot(data.frame(sim1, sim2)) +
  geom_density(aes(sim1)) +
  geom_density(aes(sim2), colour = "pink", linetype = "dashed")
```

I guess the advantage is that this is general, regardless of the ploidy (besides the bonus of being faster). 

I think the RIL implementation is already general as it is, because in that case we assume the individuals are fully homozygous, in which case they are equivalent to "haploid" organisms. In any case, I think the implementation can be also be made faster, by sampling from a binomial:

```r
# Current implementation
ril1 <- replicate(replications, 
                  mean(sample(
                    x = c(0, 1),
                    size = nind,
                    prob = c(1, 1),
                    replace = TRUE
                  )))


# Other way - null expectation is also allele frequency of 0.5, except now the number of trials
## is the number of individuals (because we can look at them as equivalent to haploid organisms)
ril2 <- rbinom(replications, nind, 0.5)/(nind)

# Check they are equivalent
ggplot(data.frame(ril1, ril2)) +
  geom_density(aes(ril1)) +
  geom_density(aes(ril2), colour = "pink", linetype = "dashed")
```

@bmansfeld please do check all of this, as I might be making some wrong assumption somewhere (I should also probably go and read the Tagaki paper in more detail! :smile:)




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

generalise `simulateAlleleFreq()` function to any ploidy #6

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

generalise simulateAlleleFreq() function to any ploidy #6

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

generalise `simulateAlleleFreq()` function to any ploidy #6