Skip to content

Module : DGE with SCDE

GBrelurut edited this page Jan 5, 2018 · 4 revisions

Module : DGE with SCDE

This module calculate for each gene the probability of differential expression.

  • Internal name : scde

  • Avalaible : local mode

  • Input Ports :

    • matrix : filtered count matrix (tsv)
    • cells : filtered cells metadata (tsv)
    • genes : genes metadata (tsv)
  • Output Ports :

    • dgeoutput : differential expression result (tsv)
  • Optional parameters :

    • Main Parameters
Parameter Type Description Default Value
n.cores int Number of cores to use 1
model.group.col string Name of the column if cells must be grouped for model fitting (NULL if no grouping is necessary) NULL
prior.length int Number of points for prior calculation 400
batch.col string Name of the column indicating batches, if batch correction is required (else NULL) NULL
n.randomizations int Number of randomization for testing 150
  • Parameters for model fitting
Parameter Type Description Default Value
min.observation int Minimal number of observations for a gene to be used for model fitting 3
min.genes int Minimum number of genes for model fitting 2000
threshold.segmentation boolean Use or not threshold segmentation to accelerate failure estimation TRUE
failure.threshold int Number of reads indicating a gene failed amplification 4
max.pairs int Maximum number of comparisons that should be performed per group for estimation of dropout rate 5000
min.pairs int Minimum number of comparisons that should be performed per group for estimation of dropout rate 10
poisson.param float Parameter of the Poisson distribution used to model failures 0.1
linear.fit boolean Weither to use linear fit for model fitting (highly recommanded) TRUE
min.theta float Minimum for the dispersion parameter of the negative binomial 0.01
max.theta float Maximum for the dispersion parameter of the negative binomial 100
  • Parameters for prior calculation
Parameter Type Description Default Value
save.prior.plot boolean Weither to save or not prior plot TRUE
pseudocount int Pseudocount to add to observation before log transforming them 1
quantile float Quantile used to set maximum expression value 0,999
max.value float Alternative to quantile, maximum expression value NULL
  • Parameters for test
Parameter Type Description Default Value
return.posteriors boolean Weither to return or not the posteriors TRUE
  • Configuration example
<step id="DGE" skip="false">
	<module>scde</module>
	<parameters>
		<parameter>
			<name>prior.length</name>
			<value>400</value>	
		</parameter>
		<parameter>
			<name>n.randomizations</name>
			<value>200</value>	
		</parameter>
		<parameter>
			<name>n.cores</name>
			<value>12</value>	
		</parameter>
	  </parameters>
</step>

Interpreting output files

Introduction to scde

SCDE uses a mixture model : one distribution is for amplification failures, the other is for success ("real gene expression").

First step of the algorithm is to fit the model. Briefly, this steps estimate failure by comparing gene expression between cells, and then use success to fit a negative binomial (thus setting mean and dispersion parameter).

Second step calculates a prior distribution on mean expression, using all genes and all cells.

Third and last step realize a log fold change Bayesian testing. Its results are gathered in the output file. In this file :

  • lb column is the lower bound of the 95% HDI (analogous to 95% CI)
  • mle is the maximum likelihood estimate of the fold change
  • ub is the upper bound of the 95% HDI (analogous to 95% CI)
  • ce is the bound closer to 0, thus it is a kind of "corrected" estimate
  • Z is the quantile in a norm Gaussian distribution associated with the probability of the ratio being 0
  • cZ is the same as Z after correction by benjamini-hochberg procedure
  • cProbability is the probability associated to cZ

Evaluation of model fitting and misfit removal

In order to evaluate goodness of fit of the model, the module produces several plot for each cell :

FittingPlot

From left to rigth plots are :

  • scatter plot of observed count as a function of expected count
  • same plot with correlation and failures according to the model and 95% CI as a dashed line
  • probability of failure as a function of expected count
  • fit of the dispersion as a function of count

The easier way to find a badly fitted model is without a doubt to use the probability of failure plot :

fail

One of the model hypothesis is that failure is more likely to occur with few transcripts. So we expect to see high probabilities on the left of the plot, and low probabilities on its right. The left plot shows this kind of distribution, but the right one is nearly uniform indicating a problem with model fitting.

Clone this wiki locally