-
Notifications
You must be signed in to change notification settings - Fork 0
Module : DGE with SCDE
This module calculate for each gene the probability of differential expression.
-
Internal name : scde
-
Avalaible : local mode
-
Input Ports :
- matrix : filtered count matrix (tsv)
- cells : filtered cells metadata (tsv)
- genes : genes metadata (tsv)
-
Output Ports :
- dgeoutput : differential expression result (tsv)
-
Optional parameters :
- Main Parameters
Parameter | Type | Description | Default Value |
---|---|---|---|
n.cores | int | Number of cores to use | 1 |
model.group.col | string | Name of the column if cells must be grouped for model fitting (NULL if no grouping is necessary) | NULL |
prior.length | int | Number of points for prior calculation | 400 |
batch.col | string | Name of the column indicating batches, if batch correction is required (else NULL) | NULL |
n.randomizations | int | Number of randomization for testing | 150 |
- Parameters for model fitting
Parameter | Type | Description | Default Value |
---|---|---|---|
min.observation | int | Minimal number of observations for a gene to be used for model fitting | 3 |
min.genes | int | Minimum number of genes for model fitting | 2000 |
threshold.segmentation | boolean | Use or not threshold segmentation to accelerate failure estimation | TRUE |
failure.threshold | int | Number of reads indicating a gene failed amplification | 4 |
max.pairs | int | Maximum number of comparisons that should be performed per group for estimation of dropout rate | 5000 |
min.pairs | int | Minimum number of comparisons that should be performed per group for estimation of dropout rate | 10 |
poisson.param | float | Parameter of the Poisson distribution used to model failures | 0.1 |
linear.fit | boolean | Weither to use linear fit for model fitting (highly recommanded) | TRUE |
min.theta | float | Minimum for the dispersion parameter of the negative binomial | 0.01 |
max.theta | float | Maximum for the dispersion parameter of the negative binomial | 100 |
- Parameters for prior calculation
Parameter | Type | Description | Default Value |
---|---|---|---|
save.prior.plot | boolean | Weither to save or not prior plot | TRUE |
pseudocount | int | Pseudocount to add to observation before log transforming them | 1 |
quantile | float | Quantile used to set maximum expression value | 0,999 |
max.value | float | Alternative to quantile, maximum expression value | NULL |
- Parameters for test
Parameter | Type | Description | Default Value |
---|---|---|---|
return.posteriors | boolean | Weither to return or not the posteriors | TRUE |
- Configuration example
<step id="DGE" skip="false">
<module>scde</module>
<parameters>
<parameter>
<name>prior.length</name>
<value>400</value>
</parameter>
<parameter>
<name>n.randomizations</name>
<value>200</value>
</parameter>
<parameter>
<name>n.cores</name>
<value>12</value>
</parameter>
</parameters>
</step>
SCDE uses a mixture model : one distribution is for amplification failures, the other is for success ("real gene expression").
First step of the algorithm is to fit the model. Briefly, this steps estimate failure by comparing gene expression between cells, and then use success to fit a negative binomial (thus setting mean and dispersion parameter).
Second step calculates a prior distribution on mean expression, using all genes and all cells.
Third and last step realize a log fold change Bayesian testing. Its results are gathered in the output file. In this file :
- lb column is the lower bound of the 95% HDI (analogous to 95% CI)
- mle is the maximum likelihood estimate of the fold change
- ub is the upper bound of the 95% HDI (analogous to 95% CI)
- ce is the bound closer to 0, thus it is a kind of "corrected" estimate
- Z is the quantile in a norm Gaussian distribution associated with the probability of the ratio being 0
- cZ is the same as Z after correction by benjamini-hochberg procedure
- cProbability is the probability associated to cZ
In order to evaluate goodness of fit of the model, the module produces several plot for each cell :
From left to rigth plots are :
- scatter plot of observed count as a function of expected count
- same plot with correlation and failures according to the model and 95% CI as a dashed line
- probability of failure as a function of expected count
- fit of the dispersion as a function of count
The easier way to find a badly fitted model is without a doubt to use the probability of failure plot :
One of the model hypothesis is that failure is more likely to occur with few transcripts. So we expect to see high probabilities on the left of the plot, and low probabilities on its right. The left plot shows this kind of distribution, but the right one is nearly uniform indicating a problem with model fitting.
This module requires a v2 design file. The design file must contain experiments with the columns attribute, indicating required columns for grouping the cells, and a comparisons attribute, listing and naming required comparisons.
The module can handle model attribute as for DESeq2, however it will not build a GLM, therefore it will uniquely test factors separately or one interaction of factors.
[Header]
DesignFormatVersion=2
Project=designExample
ProjectDescription=Example of design file for single cell DGE analysis
Owner=Geoffray Brelurut
GenomeFile=genome://mm10
GffFile=gff://mm10
[Experiments]
Exp.1.name=exp1
Exp.1.skip=False
Exp.1.columns=type+day
Exp.1.comparisons=WT1_vs_KO1:typeWT%dayday1_vs_typeKO%dayday1;\
WT2_vs_KO2:typeWT%dayday2_vs_typeKO%dayday2
Exp.2.name=exp2
Exp.2.skip=false
Exp.2.model=~type+day+type:day
Exp.2.comparisons=WT1_vs_KO1:typeWT%dayday1_vs_typeKO%dayday1;\
WT2_vs_KO2:typeWT%dayday2_vs_typeKO%dayday2
[Columns]
SampleId SampleName Reads Condition Reference Exp.1.RepTechGroup Exp.2.type Exp.2.day Exp.2.RepTechGroup Exp.3.Condition Exp.3.RepTechGroup
s1a Sample1a sample1a.fastq WT-day1 1 WT-day1 WT day1 WT-day1 WT WT-day1
s1b Sample1b sample1a.fastq WT-day1 0 WT-day1 WT day1 WT-day1 WT WT-day1
s2a Sample2a sample2a.fastq WT-day2 2 WT-day2 WT day2 WT-day2 WT WT-day2
s2b Sample2b sample2b.fastq WT-day2 0 WT-day2 WT day2 WT-day2 WT WT-day2
s3a Sample3a sample3a.fastq KO-day1 3 KO-day1 KO day1 KO-day1 KO KO-day1
s3b Sample3b sample3b.fastq KO-day1 0 KO-day1 KO day1 KO-day1 KO KO-day1
s4a Sample4a sample4a.fastq KO-day2 0 KO-day2 KO day2 KO-day2 KO KO-day2
s4b Sample4b sample4b.fastq KO-day2 0 KO-day2 KO day2 KO-day2 KO KO-day2