diff --git a/src/metrics/average_batch_r2/config.vsh.yaml b/src/metrics/average_batch_r2/config.vsh.yaml index 348cded..3acdfd3 100644 --- a/src/metrics/average_batch_r2/config.vsh.yaml +++ b/src/metrics/average_batch_r2/config.vsh.yaml @@ -8,20 +8,64 @@ info: metrics: # A unique identifier for your metric (required). # Can contain only lowercase letters or underscores. - - name: average_batch_r2 + - name: average_batch_r2_global # A relatively short label, used when rendering visualisarions (required) - label: Average Batch R-squared ($\overline{R^2_B}$) + label: Average Batch R-squared Global # A one sentence summary of how this metric works (required). Used when # rendering summary tables. summary: "The average batch R-squared quantifies, on average, how strongly the batch variable B explains the variance in the data." # A multi-line description of how this component works (required). Used # when rendering reference documentation. description: | - First, a simple linear model `sklearn.linear_model.LinearRegression` is fitted for each paired sample, marker (and cell type) to determine the fraction of variance (R^2) explained by the batch covariate B. | - The average batch R_squared is then computed as the average of the $R^2$ values across all paired samples, markers (and cell types). | - As a result, $\overline{R^2_B}$ quantifies how much of the total variability in the data is driven by batch effects. Consequently, a lower values are desirable. | + First, a simple linear model `sklearn.linear_model.LinearRegression` is fitted for each paired sample and marker to determine the fraction of variance ($R^2$) explained by the batch covariate B. | + The average batch R-squared is then computed as the average of the $R^2$ values across all paired samples and markers. | + As a result, $\overline{R^2_B}_{global}$ quantifies how much of the total variability in the data is driven by batch effects. Consequently, lower values are desirable. 
| - $\overline{R^2_B} \text{} = \frac{1}{N*C*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{j=1}^{C} \sum_{i=1}^{M}\,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$ + $\overline{R^2_B}_{global} = \frac{1}{N*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{i=1}^{M} \,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$ + + Where: + - $N$ is the number of paired samples, where $x_{\mathrm{int}}$ is the replicate that has been batch-corrected and $x_{\mathrm{val}}$ is the replicate used for validation. Paired samples belong to different batches. + - $M$ is the number of markers + - $B$ is the batch covariate + + A higher value of $\overline{R^2_B}_{global}$ indicates that the batch variable explains more of the variance in the data, which indicates a higher level of batch effects. | + + + references: + bibtex: + - | + @book{draper1998applied, + title={Applied regression analysis}, + author={Draper, Norman R and Smith, Harry}, + publisher={John Wiley \& Sons} + } + links: + # URL to the documentation for this metric (required). + documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html + # URL to the code repository for this metric (required). + repository: https://github.com/scikit-learn/scikit-learn + # The minimum possible value for this metric (required) + min: -0.001 + # The maximum possible value for this metric (required) + max: 1 + # Whether a higher value represents a 'better' solution (required) + maximize: false + + + - name: average_batch_r2_ct + # A relatively short label, used when rendering visualisations (required) + label: Average Batch R-squared Cell Type + # A one sentence summary of how this metric works (required). Used when + # rendering summary tables. 
+ summary: "The average batch R-squared Cell Type quantifies, on average, how strongly the batch variable B explains the variance in the data (by taking into account cell type effect)." + # A multi-line description of how this component works (required). Used + # when rendering reference documentation. + description: | + First, a simple linear model `sklearn.linear_model.LinearRegression` is fitted for each paired sample, marker and cell type to determine the fraction of variance (R^2) explained by the batch covariate B. | + The average batch R_squared is then computed as the average of the $R^2$ values across all paired samples, markers and cell types. | + As a result, $\overline{R^2_B}_{cell\ type}$ quantifies how much of the total variability in the data is driven by batch effects. Consequently, lower values are desirable. | + + $\overline{R^2_B}_{cell\ type} = \frac{1}{N*C*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{j=1}^{C} \sum_{i=1}^{M}\,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$ Where: - $N$ is the number of paired samples, where x_{\mathrm{int}} is the replicate that has been batch-corrected and x_{\mathrm{val}} is replicate used for validation. Paired samples belong to different batches. @@ -31,11 +75,9 @@ info: The $\overline{Rˆ2_B}_{global}$ is a variation of the latter metric, where the average is computed across paired samples and markers only, without taking into account the cell types. | - $\overline{R^2_B}_{global} = \frac{1}{N*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{i=1}^{M} \,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$ - - A higher value of $\overline{R^2_B}$ indicates that the batch variable explains more of the variance in the data, which indicates a higher level of batch effects. 
| + A higher value of $\overline{R^2_B}_{global}$ or $\overline{R^2_B}_{cell\ type}$ indicates that the batch variable explains more of the variance in the data, which indicates a higher level of batch effects. | - A good performance on $\overline{R^2_B}_{global} but not on $\overline{R^2_B}$ might indicate that the batch effect correction is discarding cell type specific batch effects. | + A good performance on $\overline{R^2_B}_{global}$ but not on $\overline{R^2_B}_{cell\ type}$ might indicate that the batch effect correction is discarding cell type specific batch effects. | references: bibtex: @@ -57,6 +99,7 @@ info: # Whether a higher value represents a 'better' solution (required) maximize: false + # Component-specific parameters (optional) # arguments: # - name: "--n_neighbors" diff --git a/src/metrics/n_inconsistent_peaks/config.vsh.yaml b/src/metrics/n_inconsistent_peaks/config.vsh.yaml index e417211..e42a05c 100644 --- a/src/metrics/n_inconsistent_peaks/config.vsh.yaml +++ b/src/metrics/n_inconsistent_peaks/config.vsh.yaml @@ -7,17 +7,17 @@ name: n_inconsistent_peaks info: metrics: - name: n_inconsistent_peaks - label: Number of inconsistent peaks + label: Number of inconsistent peaks (Global) # A one sentence summary of how this metric works (required). Used when # rendering summary tables. - summary: "Compare the number of marker-expression peaks between validation and batch-normalized data." + summary: "Comparison of the number of marker-expression peaks between validation and batch-normalized data." # A multi-line description of how this component works (required). Used # when rendering reference documentation. description: | - The metric compares the number of marker-expression peaks between the validation and batch-normalized data. + The metric compares the number of marker expression peaks between the validation and batch-normalized data. The number of peaks is calculated using the `scipy.signal.find_peaks` function. 
The metric is calculated as the absolute difference between the number of peaks in the validation and batch-normalized data. - The marker-expression profiles are first smoothed using kernel density estimation (KDE) (`scipy.stats.gaussian_kde`), + The marker expression profiles are first smoothed using kernel density estimation (KDE) (`scipy.stats.gaussian_kde`), and then peaks are then identified using the `scipy.signal.find_peaks` function. For peak calling, the `prominence` parameter is set to 0.1 and the `height` parameter is set to 0.05*max_density. references: @@ -31,7 +31,32 @@ info: # The minimum possible value for this metric (required) min: 0 # The maximum possible value for this metric (required) - max: inf + max: +.inf + # Whether a higher value represents a 'better' solution (required) + maximize: false + + - name: n_inconsistent_peaks_ct + label: Number of inconsistent peaks (Cell Type) + summary: "Comparison of the number of cell-type marker-expression peaks between validation and batch-normalized data." + description: | + The metric compares the number of cell type specific marker expression peaks between the validation and batch-normalized data. + The number of peaks is calculated using the `scipy.signal.find_peaks` function. + The metric is calculated as the absolute difference between the number of peaks in the validation and batch-normalized data. + The (cell type) marker expression profiles are first smoothed using kernel density estimation (KDE) (`scipy.stats.gaussian_kde`), + and peaks are then identified using the `scipy.signal.find_peaks` function. + For peak calling, the `prominence` parameter is set to 0.1 and the `height` parameter is set to 0.05*max_density. + references: + doi: + - 10.1038/s41592-019-0686-2 + links: + # URL to the documentation for this metric (required). 
+ documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html#scipy.signal.find_peaks + # URL to the code repository for this metric (required). + repository: https://github.com/scipy/scipy/blob/v1.15.2/scipy/signal/_peak_finding.py#L0-L1 + # The minimum possible value for this metric (required) + min: 0 + # The maximum possible value for this metric (required) + max: +.inf # Whether a higher value represents a 'better' solution (required) maximize: false
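Review note: the peak-counting procedure described in the `n_inconsistent_peaks` descriptions above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the component's actual script. The function names `count_peaks` and `n_inconsistent_peaks`, the `grid_size` parameter, and the choice to evaluate the KDE on an evenly spaced grid between the data minimum and maximum are all hypothetical. The config also does not say whether the density is rescaled before `prominence=0.1` is applied, so the sketch applies it to the raw KDE values.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde


def count_peaks(values, grid_size=512):
    """Count density peaks in one marker's expression values.

    Assumption: the KDE is evaluated on an evenly spaced grid spanning
    the observed range; `grid_size` is a hypothetical parameter.
    """
    values = np.asarray(values, dtype=float)
    grid = np.linspace(values.min(), values.max(), grid_size)
    # Smooth the expression profile with a Gaussian KDE.
    density = gaussian_kde(values)(grid)
    # Thresholds taken from the config description:
    # prominence = 0.1, height = 0.05 * max_density.
    peaks, _ = find_peaks(density, prominence=0.1, height=0.05 * density.max())
    return len(peaks)


def n_inconsistent_peaks(validation, corrected):
    """Absolute difference in peak counts between the two samples."""
    return abs(count_peaks(validation) - count_peaks(corrected))
```

For the cell type variant, the same comparison would presumably be run once per cell type and marker on the corresponding subset of cells; how the per-subset values are aggregated is not stated in this diff.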