openproblems-bio · rcannood · Mar 19, 2025 · Mar 19, 2025 · Mar 19, 2025 · Mar 19, 2025
diff --git a/src/metrics/average_batch_r2/config.vsh.yaml b/src/metrics/average_batch_r2/config.vsh.yaml
@@ -8,20 +8,64 @@ info:
   metrics:
       # A unique identifier for your metric (required).
       # Can contain only lowercase letters or underscores.
-    - name: average_batch_r2
+    - name: average_batch_r2_global
       # A relatively short label, used when rendering visualisarions (required)
-      label: Average Batch R-squared ($\overline{R^2_B}$)
+      label: Average Batch R-squared Global
       # A one sentence summary of how this metric works (required). Used when 
       # rendering summary tables.
       summary: "The average batch R-squared quantifies, on average, how strongly the batch variable B explains the variance in the data."
       # A multi-line description of how this component works (required). Used
       # when rendering reference documentation.
       description: |
-        First, a simple linear model `sklearn.linear_model.LinearRegression` is fitted for each paired sample, marker (and cell type) to determine the fraction of variance (R^2) explained by the batch covariate B. |
-        The average batch R_squared is then computed as the average of the $R^2$ values across all paired samples, markers (and cell types). |
-        As a result, $\overline{R^2_B}$ quantifies how much of the total variability in the data is driven by batch effects. Consequently, a lower values are desirable. |
+        First, a simple linear model `sklearn.linear_model.LinearRegression` is fitted for each paired sample and marker to determine the fraction of variance (R^2) explained by the batch covariate B. |
+        The average batch R_squared is then computed as the average of the $R^2$ values across all paired samples, markers. |
+        As a result, $\overline{R^2_B}_{global}$ quantifies how much of the total variability in the data is driven by batch effects. Consequently, lower values are desirable. |
 
-        $\overline{R^2_B} \text{} = \frac{1}{N*C*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{j=1}^{C} \sum_{i=1}^{M}\,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$
+        $\overline{R^2_B}_{global} = \frac{1}{N*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{i=1}^{M} \,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$
+
+        Where:
+        - $N$ is the number of paired samples, where x_{\mathrm{int}} is the replicate that has been batch-corrected and x_{\mathrm{val}} is replicate used for validation. Paired samples belong to different batches.
+        - $M$ is the number of markers
+        - $B$ is the batch covariate
+
+        A higher value of $\overline{R^2_B}_{global}$ indicates that the batch variable explains more of the variance in the data, which indicates a higher level of batch effects. |
+
+
+      references:
+        bibtex:
+          - |
+            @book{draper1998applied,
+            title={Applied regression analysis},
+            author={Draper, Norman R and Smith, Harry},
+            publisher={John Wiley \& Sons}
+            }
+      links:
+        # URL to the documentation for this metric (required).
+        documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
+        # URL to the code repository for this metric (required).
+        repository: https://github.com/scikit-learn/scikit-learn
+      # The minimum possible value for this metric (required)
+      min: -0.001
+      # The maximum possible value for this metric (required)
+      max: 1
+      # Whether a higher value represents a 'better' solution (required)
+      maximize: false
+
+
+    - name: average_batch_r2_ct
+      # A relatively short label, used when rendering visualisarions (required)
+      label: Average Batch R-squared Cell Type
+      # A one sentence summary of how this metric works (required). Used when 
+      # rendering summary tables.
+      summary: "The average batch R-squared Cell Type quantifies, on average, how strongly the batch variable B explains the variance in the data (by taking into account cell type effect)."
+      # A multi-line description of how this component works (required). Used
+      # when rendering reference documentation.
+      description: |
+        First, a simple linear model `sklearn.linear_model.LinearRegression` is fitted for each paired sample, marker and cell type to determine the fraction of variance (R^2) explained by the batch covariate B. |
+        The average batch R_squared is then computed as the average of the $R^2$ values across all paired samples, markers and cell types. |
+        As a result, $\overline{R^2_B}_{cell\ type}$ quantifies how much of the total variability in the data is driven by batch effects. Consequently, lower values are desirable. |
+
+        $\overline{R^2_B}_{cell\ type} = \frac{1}{N*C*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{j=1}^{C} \sum_{i=1}^{M}\,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$
 
         Where:
         - $N$ is the number of paired samples, where x_{\mathrm{int}} is the replicate that has been batch-corrected and x_{\mathrm{val}} is replicate used for validation. Paired samples belong to different batches.
@@ -31,11 +75,9 @@ info:
 
         The $\overline{Rˆ2_B}_{global}$ is a variation of the latter metric, where the average is computed across paired samples and markers only, without taking into account the cell types. |
 
-        $\overline{R^2_B}_{global} = \frac{1}{N*M}\sum_{\substack{(x_{\mathrm{int}},\,x_{\mathrm{val}})\\ \text{paired samples}}}^{N} \sum_{i=1}^{M} \,R^2\!\bigl(\mathrm{marker}_i \mid B\bigr)$
-
-        A higher value of $\overline{R^2_B}$ indicates that the batch variable explains more of the variance in the data, which indicates a higher level of batch effects. |
+        A higher value of $\overline{R^2_B}_{global}$ or $\overline{R^2_B}_{cell\ type}$ indicates that the batch variable explains more of the variance in the data, which indicates a higher level of batch effects. |
 
-        A good performance on $\overline{R^2_B}_{global} but not on $\overline{R^2_B}$ might indicate that the batch effect correction is discarding cell type specific batch effects. |
+        A good performance on $\overline{R^2_B}_{global}$ but not on $\overline{R^2_B}_{cell\ type}$ might indicate that the batch effect correction is discarding cell type specific batch effects. |
 
       references:
         bibtex:
@@ -57,6 +99,7 @@ info:
       # Whether a higher value represents a 'better' solution (required)
       maximize: false
 
+
 # Component-specific parameters (optional)
 # arguments:
 #   - name: "--n_neighbors"

diff --git a/src/metrics/n_inconsistent_peaks/config.vsh.yaml b/src/metrics/n_inconsistent_peaks/config.vsh.yaml
@@ -7,17 +7,17 @@ name: n_inconsistent_peaks
 info:
   metrics:
     - name: n_inconsistent_peaks
-      label: Number of inconsistent peaks
+      label: Number of inconsistent peaks Global
       # A one sentence summary of how this metric works (required). Used when 
       # rendering summary tables.
-      summary: "Compare the number of marker-expression peaks between validation and batch-normalized data."
+      summary: "Comparison of the number of marker‑expression peaks between validation and batch‑normalized data."
       # A multi-line description of how this component works (required). Used
       # when rendering reference documentation.
       description: |
-        The metric compares the number of marker-expression peaks between the validation and batch-normalized data. 
+        The metric compares the number of marker expression peaks between the validation and batch-normalized data. 
         The number of peaks is calculated using the `scipy.signal.find_peaks` function. 
         The metric is calculated as the absolute difference between the number of peaks in the validation and batch-normalized data.
-        The marker-expression profiles are first smoothed using kernel density estimation (KDE) (`scipy.stats.gaussian_kde`),
+        The marker expression profiles are first smoothed using kernel density estimation (KDE) (`scipy.stats.gaussian_kde`),
         and then peaks are then identified using the `scipy.signal.find_peaks` function.
         For peak calling, the `prominence` parameter is set to 0.1 and the `height` parameter is set to 0.05*max_density.
       references:
@@ -31,7 +31,32 @@ info:
       # The minimum possible value for this metric (required)
       min: 0
       # The maximum possible value for this metric (required)
-      max: inf
+      max: +.inf
+      # Whether a higher value represents a 'better' solution (required)
+      maximize: false
+
+    - name: n_inconsistent_peaks_ct
+      label: Number of inconsistent peaks (Cell Type)
+      summary: "Comparison of the number of cell‑type marker‑expression peaks between validation and batch‑normalized data."
+      description: |
+        The metric compares the number of cell type specific marker expression peaks between the validation and batch-normalized data. 
+        The number of peaks is calculated using the `scipy.signal.find_peaks` function. 
+        The metric is calculated as the absolute difference between the number of peaks in the validation and batch-normalized data.
+        The (cell type) marker expression profiles are first smoothed using kernel density estimation (KDE) (`scipy.stats.gaussian_kde`),
+        and then peaks are then identified using the `scipy.signal.find_peaks` function.
+        For peak calling, the `prominence` parameter is set to 0.1 and the `height` parameter is set to 0.05*max_density.
+      references:
+        doi: 
+          - 10.1038/s41592-019-0686-2
+      links:
+        # URL to the documentation for this metric (required).
+        documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html#scipy.signal.find_peaks
+        # URL to the code repository for this metric (required).
+        repository: https://github.com/scipy/scipy/blob/v1.15.2/scipy/signal/_peak_finding.py#L0-L1
+      # The minimum possible value for this metric (required)
+      min: 0
+      # The maximum possible value for this metric (required)
+      max: +.inf
       # Whether a higher value represents a 'better' solution (required)
       maximize: false