openproblems-bio · ghar1821 · May 6, 2025 · Apr 14, 2025 · Apr 16, 2025 · Apr 16, 2025
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -47,11 +47,16 @@
   * Reverted PR #18 to retain only the 1st two changes (add is_control and remove 
   donor_id from unintegrated_censored).
 
-* Change emd to emd_mean and emd_max (PR #27).
+* Changed emd to emd_mean and emd_max (PR #27).
 
-* Update output anndata for methods to return all vars - corrected or not (PR #28).
+* Updated output anndata for methods to return all vars - corrected or not (PR #28).
 
-* Add positive control (PR #30).
+* Added positive control (PR #30).
+
+* Rewrote EMD metric so it no longer relies on implementation in cytonormpy (PR #48):
+  * EMD is now calculated for all cell types as well.
+  * Returned values are average and max across all cell types (exclude the one calculated agnostic of cell types),
+  average and max across all donors (mean and max of values computed agnostic of cell types).
 
 ## MINOR CHANGES
 
@@ -63,7 +68,7 @@
 
 * Updated scripts to enable running benchmark on seqera (PR #29).
 
-* Update project description (PR #42).
+* Updated project description (PR #42).
 
 * Implemented function for writing FCS files from an anndata object (PR #49)
 

diff --git a/src/metrics/emd/config.vsh.yaml b/src/metrics/emd/config.vsh.yaml
@@ -14,52 +14,171 @@ info:
   metrics:
       # A unique identifier for your metric (required).
       # Can contain only lowercase letters or underscores.
-    - name: emd_mean
+    - name: emd_mean_ct
       # A relatively short label, used when rendering visualisarions (required)
-      label: EMD Mean
+      label: EMD Mean CT
       # A one sentence summary of how this metric works (required). Used when 
       # rendering summary tables.
-      summary: "Mean Earth Mover Distance to compute differences in distribution of marker expressions."
+      summary: "Mean Earth Mover Distance across cell types and markers."
       # A multi-line description of how this component works (required). Used
       # when rendering reference documentation.
       description: |
-        Earth Mover Distance (EMD) is a metric designed for comparing two distributions.
-        It is also known as the Wasserstein metric.
+        Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference 
+        between two probability distributions. 
+
+        Here, EMD is used to compare marker expression distributions between paired samples from the same donor 
+        quantified across two different batches. 
+        For each paired sample, cell type, and marker, the marker expression values are first converted into 
+        probability distributions. 
+        This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
+        The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two 
+        probability distributions belonging to the same cell type, marker, and a given paired samples.
+        This is then repeated for every cell type, marker, and paired sample.
+        Finally, the average of all these EMD values is computed to produce an overall metric score EMD Mean CT.
+
+        A high score indicates large overall differences in the distributions of marker expressions 
+        between the paired samples, suggesting poor batch integration.
+        A low score means the small differences in marker expression distributions between batches, 
+        indicating good batch integration.
+
       references:
         doi: 
           - 10.1023/A:1026543900054
       links:
         # URL to the documentation for this metric (required).
-        documentation: https://cytonormpy.readthedocs.io/en/latest/generated/cytonormpy.emd_comparison_from_anndata.html
+        documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
         # URL to the code repository for this metric (required).
-        repository: https://github.com/TarikExner/CytoNormPy
+        repository: https://github.com/scipy/scipy
       # The minimum possible value for this metric (required)
       min: 0
       # The maximum possible value for this metric (required)
       max: .inf
       # Whether a higher value represents a 'better' solution (required)
       maximize: false
+
       # A unique identifier for your metric (required).
       # Can contain only lowercase letters or underscores.
-    - name: emd_max
+    - name: emd_max_ct
       # A relatively short label, used when rendering visualisarions (required)
-      label: EMD Max
+      label: EMD Max CT
       # A one sentence summary of how this metric works (required). Used when 
       # rendering summary tables.
-      summary: "Max Earth Mover Distance to compute differences in distribution of marker expressions."
+      summary: "Max Earth Mover Distance across cell types and markers."
       # A multi-line description of how this component works (required). Used
       # when rendering reference documentation.
       description: |
-        Earth Mover Distance (EMD) is a metric designed for comparing two distributions.
-        It is also known as the Wasserstein metric.
+        Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference 
+        between two probability distributions. 
+
+        Here, EMD is used to compare marker expression distributions between paired samples from the same donor 
+        quantified across two different batches. 
+        For each paired sample, cell type, and marker, the marker expression values are first converted into 
+        probability distributions. 
+        This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
+        The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two 
+        probability distributions belonging to the same cell type, marker, and a given paired samples.
+        This is then repeated for every cell type, marker, and paired sample.
+        Finally, the maximum of all these EMD values is computed as EMD Max CT.
+
+        EMD Max CT score reflects the largest difference in marker expression distributions across all cell types, 
+        markers, and paired samples.
+        A high score indicates that at least one marker, cell type, or sample pair has a large difference in 
+        distribution after batch integration.
+        A low score means that even the most poorly corrected marker expression is well integrated across batches.    
       references:
         doi: 
           - 10.1023/A:1026543900054
       links:
         # URL to the documentation for this metric (required).
-        documentation: https://cytonormpy.readthedocs.io/en/latest/generated/cytonormpy.emd_comparison_from_anndata.html
+        documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
         # URL to the code repository for this metric (required).
-        repository: https://github.com/TarikExner/CytoNormPy
+        repository: https://github.com/scipy/scipy
+      # The minimum possible value for this metric (required)
+      min: 0
+      # The maximum possible value for this metric (required)
+      max: .inf
+      # Whether a higher value represents a 'better' solution (required)
+      maximize: false
+
+      # A unique identifier for your metric (required).
+      # Can contain only lowercase letters or underscores.
+    - name: emd_mean_global
+      # A relatively short label, used when rendering visualisarions (required)
+      label: EMD Mean Global
+      # A one sentence summary of how this metric works (required). Used when 
+      # rendering summary tables.
+      summary: "Mean Earth Mover Distance across samples and markers."
+      # A multi-line description of how this component works (required). Used
+      # when rendering reference documentation.
+      description: |
+        Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference 
+        between two probability distributions. 
+
+        Here, EMD is used to compare marker expression distributions between paired samples from the same donor 
+        quantified across two different batches. 
+        For each paired sample and marker, the marker expression values are first converted into 
+        probability distributions. 
+        This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
+        The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two 
+        probability distributions belonging to the same cell type, marker, and a given paired samples.
+        This is then repeated for every marker and paired sample.
+        Finally, the average of all these EMD values is computed to produce an overall metric score EMD Mean Global.
+
+        A high score indicates that at least one marker and cell type in a given sample pair has a 
+        large difference in distribution after batch integration.
+        A low score means that the most poorly corrected marker expression is well integrated across batches.   
+      references:
+        doi: 
+          - 10.1023/A:1026543900054
+      links:
+        # URL to the documentation for this metric (required).
+        documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
+        # URL to the code repository for this metric (required).
+        repository: https://github.com/scipy/scipy
+      # The minimum possible value for this metric (required)
+      min: 0
+      # The maximum possible value for this metric (required)
+      max: .inf
+      # Whether a higher value represents a 'better' solution (required)
+      maximize: false
+
+      # A unique identifier for your metric (required).
+      # Can contain only lowercase letters or underscores.
+    - name: emd_max_global
+      # A relatively short label, used when rendering visualisarions (required)
+      label: EMD Max Global
+      # A one sentence summary of how this metric works (required). Used when 
+      # rendering summary tables.
+      summary: "Max Earth Mover Distance across donors and markers."
+      # A multi-line description of how this component works (required). Used
+      # when rendering reference documentation.
+      description: |
+        Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference 
+        between two probability distributions. 
+
+        Here, EMD is used to compare marker expression distributions between paired samples from the same donor 
+        quantified across two different batches. 
+        For each paired sample and marker, the marker expression values are first converted into 
+        probability distributions. 
+        This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
+        The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two 
+        probability distributions belonging to the same cell type, marker, and a given paired samples.
+        This is then repeated for every cell type, marker, and paired sample.
+        Finally, the maximum of all these EMD values is computed as EMD Max Global.
+
+        EMD Max Global score reflects the largest difference in marker expression distributions 
+        across all markers and paired samples.
+        A high score indicates that at least one marker in a given sample pair has a large difference in 
+        distribution after batch integration.
+        A low score means that the most poorly corrected marker expression is well integrated across batches.   
+      references:
+        doi: 
+          - 10.1023/A:1026543900054
+      links:
+        # URL to the documentation for this metric (required).
+        documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
+        # URL to the code repository for this metric (required).
+        repository: https://github.com/scipy/scipy
       # The minimum possible value for this metric (required)
       min: 0
       # The maximum possible value for this metric (required)
@@ -72,7 +191,10 @@ resources:
   # The script of your component (required)
   - type: python_script
     path: script.py
-  - path: /src/utils/helper_functions.py
+  - type: python_script
+    path: helper.py
+  - type: python_script
+    path: /src/utils/helper_functions.py
 
 engines:
   # Specifications for the Docker image for this component.
@@ -82,8 +204,7 @@ engines:
     # https://viash.io/reference/config/engines/docker/#setup .
     setup:
       - type: python
-        packages: [anndata, numpy, pandas]
-        github: [TarikExner/CytoNormPy]
+        packages: [anndata, numpy, pandas, scipy]
 
 runners:
   # This platform allows running the component natively

diff --git a/src/metrics/emd/helper.py b/src/metrics/emd/helper.py
@@ -0,0 +1,81 @@
+import anndata as ad
+import numpy as np
+import pandas as pd
+from scipy.stats import wasserstein_distance
+
+
+def compute_emd(
+    integrated_ct: ad.AnnData, 
+    validation_ct: ad.AnnData, 
+    markers_to_assess: list
+) -> pd.DataFrame:
+    """
+    Calculate EMD metric
+
+    Args:
+        integrated_ct (ad.AnnData): batch integrated data
+        validation_ct (ad.AnnData): validation data
+        markers_to_assess (list): list of markers to compute EMD for
+
+    Returns:
+        pd.DataFrame: 1 row data frame where a column is a marker. Value is the EMD.
+    """
+
+
+    emd_vals = {}
+
+    for marker in markers_to_assess:
+        # marker = markers_to_assess[0]
+
+        mexp_integrated = np.array(integrated_ct[:, marker].layers["integrated"]).flatten()
+        mexp_validation = np.array(validation_ct[:, marker].layers["preprocessed"]).flatten()
+
+        i_values, i_weights = bin_array(mexp_integrated)
+        v_values, v_weights = bin_array(mexp_validation)
+
+        # i_values (and v_values) are the explicit support (set of all possible bin values)
+        # of the probability distribution i_weights (and v_weights).
+        emd = wasserstein_distance(i_values, v_values, i_weights, v_weights)
+        emd_vals[marker] = [emd]
+
+    emd_df = pd.DataFrame.from_dict(emd_vals)
+
+    return emd_df
+
+
+def bin_array(values):
+    """
+    Bin values into probability distribution.
+
+    Args:
+        values (list): values to bin
+
+    Returns:
+        list: Bin indices - centre of the bin.
+        list: Probability distribution of the input values.
+    """
+
+    # 2000 bins, the 0.0000001 is to avoid the left edge being included in the bin 
+    # (Mainly impacting 0 values)
+    # range is set to -100 to 100 with the assumption that the range of values for each marker
+    # will not exceed this
+    bin_edges = np.arange(-100, 100.1, 0.1)+0.0000001
+
+    # using histogram retains the physical meaning of distances between bins,
+    # such that moving mass from bin [-5.0, -4.9) to [-4.9, -4.8) has lower cost than 
+    # moving it to [99.9, 100.0)
+    counts_per_bin, _ = np.histogram(values, bins=bin_edges)
+
+    # this converts distribution of absolute marker values to probability distribution.
+    # it allows subsequent EMD comparison between datasets of different sizes (number of cells). 
+    bin_probabilities = counts_per_bin / np.sum(counts_per_bin)
+
+    # if bin_edges = [0,1,2,3,4,5]
+    # bin_edges[:-1] will give you [0,1,2,3,4]
+    # bin_edges[1:] will give you [1,2,3,4,5]
+    # sum will sum each element up and divide by 2 will give you the centre of the bin 
+    # so for the 1st bin, (0+1)/2 = 0.5
+    bin_indices = (bin_edges[:-1] + bin_edges[1:]) / 2
+
+    # the 1st return value is the bin indices
+    return bin_indices, bin_probabilities