Skip to content

Reimplement emd #48

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
May 6, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 9 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,11 +47,16 @@
* Reverted PR #18 to retain only the 1st two changes (add is_control and remove
donor_id from unintegrated_censored).

* Change emd to emd_mean and emd_max (PR #27).
* Changed emd to emd_mean and emd_max (PR #27).

* Update output anndata for methods to return all vars - corrected or not (PR #28).
* Updated output anndata for methods to return all vars - corrected or not (PR #28).

* Add positive control (PR #30).
* Added positive control (PR #30).

* Rewrote EMD metric so it no longer relies on implementation in cytonormpy (PR #48):
* EMD is now calculated for all cell types as well.
* Returned values are average and max across all cell types (exclude the one calculated agnostic of cell types),
average and max across all donors (mean and max of values computed agnostic of cell types).

## MINOR CHANGES

Expand All @@ -63,7 +68,7 @@

* Updated scripts to enable running benchmark on seqera (PR #29).

* Update project description (PR #42).
* Updated project description (PR #42).

* Implemented function for writing FCS files from an anndata object (PR #49)

Expand Down
155 changes: 138 additions & 17 deletions src/metrics/emd/config.vsh.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -14,52 +14,171 @@ info:
metrics:
# A unique identifier for your metric (required).
# Can contain only lowercase letters or underscores.
- name: emd_mean
- name: emd_mean_ct
# A relatively short label, used when rendering visualisarions (required)
label: EMD Mean
label: EMD Mean CT
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LuLeom I expanded the return values to 4. So now we have mean and max calculated across all cell types, then mean and max across donors (see EMD Mean DN). See comment on the implementation in script.py file. I'll explain there.

# A one sentence summary of how this metric works (required). Used when
# rendering summary tables.
summary: "Mean Earth Mover Distance to compute differences in distribution of marker expressions."
summary: "Mean Earth Mover Distance across cell types and markers."
# A multi-line description of how this component works (required). Used
# when rendering reference documentation.
description: |
Earth Mover Distance (EMD) is a metric designed for comparing two distributions.
It is also known as the Wasserstein metric.
Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference
between two probability distributions.

Here, EMD is used to compare marker expression distributions between paired samples from the same donor
quantified across two different batches.
For each paired sample, cell type, and marker, the marker expression values are first converted into
probability distributions.
This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two
probability distributions belonging to the same cell type, marker, and a given paired samples.
This is then repeated for every cell type, marker, and paired sample.
Finally, the average of all these EMD values is computed to produce an overall metric score EMD Mean CT.

A high score indicates large overall differences in the distributions of marker expressions
between the paired samples, suggesting poor batch integration.
A low score means the small differences in marker expression distributions between batches,
indicating good batch integration.

references:
doi:
- 10.1023/A:1026543900054
links:
# URL to the documentation for this metric (required).
documentation: https://cytonormpy.readthedocs.io/en/latest/generated/cytonormpy.emd_comparison_from_anndata.html
documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
# URL to the code repository for this metric (required).
repository: https://github.com/TarikExner/CytoNormPy
repository: https://github.com/scipy/scipy
# The minimum possible value for this metric (required)
min: 0
# The maximum possible value for this metric (required)
max: .inf
# Whether a higher value represents a 'better' solution (required)
maximize: false

# A unique identifier for your metric (required).
# Can contain only lowercase letters or underscores.
- name: emd_max
- name: emd_max_ct
# A relatively short label, used when rendering visualisarions (required)
label: EMD Max
label: EMD Max CT
# A one sentence summary of how this metric works (required). Used when
# rendering summary tables.
summary: "Max Earth Mover Distance to compute differences in distribution of marker expressions."
summary: "Max Earth Mover Distance across cell types and markers."
# A multi-line description of how this component works (required). Used
# when rendering reference documentation.
description: |
Earth Mover Distance (EMD) is a metric designed for comparing two distributions.
It is also known as the Wasserstein metric.
Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference
between two probability distributions.

Here, EMD is used to compare marker expression distributions between paired samples from the same donor
quantified across two different batches.
For each paired sample, cell type, and marker, the marker expression values are first converted into
probability distributions.
This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two
probability distributions belonging to the same cell type, marker, and a given paired samples.
This is then repeated for every cell type, marker, and paired sample.
Finally, the maximum of all these EMD values is computed as EMD Max CT.

EMD Max CT score reflects the largest difference in marker expression distributions across all cell types,
markers, and paired samples.
A high score indicates that at least one marker, cell type, or sample pair has a large difference in
distribution after batch integration.
A low score means that even the most poorly corrected marker expression is well integrated across batches.
references:
doi:
- 10.1023/A:1026543900054
links:
# URL to the documentation for this metric (required).
documentation: https://cytonormpy.readthedocs.io/en/latest/generated/cytonormpy.emd_comparison_from_anndata.html
documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
# URL to the code repository for this metric (required).
repository: https://github.com/TarikExner/CytoNormPy
repository: https://github.com/scipy/scipy
# The minimum possible value for this metric (required)
min: 0
# The maximum possible value for this metric (required)
max: .inf
# Whether a higher value represents a 'better' solution (required)
maximize: false

# A unique identifier for your metric (required).
# Can contain only lowercase letters or underscores.
- name: emd_mean_global
# A relatively short label, used when rendering visualisarions (required)
label: EMD Mean Global
# A one sentence summary of how this metric works (required). Used when
# rendering summary tables.
summary: "Mean Earth Mover Distance across samples and markers."
# A multi-line description of how this component works (required). Used
# when rendering reference documentation.
description: |
Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference
between two probability distributions.

Here, EMD is used to compare marker expression distributions between paired samples from the same donor
quantified across two different batches.
For each paired sample and marker, the marker expression values are first converted into
probability distributions.
This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two
probability distributions belonging to the same cell type, marker, and a given paired samples.
This is then repeated for every marker and paired sample.
Finally, the average of all these EMD values is computed to produce an overall metric score EMD Mean Global.

A high score indicates that at least one marker and cell type in a given sample pair has a
large difference in distribution after batch integration.
A low score means that the most poorly corrected marker expression is well integrated across batches.
references:
doi:
- 10.1023/A:1026543900054
links:
# URL to the documentation for this metric (required).
documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
# URL to the code repository for this metric (required).
repository: https://github.com/scipy/scipy
# The minimum possible value for this metric (required)
min: 0
# The maximum possible value for this metric (required)
max: .inf
# Whether a higher value represents a 'better' solution (required)
maximize: false

# A unique identifier for your metric (required).
# Can contain only lowercase letters or underscores.
- name: emd_max_global
# A relatively short label, used when rendering visualisarions (required)
label: EMD Max Global
# A one sentence summary of how this metric works (required). Used when
# rendering summary tables.
summary: "Max Earth Mover Distance across donors and markers."
# A multi-line description of how this component works (required). Used
# when rendering reference documentation.
description: |
Earth Mover Distance (EMD), also known as the Wasserstein metric, measures the difference
between two probability distributions.

Here, EMD is used to compare marker expression distributions between paired samples from the same donor
quantified across two different batches.
For each paired sample and marker, the marker expression values are first converted into
probability distributions.
This is done by binning the expression values into a range from -100 to 100 with a bin width of 0.1.
The `wasserstein_distance` function from SciPy is then used to calculate the EMD between the two
probability distributions belonging to the same cell type, marker, and a given paired samples.
This is then repeated for every cell type, marker, and paired sample.
Finally, the maximum of all these EMD values is computed as EMD Max Global.

EMD Max Global score reflects the largest difference in marker expression distributions
across all markers and paired samples.
A high score indicates that at least one marker in a given sample pair has a large difference in
distribution after batch integration.
A low score means that the most poorly corrected marker expression is well integrated across batches.
references:
doi:
- 10.1023/A:1026543900054
links:
# URL to the documentation for this metric (required).
documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
# URL to the code repository for this metric (required).
repository: https://github.com/scipy/scipy
# The minimum possible value for this metric (required)
min: 0
# The maximum possible value for this metric (required)
Expand All @@ -72,7 +191,10 @@ resources:
# The script of your component (required)
- type: python_script
path: script.py
- path: /src/utils/helper_functions.py
- type: python_script
path: helper.py
- type: python_script
path: /src/utils/helper_functions.py

engines:
# Specifications for the Docker image for this component.
Expand All @@ -82,8 +204,7 @@ engines:
# https://viash.io/reference/config/engines/docker/#setup .
setup:
- type: python
packages: [anndata, numpy, pandas]
github: [TarikExner/CytoNormPy]
packages: [anndata, numpy, pandas, scipy]

runners:
# This platform allows running the component natively
Expand Down
81 changes: 81 additions & 0 deletions src/metrics/emd/helper.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
import anndata as ad
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance


def compute_emd(
integrated_ct: ad.AnnData,
validation_ct: ad.AnnData,
markers_to_assess: list
) -> pd.DataFrame:
"""
Calculate EMD metric

Args:
integrated_ct (ad.AnnData): batch integrated data
validation_ct (ad.AnnData): validation data
markers_to_assess (list): list of markers to compute EMD for

Returns:
pd.DataFrame: 1 row data frame where a column is a marker. Value is the EMD.
"""


emd_vals = {}

for marker in markers_to_assess:
# marker = markers_to_assess[0]

mexp_integrated = np.array(integrated_ct[:, marker].layers["integrated"]).flatten()
mexp_validation = np.array(validation_ct[:, marker].layers["preprocessed"]).flatten()

i_values, i_weights = bin_array(mexp_integrated)
v_values, v_weights = bin_array(mexp_validation)

# i_values (and v_values) are the explicit support (set of all possible bin values)
# of the probability distribution i_weights (and v_weights).
emd = wasserstein_distance(i_values, v_values, i_weights, v_weights)
emd_vals[marker] = [emd]

emd_df = pd.DataFrame.from_dict(emd_vals)

return emd_df


def bin_array(values):
"""
Bin values into probability distribution.

Args:
values (list): values to bin

Returns:
list: Bin indices - centre of the bin.
list: Probability distribution of the input values.
"""

# 2000 bins, the 0.0000001 is to avoid the left edge being included in the bin
# (Mainly impacting 0 values)
# range is set to -100 to 100 with the assumption that the range of values for each marker
# will not exceed this
bin_edges = np.arange(-100, 100.1, 0.1)+0.0000001

# using histogram retains the physical meaning of distances between bins,
# such that moving mass from bin [-5.0, -4.9) to [-4.9, -4.8) has lower cost than
# moving it to [99.9, 100.0)
counts_per_bin, _ = np.histogram(values, bins=bin_edges)

# this converts distribution of absolute marker values to probability distribution.
# it allows subsequent EMD comparison between datasets of different sizes (number of cells).
bin_probabilities = counts_per_bin / np.sum(counts_per_bin)

# if bin_edges = [0,1,2,3,4,5]
# bin_edges[:-1] will give you [0,1,2,3,4]
# bin_edges[1:] will give you [1,2,3,4,5]
# sum will sum each element up and divide by 2 will give you the centre of the bin
# so for the 1st bin, (0+1)/2 = 0.5
bin_indices = (bin_edges[:-1] + bin_edges[1:]) / 2

# the 1st return value is the bin indices
return bin_indices, bin_probabilities
Loading