Skip to content

Commit 7e4d3b5

Browse files
mumichaelazappircannood
authored
Add pegasus implementations of kBET (#53)
* add global kbet metric from pegasus implementation * add per label averaged kBET implementation from pegasus * fixed label kbet naming * add new kbet components to workflow * update CHANGELOG * update documentation on running the workflow * Apply suggestions from code review Co-authored-by: Luke Zappia <lazappi@users.noreply.github.com> * move changelog message to devel section * set to maximum available number of threads * Apply suggestions from code review Co-authored-by: Robrecht Cannoodt <rcannood@gmail.com> * Apply suggestions from code review --------- Co-authored-by: Luke Zappia <lazappi@users.noreply.github.com> Co-authored-by: Robrecht Cannoodt <rcannood@gmail.com>
1 parent a75dc6e commit 7e4d3b5

File tree

8 files changed

+258
-2
lines changed

8 files changed

+258
-2
lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,9 @@
11
# task_batch_integration devel
22

3+
## New functionality
4+
5+
* Added `metrics/kbet_pg` and `metrics/kbet_pg_label` components (PR #52).
6+
37
## Minor changes
48

59
* Un-pin the scPRINT version and update parameters (PR #51)

CONTRIBUTING.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -141,12 +141,14 @@ viash ns test --parallel
141141

142142
### Running the benchmark
143143

144-
To run the benchmark, you can use the following command:
144+
To locally test the benchmark workflow, you can use the following command:
145145

146146
```bash
147-
scripts/run_benchmark/run.sh
147+
scripts/run_benchmark/run_test_local.sh
148148
```
149149

150+
Other scripts in `scripts/run_benchmark/` provide commands for testing on Seqera Cloud and for production runs.
151+
150152
## Debugging nf-tower runs
151153

152154
The actual benchmark is run on the [nf-tower platform](https://cloud.seqera.io/orgs/openproblems-bio/workspaces/openproblems-bio/watch).

src/metrics/kbet_pg/config.vsh.yaml

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
__merge__: /src/api/comp_metric.yaml
2+
name: kbet_pg
3+
info:
4+
metric_type: embedding
5+
metrics:
6+
- name: kbet_pg
7+
label: kBET pegasus
8+
summary: kBET algorithm to determine how well batches are mixed within a cell
9+
type
10+
# TODO: transform into more readable markdown with proper formulae formatting
11+
description: |
12+
The kBET algorithm (v.0.99.6, release 4c9dafa) determines whether the label composition
13+
of a k nearest neighborhood of a cell is similar to the expected (global) label
14+
composition (Buettner et al., Nat Meth 2019). The test is repeated for a random subset
15+
of cells, and the results are summarized as a rejection rate over all tested
16+
neighborhoods.
17+
18+
This implementation uses the `pegasus.calc_kBET` function.
19+
20+
In Open Problems we do not run kBET on graph outputs to avoid computation-intensive
21+
diffusion processes being run.
22+
references:
23+
doi: 10.1038/s41592-021-01336-8
24+
links:
25+
homepage: https://pegasus.readthedocs.io/en/stable/
26+
documentation: https://pegasus.readthedocs.io/en/stable/api/pegasus.calc_kBET.html
27+
repository: https://github.com/lilab-bcb/pegasus
28+
min: 0
29+
max: 1
30+
maximize: true
31+
resources:
32+
- type: python_script
33+
path: script.py
34+
- path: /src/utils/read_anndata_partial.py
35+
engines:
36+
- type: docker
37+
image: openproblems/base_python:1.0.0
38+
setup:
39+
- type: python
40+
pypi:
41+
- pegasuspy
42+
- zarr<3.0
43+
- pandas<2.0
44+
- numpy<2.0
45+
runners:
46+
- type: executable
47+
- type: nextflow
48+
directives:
49+
label: [hightime, veryhighmem, lowcpu]

src/metrics/kbet_pg/script.py

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
from tqdm import tqdm
2+
import sys
3+
import numpy as np
4+
import anndata as ad
5+
import pegasus as pg
6+
import pegasusio
7+
from scipy.sparse import csr_matrix
8+
9+
10+
def compute_kbet(mmdata, *args, **kwargs):
11+
stat_mean, pvalue_mean, accept_rate = pg.calc_kBET(mmdata, *args, **kwargs)
12+
return accept_rate
13+
14+
15+
## VIASH START
16+
par = {
17+
'input_integrated': 'resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated_full.h5ad',
18+
'output': 'output.h5ad',
19+
}
20+
21+
meta = {
22+
'name': 'kbet_pg',
23+
}
24+
## VIASH END
25+
26+
sys.path.append(meta["resources_dir"])
27+
from read_anndata_partial import read_anndata
28+
29+
n_threads = meta["cpus"] or -1
30+
31+
print('Read input...', flush=True)
32+
adata = read_anndata(par['input_integrated'], obs='obs', obsm='obsm', uns='uns')
33+
adata.obs = read_anndata(par['input_solution'], obs='obs').obs
34+
adata.uns |= read_anndata(par['input_solution'], uns='uns').uns
35+
print(adata, flush=True)
36+
37+
print('Convert to pegasusio.MultimodalData...', flush=True)
38+
adata.X = csr_matrix(adata.shape)
39+
mmdata = pegasusio.MultimodalData(adata)
40+
41+
print('Compute global kBET...', flush=True)
42+
score = compute_kbet(
43+
mmdata,
44+
attr="batch",
45+
rep="emb",
46+
K=50,
47+
n_jobs=n_threads,
48+
)
49+
print('Global kBET score:', score, flush=True)
50+
51+
print('Create output AnnData object', flush=True)
52+
metric_name = meta['name']
53+
output = ad.AnnData(
54+
uns={
55+
'dataset_id': adata.uns['dataset_id'],
56+
'normalization_id': adata.uns['normalization_id'],
57+
'method_id': adata.uns['method_id'],
58+
'metric_ids': [ metric_name ],
59+
'metric_values': [ score ]
60+
}
61+
)
62+
63+
print('Write data to file', flush=True)
64+
output.write_h5ad(par['output'], compression='gzip')
Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
__merge__: /src/api/comp_metric.yaml
2+
name: kbet_pg_label
3+
info:
4+
metric_type: embedding
5+
metrics:
6+
- name: kbet_pg_label
7+
label: kBET pegasus label
8+
summary: kBET algorithm to determine how well batches are mixed within a cell
9+
type
10+
description: |
11+
The kBET algorithm (v.0.99.6, release 4c9dafa) determines whether the label composition
12+
of a k nearest neighborhood of a cell is similar to the expected (global) label
13+
composition (Buettner et al., Nat Meth 2019). The test is repeated for a random subset
14+
of cells, and the results are summarized as a rejection rate over all tested
15+
neighborhoods.
16+
17+
This implementation uses the `pegasus.calc_kBET` function per cell type and the rejection
18+
rate is averaged over all cell types.
19+
20+
In Open Problems we do not run kBET on graph outputs to avoid computation-intensive
21+
diffusion processes being run.
22+
references:
23+
doi: 10.1038/s41592-021-01336-8
24+
links:
25+
homepage: https://pegasus.readthedocs.io/en/stable/
26+
documentation: https://pegasus.readthedocs.io/en/stable/api/pegasus.calc_kBET.html
27+
repository: https://github.com/lilab-bcb/pegasus
28+
min: 0
29+
max: 1
30+
maximize: true
31+
resources:
32+
- type: python_script
33+
path: script.py
34+
- path: /src/utils/read_anndata_partial.py
35+
engines:
36+
- type: docker
37+
image: openproblems/base_python:1.0.0
38+
setup:
39+
- type: python
40+
pypi:
41+
- pegasuspy
42+
- zarr<3.0
43+
- pandas<2.0
44+
- numpy<2.0
45+
runners:
46+
- type: executable
47+
- type: nextflow
48+
directives:
49+
label: [hightime, veryhighmem, lowcpu]

src/metrics/kbet_pg_label/script.py

Lines changed: 84 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,84 @@
1+
from tqdm import tqdm
2+
import sys
3+
import numpy as np
4+
import anndata as ad
5+
import pegasus as pg
6+
import pegasusio
7+
from scipy.sparse import csr_matrix
8+
from joblib import Parallel, delayed
9+
10+
11+
def compute_kbet(mmdata, *args, **kwargs):
12+
stat_mean, pvalue_mean, accept_rate = pg.calc_kBET(mmdata, *args, **kwargs)
13+
return accept_rate
14+
15+
16+
## VIASH START
17+
par = {
18+
'input_integrated': 'resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated_full.h5ad',
19+
'output': 'output.h5ad',
20+
}
21+
22+
meta = {
23+
'name': 'foo',
24+
}
25+
## VIASH END
26+
27+
sys.path.append(meta["resources_dir"])
28+
from read_anndata_partial import read_anndata
29+
30+
n_threads = meta["cpus"] or -1
31+
32+
print('Read input...', flush=True)
33+
adata = read_anndata(par['input_integrated'], obs='obs', obsm='obsm', uns='uns')
34+
adata.obs = read_anndata(par['input_solution'], obs='obs').obs
35+
adata.uns |= read_anndata(par['input_solution'], uns='uns').uns
36+
adata.X = csr_matrix(adata.shape)
37+
print(adata, flush=True)
38+
39+
print('Compute cell type averaged kBET...', flush=True)
40+
cell_types = adata.obs['cell_type'].unique()
41+
scores = []
42+
mmdata_per_cell_type = []
43+
44+
# collect all adata subsets
45+
for cell_type in cell_types:
46+
ad_sub = adata[adata.obs['cell_type'] == cell_type]
47+
if ad_sub.obs['batch'].nunique() <= 1:
48+
print(f'Skipping cell type {cell_type} because it\'s present in only one batch', flush=True)
49+
continue
50+
_mmdata = pegasusio.MultimodalData(ad_sub.copy())
51+
mmdata_per_cell_type.append(_mmdata)
52+
53+
# compute kBET scores
54+
scores = Parallel(n_jobs=n_threads)(
55+
delayed(compute_kbet)(
56+
_mmdata,
57+
attr="batch",
58+
rep="emb",
59+
K=50,
60+
use_cache=False,
61+
n_jobs=1,
62+
) for _mmdata in tqdm(
63+
mmdata_per_cell_type,
64+
desc=f'Compute per cell type with {n_threads} threads',
65+
miniters=1,
66+
)
67+
)
68+
score = np.nanmean(scores)
69+
print('Cell type averaged kBET score:', score, flush=True)
70+
71+
print('Create output AnnData object', flush=True)
72+
metric_name = meta['name']
73+
output = ad.AnnData(
74+
uns={
75+
'dataset_id': adata.uns['dataset_id'],
76+
'normalization_id': adata.uns['normalization_id'],
77+
'method_id': adata.uns['method_id'],
78+
'metric_ids': [ metric_name ],
79+
'metric_values': [ score ]
80+
}
81+
)
82+
83+
print('Write data to file', flush=True)
84+
output.write_h5ad(par['output'], compression='gzip')

src/workflows/run_benchmark/config.vsh.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -117,6 +117,8 @@ dependencies:
117117
- name: metrics/isolated_label_asw
118118
- name: metrics/isolated_label_f1
119119
- name: metrics/kbet
120+
- name: metrics/kbet_pg
121+
- name: metrics/kbet_pg_label
120122
- name: metrics/lisi
121123
- name: metrics/pcr
122124
# data processors

src/workflows/run_benchmark/main.nf

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,6 +56,8 @@ metrics = [
5656
isolated_label_asw,
5757
isolated_label_f1,
5858
kbet,
59+
kbet_pg,
60+
kbet_pg_label,
5961
lisi,
6062
pcr
6163
]

0 commit comments

Comments
 (0)