Commit 0a092ee

Merge pull request #1 from KrishnaswamyLab/dev
Changed fit and transform functions in multiscale_phate.py
2 parents fa82023 + 3e2daa6

11 files changed (+1669, -115 lines)

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ script:
 - nose2
 deploy:
   provider: pypi
-  user: scottgigante
+  user: mkuchroo
   password: ${PYPI_PASSWORD}
   distributions: sdist bdist_wheel
   skip_existing: true

README.md

Lines changed: 15 additions & 16 deletions
@@ -1,4 +1,4 @@
-Multiscale_PHATE
+Multiscale PHATE
 ================
 
 [![Latest PyPi version](https://img.shields.io/pypi/v/multiscale_phate.svg)](https://pypi.org/project/multiscale_phate/)
@@ -8,36 +8,35 @@ Multiscale_PHATE
 [![GitHub stars](https://img.shields.io/github/stars/KrishnaswamyLab/Multiscale_PHATE.svg?style=social&label=Stars)](https://github.com/KrishnaswamyLab/Multiscale_PHATE/)
 [![Code style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 
-This is a short description of the package.
+Multiscale PHATE is a Python package for multiresolution analysis of high dimensional data. For an in-depth explanation of the algorithm and applications, please read our manuscript on [BioRxiv](https://www.biorxiv.org/content/10.1101/2020.11.15.383661v1.article-info).
+
+The biomedical community is producing increasingly high dimensional datasets integrated from hundreds of patient samples that current computational techniques are unable to explore. Current tools for dimensionality reduction, such as tSNE, UMAP, and PCA, and clustering, such as Louvain and Leiden, only show a single salient level of granularity in biomedical data. When applied to cellular datasets currently being produced, these techniques are able to visualize and cluster major cell types such as B cells, T cells and myeloid cells. Differences between patient disease states, however, may not be found at the granularity of cell type alone. In fact, appreciation of a finer resolution of the manifold would reveal subsets that may be predictive of outcome. This phenomenon is found across biomedical data science, as the cellular state space is known to form a collection of sub-manifolds that disease status can differentially affect.
+
+The goal of Multiscale PHATE is to learn and visualize abstract cellular features and groupings of the data at all levels of granularity in an efficient manner to identify meaningful resolutions. Our approach learns a tree of data granularities which can be cut at coarse levels for high-level summarizations of data as well as at fine levels for detailed representations on subsets. Our algorithm is based on a dynamic process we have developed called diffusion condensation, which computes a manifold-intrinsic diffusion space on the original data before slowly condensing data points towards local centers of gravity to form natural, data-driven groupings across multiple granularities. While this may sound computationally inefficient, we show that we are able to perform these calculations, as well as visualize and cluster the data, significantly faster than “single-scale” visualization techniques like tSNE, UMAP or PHATE, allowing the analysis of millions of cells within minutes. When combined with other computational algorithms for high dimensional data analysis, such as MELD, DREMI and TrajectoryNet, Multiscale PHATE is able to provide deep and detailed insights into biological processes.
 
 Installation
 ------------
 
-Multiscale_PHATE is available on `pip`. Install by running the following in a terminal:
+Multiscale PHATE is available on `pip`. Install by running the following in a terminal:
 
 ```
 pip install --user git+https://github.com/KrishnaswamyLab/Multiscale_PHATE
 ```
 
-Quick start
+Quick Start
 -----------
 
 ```
-import numpy as np
-X = np.random.normal(0, 1, (100, 10))
-
 import multiscale_phate
 mp_op = multiscale_phate.Multiscale_PHATE()
-hp_embedding, cluster_viz, sizes_viz, tree = mp_op.fit_transform(X)
+mp_embedding, mp_clusters, mp_sizes, tree = mp_op.fit_transform(X)
 
 # Plot optimal visualization
-scprep.plot.scatter2d(hp_embedding, s = sizes_viz, c = cluster_viz,
-                      fontsize=16, ticks=False,label_prefix="Multiscale-PHATE", figsize=(16,12))
+scprep.plot.scatter2d(mp_embedding, s = mp_sizes, c = mp_clusters,
+                      fontsize=16, ticks=False,label_prefix="Multiscale PHATE", figsize=(16,12))
+```
 
-# Plot condensation tree
-scprep.plot.scatter3d(tree, c=tree[:,2],fontsize=16, ticks=False, label_prefix="C-PHATE", figsize=(16,12), s=20)
+Guided Tutorial
+-----------
 
-# Embed online data
-Y = np.random.normal(0.5, 1, (50, 10))
-hp_embedding, cluster_viz, sizes_viz, tree = mp_op.transform(Y)
-```
+For more details on using Multiscale PHATE, see our [guided tutorial](tutorial/10X_pbmc.ipynb) using 10X's public PBMC4k dataset.
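
Note that the revised Quick Start references a data matrix `X` and the `scprep` plotting package without defining or importing them (the old snippet's `numpy` setup lines were removed in this hunk). A minimal self-contained version, reusing the removed random-Gaussian stand-in data in place of a real dataset:

```python
# Self-contained Quick Start: supplies the X and scprep import the snippet assumes.
import numpy as np
import scprep
import multiscale_phate

# Stand-in data, 100 observations x 10 features (replace with real data).
X = np.random.normal(0, 1, (100, 10))

mp_op = multiscale_phate.Multiscale_PHATE()
mp_embedding, mp_clusters, mp_sizes, tree = mp_op.fit_transform(X)

# Plot the visualization at the automatically selected resolution.
scprep.plot.scatter2d(
    mp_embedding, s=mp_sizes, c=mp_clusters,
    fontsize=16, ticks=False,
    label_prefix="Multiscale PHATE", figsize=(16, 12),
)
```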

multiscale_phate/compress.py

Lines changed: 15 additions & 3 deletions
@@ -33,9 +33,10 @@ def get_compression_features(N, features, n_pca, partitions, landmarks):
     if n_pca > 100:
         n_pca = 100
 
+    n_pca = 100
+
     # if N<100000:
     #     partitions=None
-
     if partitions != None and partitions >= N:
         partitions = None
 
@@ -47,7 +48,7 @@
     return n_pca, partitions
 
 
-def cluster_components(data_subset, num_cluster, size):
+def cluster_components(data_subset, num_cluster, size, random_state=None):
     """Short summary.
 
     Parameters
@@ -58,6 +59,10 @@ def cluster_components(data_subset, num_cluster, size):
         Description of parameter `num_cluster`.
     size : type
         Description of parameter `size`.
+    random_state : integer or numpy.RandomState, optional, default: None
+        The generator used to initialize MiniBatchKMeans.
+        If an integer is given, it fixes the seed.
+        Defaults to the global `numpy` random number generator
 
     Returns
     -------
@@ -80,11 +85,12 @@
         n_init=10,
         max_no_improvement=10,
         verbose=0,
+        random_state=random_state,
     ).fit(data_subset)
     return mbk.labels_
 
 
-def subset_data(data, desired_num_clusters, n_jobs, num_cluster=100):
+def subset_data(data, desired_num_clusters, n_jobs, num_cluster=100, random_state=None):
     """Short summary.
 
     Parameters
@@ -97,6 +103,10 @@ def subset_data(data, desired_num_clusters, n_jobs, num_cluster=100):
         Description of parameter `n_jobs`.
     num_cluster : type
         Description of parameter `num_cluster`.
+    random_state : integer or numpy.RandomState, optional, default: None
+        The generator used to initialize MiniBatchKMeans.
+        If an integer is given, it fixes the seed.
+        Defaults to the global `numpy` random number generator
 
     Returns
     -------
@@ -115,6 +125,7 @@ def subset_data(data, desired_num_clusters, n_jobs, num_cluster=100):
         n_init=10,
         max_no_improvement=10,
         verbose=0,
+        random_state=random_state,
     ).fit(data)
 
     clusters = mbk.labels_
@@ -128,6 +139,7 @@
             data[np.where(clusters == clusters_unique[i])[0], :],
             num_cluster,
             size,
+            random_state=random_state,
         )
         for i in range(len(clusters_unique))
     )
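
Threading `random_state` through `cluster_components` and `subset_data` exists to make the MiniBatchKMeans partitioning reproducible. A minimal sketch of the effect, using only the keyword arguments visible in the hunks above plus an assumed `n_clusters` (the full call sites are not shown in this diff):

```python
# Sketch: with a fixed random_state, MiniBatchKMeans partitions identically
# across runs; with None it follows the global numpy random generator.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

data = np.random.normal(0, 1, (1000, 25))

def partition(data, random_state=None):
    mbk = MiniBatchKMeans(
        n_clusters=100,  # assumed value; not visible in the hunks above
        n_init=10,
        max_no_improvement=10,
        verbose=0,
        random_state=random_state,  # the argument this commit threads through
    ).fit(data)
    return mbk.labels_

# Same seed, same partition -- the reproducibility the commit is after.
assert np.array_equal(partition(data, 42), partition(data, 42))
```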

multiscale_phate/condense.py

Lines changed: 10 additions & 2 deletions
@@ -85,7 +85,7 @@ def compute_condensation_param(X, granularity):
     return epsilon, merge_threshold
 
 
-def condense(X, clusters, scale, epsilon, merge_threshold, n_jobs):
+def condense(X, clusters, scale, epsilon, merge_threshold, n_jobs, random_state=None):
     """Short summary.
 
     Parameters
@@ -102,6 +102,10 @@ def condense(X, clusters, scale, epsilon, merge_threshold, n_jobs):
         Description of parameter `merge_threshold`.
     n_jobs : type
         Description of parameter `n_jobs`.
+    random_state : integer or numpy.RandomState, optional, default: None
+        The generator used to initialize graphtools.
+        If an integer is given, it fixes the seed.
+        Defaults to the global `numpy` random number generator
 
     Returns
     -------
@@ -141,7 +145,11 @@ def condense(X, clusters, scale, epsilon, merge_threshold, n_jobs):
     while len(merge_pairs) == 0:
         epsilon = scale * epsilon
         G = graphtools.Graph(
-            X_1, knn=min(X_1.shape[0] - 2, 5), bandwidth=epsilon, n_jobs=n_jobs
+            X_1,
+            knn=min(X_1.shape[0] - 2, 5),
+            bandwidth=epsilon,
+            n_jobs=n_jobs,
+            random_state=random_state,
         )
 
         P_s = G.P.toarray()
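
For context on what `condense` is doing when it builds the now-seeded `graphtools.Graph`: diffusion condensation repeatedly averages each point with its neighbors under a bandwidth-`epsilon` diffusion operator, growing `epsilon` whenever no pair of points falls within `merge_threshold`. A numpy-only illustration of one such step; the package's actual kernel is the kNN-limited graphtools construction shown above, not this dense Gaussian one:

```python
# Illustrative diffusion-condensation step (not the package's exact kernel).
import numpy as np

def condense_step(X, epsilon):
    # Dense Gaussian affinities at bandwidth epsilon.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / epsilon**2)
    # Row-stochastic diffusion operator, analogous to G.P in the diff.
    P = K / K.sum(axis=1, keepdims=True)
    # Each point moves toward its local center of gravity.
    return P @ X

X = np.random.normal(0, 1, (50, 3))
for _ in range(10):
    X = condense_step(X, epsilon=0.5)
# condense() then merges point pairs closer than merge_threshold; if none
# exist, it grows the bandwidth (epsilon = scale * epsilon) and retries.
```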

multiscale_phate/diffuse.py

Lines changed: 10 additions & 3 deletions
@@ -6,7 +6,9 @@
 from . import compress
 
 
-def compute_diffusion_potential(data, N, decay, gamma, knn, landmarks=2000, n_jobs=10):
+def compute_diffusion_potential(
+    data, N, decay, gamma, knn, landmarks=2000, n_jobs=10, random_state=None
+):
     """Short summary.
 
     Parameters
@@ -25,6 +27,10 @@ def compute_diffusion_potential(data, N, decay, gamma, knn, landmarks=2000, n_jo
         Description of parameter `landmarks`.
     n_jobs : type
         Description of parameter `n_jobs`.
+    random_state : integer or numpy.RandomState, optional, default: None
+        The generator used to initialize PHATE and PCA.
+        If an integer is given, it fixes the seed.
+        Defaults to the global `numpy` random number generator
 
     Returns
     -------
@@ -40,15 +46,16 @@ def compute_diffusion_potential(data, N, decay, gamma, knn, landmarks=2000, n_jo
     diff_op = phate.PHATE(
         verbose=False,
         n_landmark=landmarks,
-        n_pca=None,
         decay=decay,
         gamma=gamma,
+        n_pca=None,
         knn=knn,
         n_jobs=n_jobs,
+        random_state=random_state,
     )
     diff_op.fit(data)
 
-    pca = sklearn.decomposition.PCA(n_components=25)
+    pca = sklearn.decomposition.PCA(n_components=25, random_state=random_state)
     diff_potential_pca = pca.fit_transform(diff_op.diff_potential)
 
     return (
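
Taken together, these hunks make the whole diffusion-potential pipeline seedable: the same `random_state` now reaches both PHATE and the follow-up PCA. A sketch of an equivalent seeded run on synthetic data; the `decay`, `gamma`, and `knn` values are assumptions, since they arrive as function arguments not shown in this diff:

```python
import numpy as np
import phate
import sklearn.decomposition

data = np.random.normal(0, 1, (500, 50))
random_state = 42

# PHATE computes the diffusion potential; random_state now seeds it.
diff_op = phate.PHATE(
    verbose=False,
    n_landmark=2000,
    decay=40,   # assumed value
    gamma=1,    # assumed value
    n_pca=None,
    knn=5,      # assumed value
    n_jobs=1,
    random_state=random_state,
)
diff_op.fit(data)

# The potential is then compressed to 25 dimensions with a seeded PCA,
# mirroring the last hunk above.
pca = sklearn.decomposition.PCA(n_components=25, random_state=random_state)
diff_potential_pca = pca.fit_transform(diff_op.diff_potential)
```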

multiscale_phate/embed.py

Lines changed: 72 additions & 4 deletions
@@ -1,5 +1,6 @@
 import numpy as np
 import phate
+import tasklogger
 
 
 def repulsion(temp):
@@ -68,6 +69,7 @@ def compute_gradient(Xs, merges):
         Description of returned object.
 
     """
+    tasklogger.log_info("Computing gradient...")
     gradient = []
     m = 0
     X = Xs[0]
@@ -86,6 +88,65 @@
     return np.array(gradient)
 
 
+def get_levels(grad):
+    """Short summary.
+
+    Parameters
+    ----------
+    grad : type
+        Description of parameter `grad`.
+
+    Returns
+    -------
+    type
+        Description of returned object.
+
+
+    """
+    tasklogger.log_info("Identifying salient levels of resolution...")
+    minimum = np.max(grad)
+    levels = []
+    levels.append(0)
+
+    for i in range(1, len(grad) - 1):
+        if grad[i] <= minimum and grad[i] < grad[i + 1]:
+            levels.append(i)
+            minimum = grad[i]
+    return levels
+
+
+def get_zoom_visualization(
+    Xs,
+    NxTs,
+    zoom_visualization_level,
+    zoom_cluster_level,
+    coarse_cluster_level,
+    coarse_cluster,
+    n_jobs,
+    random_state=None,
+):
+    """Short summary.
+
+    Parameters
+    ----------
+    random_state : integer or numpy.RandomState, optional, default: None
+        The generator used to initialize MDS.
+        If an integer is given, it fixes the seed.
+        Defaults to the global `numpy` random number generator
+    """
+    unique = np.unique(
+        NxTs[zoom_visualization_level], return_index=True, return_counts=True
+    )
+    extract = NxTs[coarse_cluster_level][unique[1]] == coarse_cluster
+
+    subset_X = Xs[zoom_visualization_level]
+    embedding = phate.mds.embed_MDS(subset_X[extract], n_jobs=n_jobs, seed=random_state)
+
+    return embedding, NxTs[zoom_cluster_level][unique[1]][extract], unique[2][extract]
+
+
 def compute_ideal_visualization_layer(gradient, Xs, min_cells=100):
     """Short summary.
 
@@ -117,9 +178,12 @@ def compute_ideal_visualization_layer(gradient, Xs, min_cells=100):
     return min_layer
 
 
-def get_clusters_sizes_2(clusters_full, layer, NxT, X, repulse=False, n_jobs=10):
+def get_clusters_sizes_2(
+    clusters_full, layer, NxT, X, repulse=False, n_jobs=10, random_state=None
+):
     """Short summary.
 
+    Parameters
     Parameters
     ----------
     clusters_full : type
@@ -134,6 +198,10 @@ def get_clusters_sizes_2(clusters_full, layer, NxT, X, repulse=False, n_jobs=10)
         Description of parameter `repulse`.
     n_jobs : type
         Description of parameter `n_jobs`.
+    random_state : integer or numpy.RandomState, optional, default: None
+        The generator used to initialize MDS.
+        If an integer is given, it fixes the seed.
+        Defaults to the global `numpy` random number generator
 
     Returns
     -------
@@ -149,7 +217,7 @@ def get_clusters_sizes_2(clusters_full, layer, NxT, X, repulse=False, n_jobs=10)
     subset_X = X[layer]
 
     if repulse:
-        embedding = phate.mds.embed_MDS(repulsion(subset_X.copy()), n_jobs=n_jobs)
-    else:
-        embedding = phate.mds.embed_MDS(subset_X, n_jobs=n_jobs)
+        subset_X = repulsion(subset_X.copy())
+
+    embedding = phate.mds.embed_MDS(subset_X, n_jobs=n_jobs, seed=random_state)
     return embedding, clusters_full[unique[1]], unique[2]
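
The new `get_levels` walks the condensation gradient and keeps each index that both sets a new running minimum and sits at a local dip, treating those indices as salient resolutions of the condensation tree. A standalone copy of its logic on a toy gradient:

```python
import numpy as np

def get_levels(grad):
    # Keep indices that set a new running minimum and precede an increase.
    minimum = np.max(grad)
    levels = [0]
    for i in range(1, len(grad) - 1):
        if grad[i] <= minimum and grad[i] < grad[i + 1]:
            levels.append(i)
            minimum = grad[i]
    return levels

# Toy gradient with dips at indices 2 and 5.
grad = np.array([5.0, 4.0, 1.0, 3.0, 2.0, 0.5, 2.0, 1.0])
print(get_levels(grad))  # [0, 2, 5]
```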
