
Commit 209c5b3

Added pipeline and queue support

1 parent 43fe1ac commit 209c5b3

10 files changed: +349 -107 lines

README.md

Lines changed: 43 additions & 2 deletions
@@ -95,6 +95,8 @@ PIPEX runs sequentially a series of commands written in the file `pipex_batch_li

 The following commands are currently available:

+- `run_id <run identifier>` : an optional command to tag your current PIPEX run with an identifier. Useful if you want to integrate PIPEX into a bigger pipeline, like PyShowwwcase. **OBS**: PIPEX will assign a random run_id if you don't add this command.
+
 - `swap <number of GB>` : Linux only, it will generate a temporary swap space in the installation folder with the specified size; the space will be automatically deleted at the end of the full PIPEX process. **OBS**: it will require root permission/password while executing.

 - `segmentation.py` : performs PIPEX segmentation. Uses the following parameters:
@@ -254,14 +256,24 @@ If you add the `generate_geojson` command to PIPEX command list a `cell_segmenta


 TissUUmaps integration
-------------------
+----------------------

 If you add the `generate_tissuumaps` command to the PIPEX command list, an `anndata_TissUUmaps.h5ad` file will be generated in your analysis/downstream sub-folder. You can open this file in TissUUmaps. To do so:
 - Install TissUUmaps (https://tissuumaps.github.io/TissUUmaps-docs/docs/intro/installation.html)
 - Load the `anndata_TissUUmaps.h5ad` file in TissUUmaps

 If you add the `include_html=yes` parameter to the `generate_tissuumaps` command, a `TissUUmaps_webexport` folder will be generated in your analysis/downstream sub-folder. You can share this folder on a web server and access it from any web browser.

+**NOTE**: TissUUmaps requires your images to be in `TIFF` format and named exactly after your markers (for example: `DAPI.tif`, `CPEP.tif`, etc.)
+
+
+Pipeline integration
+--------------------
+
+PIPEX can be integrated as a step in a bigger pipeline or queue (see the wrapper sketch after this list).
+- By default, a random `run_id` is assigned to every PIPEX operations batch, and a file named `LAST_RUN_ID` containing the same identifier is generated in the root folder once the process is finished.
+- You can add a file named `run_id.txt` to the root folder containing a specific identifier if you want to force PIPEX to use it for the next run. The `LAST_RUN_ID` file will be updated accordingly when the process is finished.
+- You can also directly specify a run identifier with the PIPEX command `run_id`.
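A minimal sketch (an editor's illustration, not part of this commit) of the handshake described in the list above, as seen from an outer pipeline; the PIPEX root path and entry point name are assumptions:

<code>
# Hypothetical wrapper around a PIPEX run; the file names follow the README
# bullets above, but the install path and entry point are assumptions.
import subprocess
from pathlib import Path

pipex_root = Path("/opt/pipex")  # assumed installation folder

# Force the identifier PIPEX will use for its next run
(pipex_root / "run_id.txt").write_text("experiment_42")

# Launch the batch (entry point name is an assumption)
subprocess.run(["python", "pipex.py"], cwd=pipex_root, check=True)

# Once finished, LAST_RUN_ID holds the identifier actually used
print("finished run", (pipex_root / "LAST_RUN_ID").read_text().strip())
</code>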
Annex 1: Detailed segmentation explanation
@@ -403,4 +415,33 @@ PIPEX's analysis step includes an optional marker filtration commonly used in Ce
 - `CDH1` 1% top ranked intensities cell removal
 - `CTNNB1` 1% top ranked intensities cell removal

-Please make sure you the name of your marker column is a stric match with the aforementioned ones
+Please make sure the name of your marker column is a strict match with the aforementioned ones
+
+
+Annex 4: Cluster refinement procedure
+-------------------------------------
+
+PIPEX's analysis step can refine the unsupervised clustering results (leiden and/or kmeans). This can help you with the manual annotation and merging of the automatically discovered clusters.
+
+The idea behind the cluster refinement algorithm is to explore the ranked genes associated with each cluster and try to match them against rules stated by the user. The algorithm then assigns a confidence score per cluster and rule, depending on how close the cluster's ranked genes are to the rule definitions. Finally, the refinement picks, per cluster, the annotated cluster with the highest confidence (ties are resolved by row order).
+
+To use the cluster refinement, you have to create a `cell_types.csv` file with rows containing the following information:
+- `cell_group`: used as a prefix for the manually annotated cluster name. The final cluster name will be `[cell_group]-[cell_type]-[cell_subtype]`
+- `cell_type`: used as an infix for the manually annotated cluster name. The final cluster name will be `[cell_group]-[cell_type]-[cell_subtype]`
+- `cell_subtype`: used as a suffix for the manually annotated cluster name. The final cluster name will be `[cell_group]-[cell_type]-[cell_subtype]`
+- `rank_filter`: used to direct the refinement procedure to use only certain ranked genes. Default is `all` (no filtering); you can use `positive_only` (use only ranked genes with positive values)
+- `min_confidence`: by default, the refinement procedure aggressively merges all clusters that minimally fulfill the indicated rules. You can force the process to be stricter by using a higher `min_confidence` probability (values from 0 to 100)
+- `marker[n]` and `rule[n]` pairs: you can add an arbitrary number (at least one!) of marker rules to guide how the algorithm annotates/merges the automatically discovered clusters. The marker value must match one of your analysis markers, and the rule states where the marker should be placed relative to the other ranked genes (values `high`, `medium`, `low`)
+
+Here's an example of how a `cell_types.csv` file usually looks:
+<code>
+
+cell_group,cell_type,cell_subtype,rank_filter,min_confidence,marker1,rule1,marker2,rule2,marker3,rule3
+artifact,fold,unknown,all,10,CBS,high,CHGA,high,AMY2B,high
+endocrine,islet,all,positive_only,10,CHGA,high,CPEP,high,AMY2B,low
+exocrine,acinar,unknown1,all,10,CBS,high,AMY2B,high
+endothelial,vessels,all,positive_only,30,CD31,high,aSMA,high
+epithelial,ductal,unknown,all,10,KRT19,high,PANCK,high
+immune,potential,artifact,all,10,HLADR,high,NPDC1,high,aSMA,low
+
+</code>
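To make the confidence scoring concrete, here is a simplified, self-contained sketch of the per-rule check (the full implementation is `check_cell_type_threshold` in `analysis.py`, shown below; thresholds are linear 25%/75% points between the cluster's minimum and maximum ranked-gene scores, and the example values are invented):

<code>
# Simplified per-rule confidence: 100 for a clear match, 50 for a
# borderline one, 0 otherwise.
def rule_confidence(score, rule, score_min, score_max):
    q75 = (score_max - score_min) * 75 / 100 + score_min
    q25 = (score_max - score_min) * 25 / 100 + score_min
    if rule == 'high':
        return 100 if score >= q75 else (50 if score > q25 else 0)
    if rule == 'low':
        return 100 if score <= q25 else (50 if score < q75 else 0)
    return 100 if q25 < score < q75 else 50  # 'medium'

print(rule_confidence(8.5, 'high', -2.0, 10.0))  # 100: 8.5 is above q75 = 7.0
</code>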

analysis.py

Lines changed: 77 additions & 40 deletions
@@ -47,14 +47,7 @@ def data_calculations():
     #Getting the list of marker names
     markers = list(df_norm.columns.values)
     markers = markers[(df_norm.columns.get_loc("y") + 1):]
-    #saveguard if analysis.py has been executed before and cluster_id + cluster_color already exists
-    if any("_bin_thres" in s for s in markers):
-        markers = markers[:-(len(df_norm.columns) - df_norm.columns.get_loc(list(filter(lambda x: '_bin_thres' in x, markers))[0]))]
-    else:
-        if 'leiden' in markers:
-            markers = markers[:-(len(df_norm.columns) - df_norm.columns.get_loc("leiden"))]
-        elif 'kmeans' in markers:
-            markers = markers[:-(len(df_norm.columns) - df_norm.columns.get_loc("kmeans"))]
+    markers = [x for x in markers if "memref" not in x and "_bin" not in x and "_ratio_pixels" not in x and "_local_90" not in x and "leiden" not in x and "kmeans" not in x and "_ref" not in x]

     #If a specific list of markers is informed, we use it
     if len(analysis_markers) > 0:
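As a quick, self-contained illustration of what the new one-line filter keeps and drops (the column names are invented):

<code>
# Derived columns (binarizations, pixel ratios, local percentiles, cluster
# ids and refined clusters) are dropped; plain marker columns survive.
markers = ["DAPI", "CPEP", "CPEP_bin_thres", "CD31_ratio_pixels",
           "AMY2B_local_90", "leiden", "kmeans", "kmeans_ref", "CHGA"]
markers = [x for x in markers if "memref" not in x and "_bin" not in x
           and "_ratio_pixels" not in x and "_local_90" not in x
           and "leiden" not in x and "kmeans" not in x and "_ref" not in x]
print(markers)  # ['DAPI', 'CPEP', 'CHGA']
</code>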
@@ -278,41 +271,47 @@ def generate_leiden_color(leiden_id, leiden_color_list):
     return ''


-def check_cell_type_threshold(curr_marker, curr_rule, curr_score, high_threshold, low_threshold):
+def check_cell_type_threshold(curr_marker, curr_rule, curr_score, high_threshold, low_threshold, rank_filter):
     if curr_rule == 'high':
         if curr_score >= high_threshold:
-            return 100
+            return 100, "high"
         elif curr_score > low_threshold:
-            return 50
+            return 50, "medium"
     elif curr_rule == 'low':
         if curr_score <= low_threshold:
-            return 100
+            if rank_filter == "all" or curr_score >= 0:
+                return 100, "low"
         elif curr_score < high_threshold:
-            return 50
+            return 50, "medium"
     else:
         if curr_score > low_threshold and curr_score < high_threshold:
-            return 100
+            return 100, "medium"
+        elif curr_score >= high_threshold:
+            return 50, "high"
         else:
-            return 50
-    return 0
+            if rank_filter == "all" or curr_score >= 0:
+                return 50, "low"
+    return 0, "none"


-def check_cell_type(row, cluster_id, clustering_merge_data):
-    high_threshold = clustering_merge_data['scores'][cluster_id]['q75']
-    low_threshold = clustering_merge_data['scores'][cluster_id]['q25']
+def check_cell_type(row, cluster_id, clustering_merge_data, rank_filter):
+    high_threshold = clustering_merge_data['scores'][cluster_id]['rank_filter']['positive_only']['q75'] if rank_filter == "positive_only" else clustering_merge_data['scores'][cluster_id]['rank_filter']['all']['q75']
+    low_threshold = clustering_merge_data['scores'][cluster_id]['rank_filter']['positive_only']['q25'] if rank_filter == "positive_only" else clustering_merge_data['scores'][cluster_id]['rank_filter']['all']['q25']
     final_score = 0
+    rule_match = {}
     num_rules = 1
     while ('marker' + str(num_rules)) in row and not pd.isnull(row['marker' + str(num_rules)]):
         curr_marker = row['marker' + str(num_rules)]
         curr_rule = row['rule' + str(num_rules)]
         num_rules = num_rules + 1
         if curr_marker in clustering_merge_data['scores'][cluster_id]['markers']:
             curr_score = clustering_merge_data['scores'][cluster_id]['markers'][curr_marker]
-            final_score = check_cell_type_threshold(curr_marker, curr_rule, curr_score, high_threshold, low_threshold)
+            final_score, marker_level = check_cell_type_threshold(curr_marker, curr_rule, curr_score, high_threshold, low_threshold, rank_filter)
+            rule_match[curr_marker] = marker_level
     if final_score > 0:
-        return final_score / (num_rules - 1)
+        return final_score / (num_rules - 1), rule_match

-    return None
+    return None, None


 def calculate_cluster_info(adata, cluster_type):
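For reference, the `marker[n]`/`rule[n]` walk in `check_cell_type` stops at the first empty pair, so trailing blank columns in `cell_types.csv` are harmless; a self-contained illustration on a hypothetical row:

<code>
# The while loop ends when marker<n> is absent or empty (NaN in pandas).
import pandas as pd

row = pd.Series({'marker1': 'CHGA', 'rule1': 'high',
                 'marker2': 'CPEP', 'rule2': 'high',
                 'marker3': None, 'rule3': None})
num_rules = 1
while ('marker' + str(num_rules)) in row and not pd.isnull(row['marker' + str(num_rules)]):
    print(row['marker' + str(num_rules)], '->', row['rule' + str(num_rules)])
    num_rules = num_rules + 1
# prints CHGA -> high and CPEP -> high, then stops at the empty marker3
</code>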
@@ -371,7 +370,8 @@ def refine_clustering(adata, cluster_type):
     clustering_merge_data = {}
     clustering_merge_data['scores'] = {}
     clustering_merge_data['cell_types'] = {}
-    cluster_dif_list = []
+    cluster_dif_list_all = []
+    cluster_dif_list_positive = []
     for cluster_id in adata.obs[cluster_type].unique():
         cluster_score_list = []
         cluster_merge_clusters_scores = {}
@@ -381,24 +381,60 @@ def refine_clustering(adata, cluster_type):
             curr_score = rank_df[rank_df['names'] == marker_id]['scores'].values[0]
             cluster_merge_clusters_scores['markers'][marker_id] = float(curr_score)
             cluster_score_list.append(curr_score)
-        cluster_merge_clusters_scores['score_max'] = float(max(cluster_score_list))
-        cluster_merge_clusters_scores['score_min'] = float(min(cluster_score_list))
-        cluster_merge_clusters_scores['score_dif'] = cluster_merge_clusters_scores['score_max'] - cluster_merge_clusters_scores['score_min']
-        cluster_dif_list.append(cluster_merge_clusters_scores['score_dif'])
-        cluster_merge_clusters_scores['q75'] = float((cluster_merge_clusters_scores['score_max'] - cluster_merge_clusters_scores['score_min']) * 75 / 100 + cluster_merge_clusters_scores['score_min'])
-        cluster_merge_clusters_scores['q25'] = float((cluster_merge_clusters_scores['score_max'] - cluster_merge_clusters_scores['score_min']) * 25 / 100 + cluster_merge_clusters_scores['score_min'])
+
+        cluster_merge_clusters_scores['rank_filter'] = {}
+
+        cluster_merge_clusters_scores['rank_filter']['all'] = {}
+        cluster_merge_clusters_scores['rank_filter']['all']['score_max'] = float(max(cluster_score_list))
+        cluster_merge_clusters_scores['rank_filter']['all']['score_min'] = float(min(cluster_score_list))
+        cluster_merge_clusters_scores['rank_filter']['all']['score_dif'] = cluster_merge_clusters_scores['rank_filter']['all']['score_max'] - cluster_merge_clusters_scores['rank_filter']['all']['score_min']
+        cluster_dif_list_all.append(cluster_merge_clusters_scores['rank_filter']['all']['score_dif'])
+        cluster_merge_clusters_scores['rank_filter']['all']['q75'] = float((cluster_merge_clusters_scores['rank_filter']['all']['score_max'] - cluster_merge_clusters_scores['rank_filter']['all']['score_min']) * 75 / 100 + cluster_merge_clusters_scores['rank_filter']['all']['score_min'])
+        cluster_merge_clusters_scores['rank_filter']['all']['q25'] = float((cluster_merge_clusters_scores['rank_filter']['all']['score_max'] - cluster_merge_clusters_scores['rank_filter']['all']['score_min']) * 25 / 100 + cluster_merge_clusters_scores['rank_filter']['all']['score_min'])
+
+        cluster_score_list_positive = list(filter(lambda x: x >= 0, cluster_score_list))
+        if len(cluster_score_list_positive) > 0:
+            cluster_merge_clusters_scores['rank_filter']['positive_only'] = {}
+            cluster_merge_clusters_scores['rank_filter']['positive_only']['score_max'] = float(max(cluster_score_list_positive))
+            cluster_merge_clusters_scores['rank_filter']['positive_only']['score_min'] = float(min(cluster_score_list_positive))
+            cluster_merge_clusters_scores['rank_filter']['positive_only']['score_dif'] = cluster_merge_clusters_scores['rank_filter']['positive_only']['score_max'] - cluster_merge_clusters_scores['rank_filter']['positive_only']['score_min']
+            cluster_dif_list_positive.append(cluster_merge_clusters_scores['rank_filter']['positive_only']['score_dif'])
+            cluster_merge_clusters_scores['rank_filter']['positive_only']['q75'] = float((cluster_merge_clusters_scores['rank_filter']['positive_only']['score_max'] - cluster_merge_clusters_scores['rank_filter']['positive_only']['score_min']) * 75 / 100 + cluster_merge_clusters_scores['rank_filter']['positive_only']['score_min'])
+            cluster_merge_clusters_scores['rank_filter']['positive_only']['q25'] = float((cluster_merge_clusters_scores['rank_filter']['positive_only']['score_max'] - cluster_merge_clusters_scores['rank_filter']['positive_only']['score_min']) * 25 / 100 + cluster_merge_clusters_scores['rank_filter']['positive_only']['score_min'])
+
         clustering_merge_data['scores'][cluster_id] = cluster_merge_clusters_scores
         clustering_merge_data['cell_types'][cluster_id] = []
-    clustering_merge_data['dif_median'] = float(statistics.median(cluster_dif_list))
+    clustering_merge_data['dif_median_all'] = float(statistics.median(cluster_dif_list_all))
+    if len(cluster_dif_list_positive) > 0:
+        clustering_merge_data['dif_median_positive'] = float(statistics.median(cluster_dif_list_positive))

     cell_types = pd.read_csv(data_folder + '/cell_types.csv')
     for index, row in cell_types.iterrows():
         for cluster_id in clustering_merge_data['scores']:
-            cell_type_prob = check_cell_type(row, cluster_id, clustering_merge_data)
-            if cell_type_prob is not None:
-                if clustering_merge_data['scores'][cluster_id]['score_dif'] < clustering_merge_data['dif_median']:
-                    cell_type_prob = cell_type_prob * clustering_merge_data['scores'][cluster_id]['score_dif'] / clustering_merge_data['dif_median']
-                clustering_merge_data['cell_types'][cluster_id].append({ 'cell_type' : row['cell_group'] + '.' + row['cell_type'] + '.' + row['cell_subtype'], 'prob' : cell_type_prob, 'confidence_threshold' : row['min_confidence']})
+            if row['rank_filter'] != "positive_only" or "positive_only" in clustering_merge_data['scores'][cluster_id]['rank_filter']:
+                cell_type_prob, marker_ranks = check_cell_type(row, cluster_id, clustering_merge_data, row['rank_filter'])
+                if cell_type_prob is not None:
+                    curr_final_merging_data = {'cell_type': row['cell_group'] + '.' + row['cell_type'] + '.' + row['cell_subtype'], 'prob': cell_type_prob, 'rank_filter': row['rank_filter'], 'confidence_threshold': row['min_confidence']}
+                    if row['rank_filter'] == "positive_only":
+                        curr_final_merging_data['score_vs_median'] = str(clustering_merge_data['scores'][cluster_id]['rank_filter']['positive_only']['score_dif']) + " / " + str(clustering_merge_data['dif_median_positive'])
+                        cluster_marker_ranks = ""
+                        for curr_marker_rank in marker_ranks:
+                            cluster_marker_ranks = cluster_marker_ranks + curr_marker_rank + ":" + marker_ranks[curr_marker_rank] + ","
+                        curr_final_merging_data['marker_ranks'] = cluster_marker_ranks[:-1]
+                        if clustering_merge_data['scores'][cluster_id]['rank_filter']['positive_only']['score_dif'] < clustering_merge_data['dif_median_positive']:
+                            cell_type_prob = cell_type_prob * clustering_merge_data['scores'][cluster_id]['rank_filter']['positive_only']['score_dif'] / clustering_merge_data['dif_median_positive']
+                            curr_final_merging_data['prob'] = cell_type_prob
+                        clustering_merge_data['cell_types'][cluster_id].append(curr_final_merging_data)
+                    else:
+                        curr_final_merging_data['score_vs_median'] = str(clustering_merge_data['scores'][cluster_id]['rank_filter']['all']['score_dif']) + " / " + str(clustering_merge_data['dif_median_all'])
+                        cluster_marker_ranks = ""
+                        for curr_marker_rank in marker_ranks:
+                            cluster_marker_ranks = cluster_marker_ranks + curr_marker_rank + ":" + marker_ranks[curr_marker_rank] + ","
+                        curr_final_merging_data['marker_ranks'] = cluster_marker_ranks[:-1]
+                        if clustering_merge_data['scores'][cluster_id]['rank_filter']['all']['score_dif'] < clustering_merge_data['dif_median_all']:
+                            cell_type_prob = cell_type_prob * clustering_merge_data['scores'][cluster_id]['rank_filter']['all']['score_dif'] / clustering_merge_data['dif_median_all']
+                            curr_final_merging_data['prob'] = cell_type_prob
+                        clustering_merge_data['cell_types'][cluster_id].append(curr_final_merging_data)
     clustering_merge_data['candidates'] = {}
     adata.obs[cluster_type + "_ref"] = adata.obs[cluster_type].astype(str)
     adata.obs[cluster_type + "_ref_p"] = adata.obs[cluster_type].astype(str)
@@ -465,11 +501,6 @@ def clustering(df_norm, markers):
     plt.clf()
     plt.close()

-    #standard_embedding = umap.UMAP(random_state=0,low_memory=True).fit_transform(adata.obsm['X_pca'])
-    #plt.scatter(standard_embedding[:, 0], standard_embedding[:, 1], s=0.1, cmap='Spectral');
-    #plt.savefig('umaptest.jpg')
-    #sys.exit(0)
-
     if (leiden == 'yes'):
         #We calculate leiden cluster
         sc.tl.leiden(adata)
@@ -483,6 +514,7 @@ def clustering(df_norm, markers):
             refine_clustering(adata, 'leiden')
             calculate_cluster_info(adata, "leiden_ref")
         except Exception as e:
+            print(e)
             print(">>> Failed at refining leiden cluster =", datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S"), flush=True)

if (kmeans == 'yes'):
@@ -544,6 +576,7 @@ def clustering(df_norm, markers):
544576
refine_clustering(adata, 'kmeans')
545577
calculate_cluster_info(adata, "kmeans_ref")
546578
except Exception as e:
579+
print(e)
547580
print(">>> Failed at refining kmeans cluster =", datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S"), flush=True)
548581

549582
if (leiden == 'yes' or kmeans == 'yes'):
@@ -659,6 +692,10 @@ def options(argv):
     with open(pidfile_filename, 'w', encoding='utf-8') as f:
         f.write(str(os.getpid()))
         f.close()
+    with open(data_folder + '/log_settings_analysis.txt', 'w+', encoding='utf-8') as f:
+        f.write(">>> Start time analysis = " + datetime.datetime.now().strftime(" %H:%M:%S_%d/%m/%Y") + "\n")
+        f.write(' '.join(sys.argv))
+        f.close()

     print(">>> Start time analysis =", datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S"), flush=True)