Hello everyone,
I am trying to integrate around 500,000 cells from around a dozen GSE datasets. However, I'm not sure how to assess the optimal parameter values; the results I get are more or less okay, but not great. I would be very grateful for clarification on several things and suggestions on what to do and what to avoid.
- I assume we should rely only on HVGs, around 2,000-5,000 I guess, rather than all genes? Should the count matrix be normalized? (My current preprocessing is sketched after this list.)
- Does the number of total and pretraining epochs significantly affect the final embedding? For example, will changing the defaults of 100 total and 70 pretraining epochs to 200 and 150 noticeably improve the result?
- Is there any rule of thumb for choosing the number of embedding and latent dimensions? Is 50 and 20 a reasonable choice, or maybe 50 and 50? I have some intuition for regular PCA or Harmony dimensionality selection, but your neural network has a fundamentally different nature; the latent dimensionality in particular is a mystery to me.
- I don't want to transfer any labels, so I removed the cell_type_keys parameter from the scPoli constructor and set labeled_indices to []. This is not the correct approach, right? Do we need to set labeled_indices anyway (scPoli Model for Unsupervised Use #224)?
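For context, my current HVG selection looks roughly like this (a minimal sketch; I'm assuming scanpy's seurat_v3 flavor, which operates on raw counts, with batch_key matching my GSE column; the n_top_genes value is just a placeholder):

```python
import scanpy as sc

# new_adata holds raw counts for the concatenated GSE datasets.
# The seurat_v3 flavor expects raw counts, so nothing is normalized before this step.
sc.pp.highly_variable_genes(
    new_adata,
    n_top_genes=3000,   # somewhere in the 2,000-5,000 range
    flavor='seurat_v3',
    batch_key='GSE',    # select HVGs per dataset, then combine
    subset=True,        # keep only the selected HVGs
)
```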
My code looks like this:
```python
# Assuming scArches is installed; scPoli lives under scarches.models.scpoli
from scarches.models.scpoli import scPoli

# Build the reference model on the concatenated AnnData object,
# conditioning on the GSE dataset of origin.
scpoli_model = scPoli(
    adata=new_adata,
    condition_keys=['GSE'],
    # cell_type_keys=cell_type_key,  # removed, since I don't want label transfer
    embedding_dims=50,
    latent_dim=20,
    recon_loss='nb',
)

# early_stopping_kwargs is defined earlier in my script (standard tutorial settings).
scpoli_model.train(
    n_epochs=100,
    pretraining_epochs=70,
    early_stopping_kwargs=early_stopping_kwargs,
    eta=5,
)

# Map the same data as an unlabeled query to obtain the integrated latent space.
scpoli_query = scPoli.load_query_data(
    adata=new_adata,
    reference_model=scpoli_model,
    labeled_indices=[],
)

data_latent = scpoli_query.get_latent(new_adata, mean=True)
```
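Downstream, I simply store the latent representation and inspect the integration with a standard scanpy neighbors/UMAP run (sketch below; the obsm key name is arbitrary):

```python
import scanpy as sc

# Use the scPoli latent space as the representation for neighbors/UMAP.
new_adata.obsm['X_scpoli'] = data_latent
sc.pp.neighbors(new_adata, use_rep='X_scpoli')
sc.tl.umap(new_adata)
sc.pl.umap(new_adata, color=['GSE'])  # check how well the source datasets mix
```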