-
Notifications
You must be signed in to change notification settings - Fork 9
Implementation of CellPLM #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thanks for getting started with this 🎉!
Running on a GPU is ok. When this is the case the config file for the component should inherit from
I think this might be because you have deleted the
Can you please point us to where this is documented? We will take a look and see what is possible. I noticed you have added a lot of extra files, some of which have copyright notices. The added code should be the minimum needed to run each method and if most cases should be stored in the directory for that component. Can you please remove anything that isn't needed and move it to the appropriate directory? |
Hi, I got this information after directly asking the authors of scFoundation. Do you think if it will be helpful to reach out these authors to see if they allow us to create a google drive link for download? Thanks. Update: I have made corrections based on your recommendations. For the copyright file, I do not think we can easily remove tham as they are depended files used in scFoundation infernece (check the load.py and ger_embedding.py). Do you think I need to clean them and finally have a file without risk, or just remove this method from our list? |
How do they currently distribute the model? We don't need Google Drive specifically, but we do need some public location where the model is available.
If the files have been copied directly then we should definitely not remove any copyright notices. It sounds like there are some licensing issues around scFoundation that we would need to work out. For now I think it would be better to remove it and just add CellPLM until we are able to look into this more (or open a new PR with just CellPLM if that's easier). @rcannood Do you have any other thoughts? |
Hi, all options sound good to me. scFoundation distributs the weights based on onedrive bussiness version, and I tried multiple approach to download the weights via link or in the codes but they all fail. It seems that onedrive has more strict useage. |
Ok. Let's change this PR to just have CellPLM or open a new one (whichever is easier) until we can look more into how to handle scFoundation. |
Hi, sound sounds good to me. I have removed the content related to scFoundation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a closer look at the code and made some comments. Can you please try to address these? Let me know if anything isn't clear.
I have resolved these comments. Please let mw know if you have more questions. Thanks a lot. |
Hi, I have made necessary adjustment of the codes. For the instructions of file reading, if you prefer having a local file, I can adjust it, and maybe in this case I cannot test the codes in my local computer. |
I think it still downloads from Google Drive? Are there maybe some updates missing. For using a local file you could look at the SCimilarity implementation as a example. It should be pretty similar for testing locally, you would just need to download and save the model somewhere rather than having it done by the script. |
Hi, I have updated the codes with local file checking. Does this look good? Maybe you can manually download models with different versions and test them in cloud. |
Thanks! I will do some more testing and make some suggestions/changes. |
@HelloWorldLTY I have tested this and it seems to be working 🎉! I made a few changes but it was mostly to tidy things up and make it more consistent with other methods. I have one questions before we finalise it. You used normalised data as input but the example data in the tutorial seems to be counts. Did you find any documentation for what the input should be? |
Hi @lazappi happy to hear that, and your question is really great and we need to clarify it. Actually they do not have any document-level instructions in the original code, and their shared data have different format (some are count-based, like gse155468, others are normalized-based, like demo_train). I search their codes and found that they have mentioned the log-norm function in the code base, which is used to evaluate reconstruct and denoise effect. Therefore, I think they may expect log-norm data as inputs. See: https://github.com/search?q=repo%3AOmicsML%2FCellPLM%20normalize&type=code. Maybe we can provide codes for both implementation. |
Thanks for checking! I had a closer look at the example datasets and it looks like they have counts except for |
Make sense. Thanks for checking! |
The authors replied and said they expect counts as input so we should make that change. It looks like there are some minor conflicts now so I can take care of that as well. |
Thanks a lot. That is fantastic |
* origin/main: (28 commits) Adjust pyliger dependencies (openproblems-bio#43) Minor run adjustments (openproblems-bio#44) Fix reference in task description (openproblems-bio#45) Remove grouping bottleneck (openproblems-bio#42) Retry with single core in geneformer (openproblems-bio#41) Adjustments based on benchmark run (openproblems-bio#39) Update compute environment (openproblems-bio#35) Replace "+" in batch names for liger (openproblems-bio#37) Update common submodule to commit 62268aa (openproblems-bio#38) Adjust features filter in scPRINT (openproblems-bio#36) minor tweaks to the cell cycle component (openproblems-bio#34) Update common submodule to commit a2e845 (openproblems-bio#33) Increase time resources (openproblems-bio#27) GPU adjustments (openproblems-bio#32) Add batch_hvg var column to solution (openproblems-bio#31) Add fine-tuned scGPT (openproblems-bio#17) Bump scIB to v1.1.7 (openproblems-bio#30) Pin scPRINT version (openproblems-bio#25) Increase UCE memory tag (openproblems-bio#24) Add exit codes helper file (openproblems-bio#26) ...
@rcannood I think this is ready for you to take a look at |
Describe your changes
Checklist before requesting a review
I have performed a self-review of my code
Check the correct box. Does this PR contain:
Proposed changes are described in the CHANGELOG.md
CI Tests succeed and look good!
Hi, I have some questions about the CI test.
For CellPLM, it is only runnable in gpu, not in cpu. Even if they provide the device keyword, but their pre-trained models are trained based on gpu and we may need to modify their packages.
For scFoundation, I keep on meeting the problem:
File "/tmp/viash-run-scfoundation-5vhnk1.py", line 60, in
sys.path.append(meta["resources_dir"])
KeyError: 'resources_dir'
It is strange and in cellplm or scgpt I did not meet such an error. It will be great if someone can take a look and help me resolve this problem.
Also, scFoundation does not allow us to upload the model weights elsewhere publicly, could we store the model weights internally and load it if possible? Thanks.