Skip to content

Implementation of CellPLM #19

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

HelloWorldLTY
Copy link

@HelloWorldLTY HelloWorldLTY commented Dec 14, 2024

Describe your changes

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI Tests succeed and look good!

Hi, I have some questions about the CI test.

For CellPLM, it is only runnable in gpu, not in cpu. Even if they provide the device keyword, but their pre-trained models are trained based on gpu and we may need to modify their packages.

For scFoundation, I keep on meeting the problem:

File "/tmp/viash-run-scfoundation-5vhnk1.py", line 60, in
sys.path.append(meta["resources_dir"])
KeyError: 'resources_dir'

It is strange and in cellplm or scgpt I did not meet such an error. It will be great if someone can take a look and help me resolve this problem.

Also, scFoundation does not allow us to upload the model weights elsewhere publicly, could we store the model weights internally and load it if possible? Thanks.

@lazappi
Copy link
Member

lazappi commented Dec 17, 2024

Thanks for getting started with this 🎉!

For CellPLM, it is only runnable in gpu, not in cpu. Even if they provide the device keyword, but their pre-trained models are trained based on gpu and we may need to modify their packages.

Running on a GPU is ok. When this is the case the config file for the component should inherit from base_method.yaml instead of comp_method.yaml. This will skip some of the tests on the CI where we only have a CPU.

For scFoundation, I keep on meeting the problem:

File "/tmp/viash-run-scfoundation-5vhnk1.py", line 60, in sys.path.append(meta["resources_dir"]) KeyError: 'resources_dir'

It is strange and in cellplm or scgpt I did not meet such an error. It will be great if someone can take a look and help me resolve this problem.

I think this might be because you have deleted the ## VIASH END tag. This is important for the component to work properly and the section between ## VIASH START and ## VIASH END should only contain dictionaries as in the components for other methods.

Also, scFoundation does not allow us to upload the model weights elsewhere publicly, could we store the model weights internally and load it if possible? Thanks.

Can you please point us to where this is documented? We will take a look and see what is possible.

I noticed you have added a lot of extra files, some of which have copyright notices. The added code should be the minimum needed to run each method and if most cases should be stored in the directory for that component. Can you please remove anything that isn't needed and move it to the appropriate directory?

@HelloWorldLTY
Copy link
Author

HelloWorldLTY commented Dec 17, 2024

Hi, I got this information after directly asking the authors of scFoundation. Do you think if it will be helpful to reach out these authors to see if they allow us to create a google drive link for download? Thanks.

Update: I have made corrections based on your recommendations. For the copyright file, I do not think we can easily remove tham as they are depended files used in scFoundation infernece (check the load.py and ger_embedding.py). Do you think I need to clean them and finally have a file without risk, or just remove this method from our list?

@lazappi
Copy link
Member

lazappi commented Dec 18, 2024

Hi, I got this information after directly asking the authors of scFoundation. Do you think if it will be helpful to reach out these authors to see if they allow us to create a google drive link for download? Thanks.

How do they currently distribute the model? We don't need Google Drive specifically, but we do need some public location where the model is available.

Update: I have made corrections based on your recommendations. For the copyright file, I do not think we can easily remove tham as they are depended files used in scFoundation infernece (check the load.py and ger_embedding.py). Do you think I need to clean them and finally have a file without risk, or just remove this method from our list?

If the files have been copied directly then we should definitely not remove any copyright notices. It sounds like there are some licensing issues around scFoundation that we would need to work out. For now I think it would be better to remove it and just add CellPLM until we are able to look into this more (or open a new PR with just CellPLM if that's easier).

@rcannood Do you have any other thoughts?

@HelloWorldLTY
Copy link
Author

Hi, all options sound good to me.

scFoundation distributs the weights based on onedrive bussiness version, and I tried multiple approach to download the weights via link or in the codes but they all fail. It seems that onedrive has more strict useage.

@lazappi
Copy link
Member

lazappi commented Dec 19, 2024

scFoundation distributs the weights based on onedrive bussiness version, and I tried multiple approach to download the weights via link or in the codes but they all fail. It seems that onedrive has more strict useage.

Ok. Let's change this PR to just have CellPLM or open a new one (whichever is easier) until we can look more into how to handle scFoundation.

@HelloWorldLTY
Copy link
Author

Hi, sound sounds good to me. I have removed the content related to scFoundation.

@HelloWorldLTY HelloWorldLTY changed the title Implementation of CellPLM and scFoundation Implementation of CellPLM Dec 19, 2024
Copy link
Member

@lazappi lazappi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a closer look at the code and made some comments. Can you please try to address these? Let me know if anything isn't clear.

@HelloWorldLTY
Copy link
Author

I have resolved these comments. Please let mw know if you have more questions. Thanks a lot.

@HelloWorldLTY
Copy link
Author

Hi, I have made necessary adjustment of the codes. For the instructions of file reading, if you prefer having a local file, I can adjust it, and maybe in this case I cannot test the codes in my local computer.

@lazappi
Copy link
Member

lazappi commented Jan 8, 2025

Hi, I have made necessary adjustment of the codes. For the instructions of file reading, if you prefer having a local file, I can adjust it, and maybe in this case I cannot test the codes in my local computer.

I think it still downloads from Google Drive? Are there maybe some updates missing.

For using a local file you could look at the SCimilarity implementation as a example. It should be pretty similar for testing locally, you would just need to download and save the model somewhere rather than having it done by the script.

@HelloWorldLTY
Copy link
Author

Hi, I have updated the codes with local file checking. Does this look good? Maybe you can manually download models with different versions and test them in cloud.

@lazappi
Copy link
Member

lazappi commented Jan 9, 2025

Thanks! I will do some more testing and make some suggestions/changes.

@lazappi
Copy link
Member

lazappi commented Jan 10, 2025

@HelloWorldLTY I have tested this and it seems to be working 🎉! I made a few changes but it was mostly to tidy things up and make it more consistent with other methods.

I have one questions before we finalise it. You used normalised data as input but the example data in the tutorial seems to be counts. Did you find any documentation for what the input should be?

@HelloWorldLTY
Copy link
Author

Hi @lazappi happy to hear that, and your question is really great and we need to clarify it. Actually they do not have any document-level instructions in the original code, and their shared data have different format (some are count-based, like gse155468, others are normalized-based, like demo_train). I search their codes and found that they have mentioned the log-norm function in the code base, which is used to evaluate reconstruct and denoise effect. Therefore, I think they may expect log-norm data as inputs. See: https://github.com/search?q=repo%3AOmicsML%2FCellPLM%20normalize&type=code. Maybe we can provide codes for both implementation.

@lazappi
Copy link
Member

lazappi commented Jan 13, 2025

Thanks for checking! I had a closer look at the example datasets and it looks like they have counts except for demo_train.h5ad. I asked here OmicsML/CellPLM#22 but if they don't get back to us soon I think we should switch to counts.

@HelloWorldLTY
Copy link
Author

Make sense. Thanks for checking!

@lazappi
Copy link
Member

lazappi commented Jan 24, 2025

The authors replied and said they expect counts as input so we should make that change. It looks like there are some minor conflicts now so I can take care of that as well.

@HelloWorldLTY
Copy link
Author

Thanks a lot. That is fantastic

* origin/main: (28 commits)
  Adjust pyliger dependencies (openproblems-bio#43)
  Minor run adjustments (openproblems-bio#44)
  Fix reference in task description (openproblems-bio#45)
  Remove grouping bottleneck (openproblems-bio#42)
  Retry with single core in geneformer (openproblems-bio#41)
  Adjustments based on benchmark run (openproblems-bio#39)
  Update compute environment (openproblems-bio#35)
  Replace "+" in batch names for liger (openproblems-bio#37)
  Update common submodule to commit 62268aa  (openproblems-bio#38)
  Adjust features filter in scPRINT (openproblems-bio#36)
  minor tweaks to the cell cycle component (openproblems-bio#34)
  Update common submodule to commit a2e845 (openproblems-bio#33)
  Increase time resources (openproblems-bio#27)
  GPU adjustments (openproblems-bio#32)
  Add batch_hvg var column to solution (openproblems-bio#31)
  Add fine-tuned scGPT (openproblems-bio#17)
  Bump scIB to v1.1.7 (openproblems-bio#30)
  Pin scPRINT version (openproblems-bio#25)
  Increase UCE memory tag (openproblems-bio#24)
  Add exit codes helper file (openproblems-bio#26)
  ...
@lazappi
Copy link
Member

lazappi commented Jan 28, 2025

@rcannood I think this is ready for you to take a look at

@lazappi lazappi requested a review from rcannood January 28, 2025 06:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants