-
Notifications
You must be signed in to change notification settings - Fork 83
Added op3loader #921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added op3loader #921
Conversation
Thanks for contributing this, @Paulos2411 ! Let me know when you think this is ready for a review |
|
||
return filtered_adata | ||
|
||
def filter_by_counts(adata, par): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this filtering correspond to in OP3, @Paulos2411 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be ok nevertheless. In OP3 benchmark it's covered later by filterbyexpr
Hi @rcannood, I think the PR is now ready to review :) |
@Paulos2411 , seems to me that it's still missing the fields we discussed, like hvg, feature_name etc. That is, the obligatory fields from here: https://openproblems.bio/documentation/fundamentals/datasets |
@Paulos2411 this is missing a workflow, like https://github.com/openproblems-bio/openproblems/blob/main/src/datasets/workflows/scrnaseq/process_openproblems_v1/main.nf |
loader is mixed with filtering here. I wonder if filtering wouldn't fit better in "processors". filtering here is specific to OP3, but I suppose it's ok to create a dataset-specific processor |
.gitignore
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you shouldn't be adding your local unnecessary file to .gitignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding this to a review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution! I made a few comments
.gitignore
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding this to a review
name: op3_loader | ||
namespace: datasets/loaders/scrnaseq | ||
description: | | ||
Loads and preprocesses the OP3 dataset from GEO accession GSE279945. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what level of preprocessing can be left in the "loader", @rcannood ?
import pandas as pd | ||
import numpy as np | ||
import requests | ||
from tqdm import tqdm # liberary for displaying progress bars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spelling of the comment + this import comment can be skipped
|
||
logger.info("Done") | ||
|
||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to define main as a separate function and calling it here - i.e. the main code should be written directly in scripts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can stay the way it is
assert adata.uns["dataset_description"] == "Test description for OP3 dataset", "Incorrect .uns['dataset_description']" | ||
|
||
|
||
if __name__ == '__main__': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once again just write the script directly in the file - these files are not expected to be ever imported elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can have a look at other similar scripts in the repo
name: process_op3 | ||
namespace: datasets/workflows/scrnaseq | ||
description: | | ||
Fetch and process datasets from the Open Problems Perurbation Prediction (OP3) dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"process" can encompass a lot - would be great to make it more specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you managed to execute this nextflow pipeline, @Paulos2411 and check the output?
Superseded by #926 |
Describe your changes
Created a new data loader which takes in sc data.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE279945
Issue ticket number and link
Focuses on the Improvement of the task pertubation prediction
openproblems-bio/task_perturbation_prediction#86
Checklist before requesting a review
I have performed a self-review of my code
Check the correct box. Does this PR contain:
Proposed changes are described in the CHANGELOG.md
CI Tests succeed and look good!
Link: https://openproblems.bio/documentation/advanced_topics/create_a_dataset_loader