Skip to content

Added op3loader #921

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
Closed

Added op3loader #921

wants to merge 15 commits into from

Conversation

Paulos2411
Copy link

@Paulos2411 Paulos2411 commented Feb 26, 2025

Describe your changes

Created a new data loader which takes in sc data.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE279945

Issue ticket number and link

Focuses on the Improvement of the task pertubation prediction

openproblems-bio/task_perturbation_prediction#86

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
    • Documentation
  • Proposed changes are described in the CHANGELOG.md

  • CI Tests succeed and look good!

Link: https://openproblems.bio/documentation/advanced_topics/create_a_dataset_loader

@rcannood
Copy link
Member

Thanks for contributing this, @Paulos2411 ! Let me know when you think this is ready for a review ☺️


return filtered_adata

def filter_by_counts(adata, par):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this filtering correspond to in OP3, @Paulos2411 ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be ok nevertheless. In OP3 benchmark it's covered later by filterbyexpr

@Paulos2411 Paulos2411 marked this pull request as ready for review March 18, 2025 17:03
@Paulos2411
Copy link
Author

Hi @rcannood, I think the PR is now ready to review :)

@Paulos2411 Paulos2411 requested a review from szalata March 18, 2025 17:07
@szalata
Copy link

szalata commented Mar 18, 2025

@Paulos2411 , seems to me that it's still missing the fields we discussed, like hvg, feature_name etc. That is, the obligatory fields from here: https://openproblems.bio/documentation/fundamentals/datasets

@szalata
Copy link

szalata commented Mar 18, 2025

@szalata
Copy link

szalata commented Mar 18, 2025

loader is mixed with filtering here. I wonder if filtering wouldn't fit better in "processors". filtering here is specific to OP3, but I suppose it's ok to create a dataset-specific processor

.gitignore Outdated
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't be adding your local unnecessary file to .gitignore

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this to a review

Copy link

@szalata szalata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution! I made a few comments

.gitignore Outdated
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this to a review

name: op3_loader
namespace: datasets/loaders/scrnaseq
description: |
Loads and preprocesses the OP3 dataset from GEO accession GSE279945.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what level of preprocessing can be left in the "loader", @rcannood ?

import pandas as pd
import numpy as np
import requests
from tqdm import tqdm # liberary for displaying progress bars
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling of the comment + this import comment can be skipped


logger.info("Done")

if __name__ == "__main__":
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to define main as a separate function and calling it here - i.e. the main code should be written directly in scripts

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can stay the way it is

assert adata.uns["dataset_description"] == "Test description for OP3 dataset", "Incorrect .uns['dataset_description']"


if __name__ == '__main__':
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once again just write the script directly in the file - these files are not expected to be ever imported elsewhere

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can have a look at other similar scripts in the repo

name: process_op3
namespace: datasets/workflows/scrnaseq
description: |
Fetch and process datasets from the Open Problems Perurbation Prediction (OP3) dataset.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"process" can encompass a lot - would be great to make it more specific

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you managed to execute this nextflow pipeline, @Paulos2411 and check the output?

@rcannood
Copy link
Member

Superseded by #926

@rcannood rcannood closed this May 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants