Added op3loader #926

Open: wants to merge 27 commits into base: main
Commits
c14c06a  Added op3 loader (Paulos2411, Feb 25, 2025)
d267844  Minor Changes (Paulos2411, Feb 26, 2025)
1c5733c  Updated Code with Output Data (Paulos2411, Mar 11, 2025)
9ee60b5  Minor Updates (Paulos2411, Mar 18, 2025)
7ab54e5  Minor Updates (Paulos2411, Mar 18, 2025)
2fb2aed  Minor Update (Paulos2411, Mar 18, 2025)
b924061  Added Workflow (Paulos2411, Mar 19, 2025)
9b6efac  Update script.py (Paulos2411, Mar 18, 2025)
6fbaa71  workflow for op3_loader (Paulos2411, Mar 19, 2025)
b38d93f  Implemented Suggestions (Paulos2411, Mar 25, 2025)
573bd28  removed space (Paulos2411, Mar 25, 2025)
d854e13  No changes (Paulos2411, Mar 25, 2025)
d869f8b  move input file to a separate argument (rcannood, May 16, 2025)
820abfa  add initial script (rcannood, May 16, 2025)
e35be7d  added small fixes for op3 pipeline (May 22, 2025)
51ef0f7  Removed two obsolete changes (May 22, 2025)
746dee3  removed scanpy cell and gene filtering (May 23, 2025)
aa841e1  an example script for running the data processing pipeline is added (May 23, 2025)
4c0829d  changed the name of an input file (May 23, 2025)
c24a0c1  removed .r filtration script (May 28, 2025)
83a08cf  unnecessary import is deleted in script.py (Olga013, May 29, 2025)
f5d0989  unnecessary state is deleted in process_op3/main.nf (Olga013, May 29, 2025)
672e34d  duplicated rows are deleted in process_op3/config.vsh.yaml (Olga013, May 29, 2025)
506e04e  minor fixes in .bash scripts are added (Jun 12, 2025)
abdd608  fixing splits is added (Jun 12, 2025)
37525d0  publish_dir and main-script dir were changed (Jun 13, 2025)
1026ba3  the name of dir in s3 is changed (Jun 13, 2025)
107 changes: 107 additions & 0 deletions src/datasets/loaders/scrnaseq/op3_loader/config.vsh.yaml
@@ -0,0 +1,107 @@
name: op3_loader
namespace: datasets/loaders/scrnaseq
description: |
  Loads and preprocesses the OP3 dataset from GEO accession GSE279945.

argument_groups:
  - name: Input
    arguments:
      - name: "--input"
        type: string
        description: "Input URL of the .h5ad file."
        direction: input
        required: false
        default: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE279nnn/GSE279945/suppl/GSE279945_sc_counts_processed.h5ad
        example: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE279nnn/GSE279945/suppl/GSE279945_sc_counts_processed.h5ad
      - name: "--var_feature_name"
        type: string
        description: "Where to find the feature names. Can be set to index if the feature names are the index."
        default: index
  - name: Data Filtering
    description: "Arguments for filtering the dataset"
    arguments:
      - name: "--donor_id"
        type: string
        description: "Donor ID to filter for (1, 2, or 3). If not specified, all donors are included."
        required: false
      - name: "--cell_type"
        type: string
        description: "Cell type to filter for (T cells, B cells, NK cells, or Myeloid). If not specified, all cell types are included."
        required: false
      - name: "--perturbation"
        type: string
        description: "Perturbation to filter for. If not specified, all perturbations are included."
        required: false

  - name: Dataset Metadata
    description: "Metadata about the dataset"
    arguments:
      - name: "--dataset_id"
        type: string
        description: "Unique identifier for the dataset"
        default: "op3"
      - name: "--dataset_name"
        type: string
        description: "Human-readable name for the dataset"
        default: "OP3: single-cell multimodal dataset in PBMCs for perturbation prediction benchmarking"
      - name: "--dataset_summary"
        type: string
        description: "Short summary of the dataset"
        default: "The Open Problems Perturbation Prediction (OP3) dataset with small molecule perturbations in PBMCs"
      - name: "--dataset_description"
        type: string
        description: "Detailed description of the dataset"
        default: "The OP3 dataset is, to date, the largest single-cell small molecule perturbation dataset in primary tissue with multiple donor replicates."
      - name: "--dataset_reference"
        type: string
        description: "Bibtex reference of the paper in which the dataset was published."
        required: false
        default: GSE279945
      - name: "--dataset_url"
        type: string
        description: "Link to the original source of the dataset."
        required: false
        default: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE279nnn/GSE279945/suppl/GSE279945_sc_counts_processed.h5ad

  - name: Output
    description: "Output parameters"
    arguments:
      - name: "--output"
        type: file
        description: "Output h5ad file."
        direction: output
        required: true
      - name: "--output_compression"
        type: string
        choices: [gzip, lzf]
        required: false
        default: "gzip"

resources:
  - type: python_script
    path: script.py

engines:
  - type: docker
    image: python:3.11
    setup:
      - type: python
        packages:
          - scanpy
          - anndata
          - pandas
          - numpy
          - requests
    test_setup:
      - type: python
        packages:
          - viashpy

runners:
  - type: executable
  - type: nextflow

test_resources:
  - type: python_script
    path: test.py
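
The three optional arguments in the Data Filtering group could be applied roughly as follows. This is only a minimal sketch using plain dictionaries instead of the AnnData object that script.py presumably works with. The metadata keys, the `filter_cells` helper, and the `compound_a`/`compound_b` labels are placeholders for illustration, not names confirmed by this PR.

```python
from typing import Dict, List, Optional

Cell = Dict[str, str]


def filter_cells(
    cells: List[Cell],
    donor_id: Optional[str] = None,
    cell_type: Optional[str] = None,
    perturbation: Optional[str] = None,
) -> List[Cell]:
    """Keep only the cells that match every filter that was provided."""
    requested = {
        "donor_id": donor_id,
        "cell_type": cell_type,
        "perturbation": perturbation,
    }
    # A filter left as None is skipped, so omitting an argument keeps
    # all donors / cell types / perturbations, as the config describes.
    active = {key: value for key, value in requested.items() if value is not None}
    return [
        cell
        for cell in cells
        if all(cell.get(key) == value for key, value in active.items())
    ]


# Made-up records; real OP3 metadata fields and values may differ.
cells = [
    {"donor_id": "1", "cell_type": "T cells", "perturbation": "compound_a"},
    {"donor_id": "1", "cell_type": "B cells", "perturbation": "compound_b"},
    {"donor_id": "2", "cell_type": "T cells", "perturbation": "compound_a"},
    {"donor_id": "3", "cell_type": "Myeloid", "perturbation": "compound_b"},
]

donor_one = filter_cells(cells, donor_id="1")
t_cells_a = filter_cells(cells, cell_type="T cells", perturbation="compound_a")
```

Because the arguments are declared as `type: string`, the sketch compares donor IDs as strings; a loader reading numeric donor columns would need to convert one side first.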

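
A small check like the one below could help reviewers spot an argument declared without the `--` prefix, which Viash would, as far as I understand, treat as a positional argument rather than a named option. The `unprefixed_arguments` helper and the sample structure are hypothetical; the sample is a hand-written stand-in for a parsed config, not the actual file contents.

```python
from typing import Any, Dict, List


def unprefixed_arguments(argument_groups: List[Dict[str, Any]]) -> List[str]:
    """Return the names of arguments that do not start with '--'."""
    return [
        argument["name"]
        for group in argument_groups
        for argument in group.get("arguments", [])
        if not argument["name"].startswith("--")
    ]


# Hypothetical excerpt with one deliberately unprefixed entry.
sample_groups = [
    {"name": "Input", "arguments": [{"name": "--input"}]},
    {
        "name": "Dataset Metadata",
        "arguments": [{"name": "--dataset_id"}, {"name": "missing_prefix"}],
    },
    {"name": "Empty group"},
]

flagged = unprefixed_arguments(sample_groups)
print(flagged)  # prints ['missing_prefix']
```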