🏭 ENCODE sc/snATAC Automated Processing

Note: This pipeline is currently a work in progress.

This is the automated portion of the ENCODE single-cell/single-nucleus ATAC-seq pipeline.

Information on the specific analysis steps can be found in the pipeline specification document.

Requirements

  • A Linux-based OS
  • A conda-based Python 3 installation
  • Snakemake v6.6.1+ (full installation; an example setup follows this list)
  • An ENCODE DCC account with access to the necessary datasets

Additional requirements for cloud execution:

  • Kubectl
  • A cloud provider CLI for Kubernetes cluster creation
  • A cloud provider CLI for remote storage (if different from above)

All other dependencies are handled by the pipeline itself.
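
One way to satisfy the Snakemake requirement, assuming a fresh conda-based setup, is the mamba-based full installation recommended in the Snakemake documentation:

    # Install mamba into the base environment, then create a dedicated
    # environment containing the full Snakemake installation
    conda install -n base -c conda-forge mamba
    mamba create -c conda-forge -c bioconda -n snakemake snakemake

This creates the snakemake conda environment that is activated in the local execution steps below.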

Running the Pipeline

Local Execution

  1. Install the requirements listed above
  2. Download the pipeline
    git clone https://github.com/kundajelab/ENCODE_scatac
    
  3. Activate the snakemake conda environment:
    conda activate snakemake
    
  4. Configure the pipeline in the /config directory. Detailed information can be found here.
  5. Run the pipeline:
    snakemake -k --use-conda --cores $NCORES 
    
    Here, $NCORES is the number of cores to use. (A dry-run example follows the note below.)

Note: When run for the first time, the pipeline will take some time to install conda packages.
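
Before committing to a full run, you can preview the execution plan with Snakemake's standard dry-run flag; this is a general Snakemake feature rather than anything specific to this pipeline:

    # Print the jobs that would be executed, without running them
    snakemake -n --use-conda --cores $NCORES

    # Also print the reason each job is scheduled
    snakemake -n -r --use-conda --cores $NCORES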

Cloud Execution with Kubernetes

  1. Install and configure the pipeline as specified above
  2. Create a cloud cluster. Note that setup specifics may differ depending on the cloud provider. Example setup instructions are available for GCP and for Azure.
  3. Configure remote storage. Instructions for each provider can be found here. For our purposes, only the environment variables and command-line configuration are needed.
  4. Run the pipeline:
    snakemake -k --kubernetes --use-conda --default-remote-provider $REMOTE --default-remote-prefix $PREFIX --jobs $NJOBS --envvars $VARS
    
    Here:
    • $REMOTE is the cloud storage provider, and should be one of {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp,iRODS,AzBlob,XRootD}
    • $PREFIX is the target bucket name or subfolder in storage
    • $NJOBS is the maximum number of jobs to be run in parallel
    • $VARS is a list of environment variables for accessing remote storage. The --envvars flag can be omitted if no variables are required. (A worked example follows this list.)
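
As a concrete sketch, a run against Google Cloud Storage might look like the following. The bucket name and job count are placeholders, and GOOGLE_APPLICATION_CREDENTIALS is the standard GCP credential variable:

    # Hypothetical GCS-backed Kubernetes run; substitute your own bucket
    snakemake -k --kubernetes --use-conda \
        --default-remote-provider GS \
        --default-remote-prefix my-scatac-bucket \
        --jobs 50 \
        --envvars GOOGLE_APPLICATION_CREDENTIALS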

Additional Execution Modes

This pipeline has been tested locally and on the cloud via Kubernetes. However, Snakemake offers a number of additional execution modes.

  • Documentation on cluster execution
  • Documentation on cloud execution
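
For instance, cluster execution on a SLURM system would use Snakemake's generic --cluster flag. This mode is untested with this pipeline, so treat the following as a starting point:

    # Untested sketch: submit each job via sbatch, mapping Snakemake's
    # per-rule threads and memory onto SLURM resources
    snakemake -k --use-conda \
        --cluster "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}" \
        --jobs $NJOBS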

Authors

Austin Wang
Primary developer
atwang@stanford.edu

Surag Nair
Secondary developer and advisor
surag@stanford.edu

Ben Parks
Secondary developer and advisor
bparks@stanford.edu

Laksshman Sundaram
Advisor
lakss@stanford.edu

Caleb Lareau
Advisor
clareau@stanford.edu

William Greenleaf
Supervisor
wjg@stanford.edu

Anshul Kundaje
Supervisor
akundaje@stanford.edu
