TCGA-Tools

TCGA-Tools is a framework to facilitate the easy download of TCGA public Whole Slide Image (WSI) datasets for computational pathology projects. It leverages the GDC Data Transfer Tool (gdc-client) to efficiently download large datasets and processes clinical and biospecimen annotations for downstream analyses like survival prediction and classification.

Features

GDC-Client Build & Installation:
Automatically builds the gdc-client executable from source using a Git submodule.
Clean data storage: Ensures the datasets are stored in a clean and manageable way (e.g. no subdirectories) alogn with their annotations
Annotation Processing:
Processes clinical and biospecimen annotation files (tar.gz) from raw TCGA data to generate survival prediction and classification annotation files, which can be used for downstream tasks like PathBench-MIL
Configurable Workflow:
Provides a command-line interface with multiple parameters for customization.

Requirements

Operating System:
Unix-like (Linux/macOS) or Windows (with a Unix-like shell, e.g., Git Bash).
Python:
Python 3.x
Python Packages:
- pandas
- tqdm
Optional:
SLURM for HPC job submission

Repository Structure

After installation, your repository root should resemble:

TCGA-Tools/
├── gdc-client/               # gdc-client Git submodule (cloned from https://github.com/NCI-GDC/gdc-client)
├── gdc-client_exec           # Built gdc-client executable
├── install_gdc.sh            # Script to build and install gdc-client and set up the venv
├── LICENSE
├── manifest_mapping.csv
├── manifests/                # Contains TCGA manifest files (e.g., gdc_manifest.tcga_brca.txt)
├── raw_annotations/          # Contains raw annotation tar.gz files (clinical and biospecimen)
├── README.md
├── tcga_tools.py             # Main download and processing script
└── venv/                     # Local virtual environment

Installation

1. Clone the Repository with Submodules

Clone the repository (including the gdc-client submodule) with:

git clone --recurse-submodules https://github.com/your_username/TCGA-Tools.git
cd TCGA-Tools

Build and Install gdc-client Run the provided installation script to build the gdc-client executable from the submodule. The script will install gdc-client, its virtual environment and the virtual environment for TCGA-tools.

./install_gdc.sh

After the script completes, verify that gdc-client_exec exists in your repository root.

Usage

The main script, tcga_tools.py, downloads TCGA datasets using the gdc-client executable and processes annotation files. Its command-line interface supports multiple parameters.

Basic Usage

venv/bin/python tcga_tools.py --datasets TCGA-LUSC TCGA-BRCA \ # Can be any of the datasets present in ./manifests
    --parent-dir /path/to/data \
    --manifest-dir /path/to/manifests \ # Defaults to./manifests
    --raw-annotations-dir /path/to/raw_annotations # Defaults to ./raw_annotations

Input Parameters and Defaults

--datasets: Required. List of TCGA dataset names to download (e.g., TCGA-LUSC TCGA-BRCA).

--parent-dir: Required. Parent directory where the downloaded datasets will be stored.

--manifest-dir: Directory containing the manifest files. Default: ./manifests

--raw-annotations-dir: Directory containing the raw annotation tar.gz files. Default: ./raw_annotations

--gdc-client-dir: Path to the gdc-client submodule directory. Default: ./gdc-client

--gdc-client-path: Path to the gdc-client executable. Default: ./gdc-client_exec

--build-gdc: If specified, forces building the gdc-client executable from source (even if one already exists).

--n-processes: (Optional) Number of parallel download processes. Passed to gdc-client with -n. For example, --n-processes 4.

--verbose: (Optional) Enable verbose output from gdc-client (passed as --debug).

Example: Using 4 Processes with Verbose Output

python tcga_tools.py --datasets TCGA-LUSC TCGA-BRCA \
    --parent-dir "TCGA" \
    --manifest-dir "manifests" \
    --raw-annotations-dir "raw_annotations" \
    --n-processes 4 \
    --verbose

Running on HPC (SLURM)

Below is an example SLURM job script (shark_run.sbatch) for running the download with 4 processes:

#!/bin/sh
#SBATCH -J TCGA_DOWNLOAD
#SBATCH --mem=10G
#SBATCH --partition=all
#SBATCH --time=300:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

module purge > /dev/null 2>&1
module load system/gcc/13.2.0
module load system/python/3.9.17
module load tools/miniconda/python3.9/4.12.0

source venv/bin/activate

echo "Running TCGA download..."
venv/bin/python tcga_tools.py --datasets TCGA-LUSC TCGA-BRCA \
    --parent-dir "TCGA" \
    --manifest-dir "manifests" \
    --raw-annotations-dir "raw_annotations" \
    --n-processes 4 \
    --verbose

echo "TCGA download complete."
deactivate

Submit the job with:

sbatch shark_run.sbatch

Output

TCGA-Tools provides a clean directory output structure where slides and annotations are stored. For example, let us assume we have set 'parent_dir' to 'TCGA', and we downloaded the TCGA-LUSC and TCGA-BRCA datasets. Then our directory should look like this:

📂 TCGA/
 ├── 📂 TCGA-LUSC/                      # Dataset directory (one per dataset)
 │   ├── 📝 sample_to_case_map_tcga_lusc.tsv # Sample ID → Case ID mapping
 │   ├── 📝 tcga-TCGA-LUSC-download.log  # Download log file
 │   ├── 📜 TCGA-60-2712-01Z.svs         # Slide file (renamed based on Sample ID)
 │   ├── 📜 TCGA-56-7221-01Z.svs
 │   ├── 📜 TCGA-21-A5DI-01Z.svs
 │   ├── 📜 TCGA-60-2695-01Z.svs
 │   ├── 📜 TCGA-60-2710-01Z.svs
 │   ├── 📜 TCGA-22-5472-01Z.svs
 │   ├── 📜 ... (more .svs slides)
 │   ├── 📝 clinical_tcga_lusc.tsv       # Extracted clinical data (dataset-specific)
 │   ├── 📝 survival_annotation_tcga_lusc.tsv   # Processed survival data
 │   ├── 📝 classification_annotation_tcga_lusc.tsv # Processed classification data
 │   ├── 📝 primary_diagnosis_annotation_tcga_lusc.tsv # Diagnosis-based classification
 │
 ├── 📂 TCGA-BRCA/                      # Another dataset, same structure
 │   ├── 📝 sample_to_case_map_tcga_brca.tsv
 │   ├── 📝 tcga-TCGA-BRCA-download.log
 │   ├── 📜 TCGA-XX-XXXX-01Z.svs
 │   ├── 📝 clinical_tcga_brca.tsv
 │   ├── 📝 survival_annotation_tcga_brca.tsv
 │   ├── 📝 classification_annotation_tcga_brca.tsv
 │   ├── 📝 primary_diagnosis_annotation_tcga_brca.tsv
 │
📂 manifests/                       # Directory for manifest files
 ├── 📝 gdc_manifest.tcga_lusc.txt
 ├── 📝 gdc_manifest.tcga_brca.txt
 │
📂 raw_annotations/                 # Directory for raw annotation files
 ├── 📜 clinical.project-tcga_lusc.tar.gz
 ├── 📜 biospecimen.project-tcga_lusc.tar.gz
 ├── 📜 clinical.project-tcga_brca.tar.gz
 ├── 📜 biospecimen.project-tcga_brca.tar.gz
 ├── 📝 gdc_sample_sheet_tcga_lusc.txt
 ├── 📝 gdc_sample_sheet_tcga_brca.txt
 |
 ├── 📝 README.md

Contributing

Contributions, bug reports, and feature requests are welcome. Please open issues or submit pull requests on GitHub.

License

This project is licensed under the terms specified in the LICENSE file.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

TCGA-Tools

Features

Requirements

Repository Structure

Installation

1. Clone the Repository with Submodules

Usage

Basic Usage

Input Parameters and Defaults

Running on HPC (SLURM)

Output

Contributing

License

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
gdc-client @ 8c9efd5		gdc-client @ 8c9efd5
manifests		manifests
raw_annotations		raw_annotations
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
install_gdc.sh		install_gdc.sh
shark_run.sh		shark_run.sh
tcga_tools.py		tcga_tools.py

License

LUMCPathAI/TCGA-Tools

Folders and files

Latest commit

History

Repository files navigation

TCGA-Tools

Features

Requirements

Repository Structure

Installation

1. Clone the Repository with Submodules

Usage

Basic Usage

Input Parameters and Defaults

Running on HPC (SLURM)

Output

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages