🧬✨ Tidy WiGiTS Outputs


Overview

{tidywigits} is an R package that parses and tidies outputs from the WiGiTS suite of genome and transcriptome analysis tools for cancer research and diagnostics, created by the Hartwig Medical Foundation.

In short, it traverses a directory containing results from one or more runs of WiGiTS tools, parses any files it recognises, tidies them up (including data reshaping, normalisation, and column name cleanup), and writes them to the output format of choice, e.g. Apache Parquet, PostgreSQL, TSV, or RDS.

🎨 Quick Start

The starting point of {tidywigits} is a directory with WiGiTS results. Let’s look at some sample data (tracked via DVC) under https://github.com/umccr/tidywigits/tree/main/inst/extdata/oa:

system.file("extdata/oa", package = "tidywigits") |>
  fs::dir_tree(invert = TRUE, glob = "*.dvc")
/Users/pdiakumis/Library/R/arm64/4.5/library/tidywigits/extdata/oa
├── alignments
│   └── sample1.duplicate_freq.tsv
├── amber
│   ├── sample1.amber.baf.pcf
│   ├── sample1.amber.contamination.tsv
│   ├── sample1.amber.homozygousregion.tsv
│   └── sample1.amber.qc
├── bamtools
│   └── sample1.wgsmetrics
├── chord
│   ├── sample1.chord.mutation_contexts.tsv
│   └── sample1.chord.prediction.tsv
├── cobalt
│   ├── cobalt.version
│   ├── sample1.cobalt.gc.median.tsv
│   ├── sample1.cobalt.ratio.median.tsv
│   └── sample1.cobalt.ratio.pcf
├── cuppa
│   ├── sample1.cuppa.pred_summ.tsv
│   ├── sample1.cuppa.vis_data.tsv
│   └── sample1.cuppa_data.tsv.gz
├── lilac
│   ├── sample1.lilac.candidates.coverage.tsv
│   ├── sample1.lilac.qc.tsv
│   └── sample1.lilac.tsv
├── linx
│   ├── germline_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.germline.breakend.tsv
│   │   ├── sample1.linx.germline.clusters.tsv
│   │   ├── sample1.linx.germline.disruption.tsv
│   │   ├── sample1.linx.germline.driver.catalog.tsv
│   │   ├── sample1.linx.germline.links.tsv
│   │   └── sample1.linx.germline.svs.tsv
│   └── somatic_annotations
│       ├── linx.version
│       ├── sample1.linx.breakend.tsv
│       ├── sample1.linx.clusters.tsv
│       ├── sample1.linx.driver.catalog.tsv
│       ├── sample1.linx.drivers.tsv
│       ├── sample1.linx.fusion.tsv
│       ├── sample1.linx.links.tsv
│       ├── sample1.linx.svs.tsv
│       ├── sample1.linx.vis_copy_number.tsv
│       ├── sample1.linx.vis_fusion.tsv
│       ├── sample1.linx.vis_gene_exon.tsv
│       ├── sample1.linx.vis_protein_domain.tsv
│       ├── sample1.linx.vis_segments.tsv
│       └── sample1.linx.vis_sv_data.tsv
├── purple
│   ├── purple.version
│   ├── sample1.purple.cnv.gene.tsv
│   ├── sample1.purple.cnv.somatic.tsv
│   ├── sample1.purple.driver.catalog.germline.tsv
│   ├── sample1.purple.driver.catalog.somatic.tsv
│   ├── sample1.purple.germline.deletion.tsv
│   ├── sample1.purple.purity.range.tsv
│   ├── sample1.purple.purity.tsv
│   ├── sample1.purple.qc
│   ├── sample1.purple.somatic.clonality.tsv
│   └── sample1.purple.somatic.hist.tsv
├── sage
│   ├── germline
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample2.sage.bqr.tsv
│   │   ├── sample2.sage.exon.medians.tsv
│   │   └── sample2.sage.gene.coverage.tsv
│   └── somatic
│       ├── sample1.sage.bqr.tsv
│       ├── sample1.sage.exon.medians.tsv
│       ├── sample1.sage.gene.coverage.tsv
│       └── sample2.sage.bqr.tsv
├── sigs
│   ├── sample1.sig.allocation.tsv
│   └── sample1.sig.snv_counts.csv
├── virusbreakend
│   └── sample1.virusbreakend.vcf.summary.tsv
└── virusinterpreter
    └── sample1.virus.annotated.tsv

We can parse, tidy up, and write the WiGiTS results into e.g. Parquet format or a PostgreSQL database as follows:

  • Parquet:
in_dir <- system.file("extdata/oa", package = "tidywigits")
out_dir <- tempdir() |> fs::dir_create("parquet_example")
w <- Wigits$new(in_dir)
res <- w$nemofy(odir = out_dir, format = "parquet", id = "parquet_example")
fs::dir_info(out_dir) |>
  dplyr::mutate(bname = basename(.data$path)) |>
  dplyr::select("bname", "size", "type")
# A tibble: 64 × 3
   bname                                              size type 
   <chr>                                       <fs::bytes> <fct>
 1 sample1_2_sage_bqrtsv.parquet                      3.1K file 
 2 sample1_alignments_dupfreq.parquet                1.95K file 
 3 sample1_amber_bafpcf.parquet                      3.27K file 
 4 sample1_amber_contaminationtsv.parquet            4.13K file 
 5 sample1_amber_homozygousregion.parquet            3.18K file 
 6 sample1_amber_qc.parquet                          2.35K file 
 7 sample1_bamtools_wgsmetrics_histo.parquet         4.19K file 
 8 sample1_bamtools_wgsmetrics_metrics.parquet      10.12K file 
 9 sample1_chord_prediction.parquet                  3.43K file 
10 sample1_chord_signatures.parquet                  2.17K file 
# ℹ 54 more rows
  • PostgreSQL:
in_dir <- system.file("extdata/oa", package = "tidywigits")
w <- Wigits$new(in_dir)
dbconn <- DBI::dbConnect(
  drv = RPostgres::Postgres(),
  dbname = "nemo",
  user = "orcabus"
)
res <- w$nemofy(
  format = "db",
  id = "db_example",
  dbconn = dbconn
)

IMPORTANT: support for VCFs is still under development.

πŸ• Installation

Using {remotes} directly from GitHub:

install.packages("remotes")
remotes::install_github("umccr/tidywigits") # latest main commit
remotes::install_github("umccr/tidywigits@v0.0.4") # released version

Alternatively, a conda package is available. For more details see: https://umccr.github.io/tidywigits/articles/installation

🌀 CLI

A tidywigits.R command line interface is available for convenience.

  • If you’re using the conda package, the tidywigits.R command will already be available inside the activated conda environment.
  • If you’re not using the conda package, you need to add the tidywigits/inst/cli/ directory to your PATH in order to use tidywigits.R:
tw_cli=$(Rscript -e 'x = system.file("cli", package = "tidywigits"); cat(x, "\n")' | xargs)
export PATH="${tw_cli}:${PATH}"
$ tidywigits.R --version
tidywigits 0.0.4

#-----------------------------------#
$ tidywigits.R --help
usage: tidywigits.R [-h] [-v] {tidy,list} ...

✨ WiGiTS Output Tidying ✨

positional arguments:
  {tidy,list}    sub-command help
    tidy         Tidy Workflow Outputs
    list         List Parsable Workflow Outputs

options:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
#-----------------------------------#
#------- Tidy ----------------------#
$ tidywigits.R tidy --help
usage: tidywigits.R tidy [-h] -d IN_DIR [-o OUT_DIR] [-f FORMAT] -i ID
                         [--dbname DBNAME] [--dbuser DBUSER]
                         [--include INCLUDE] [--exclude EXCLUDE] [-q]

options:
  -h, --help            show this help message and exit
  -d IN_DIR, --in_dir IN_DIR
                        Input directory.
  -o OUT_DIR, --out_dir OUT_DIR
                        Output directory.
  -f FORMAT, --format FORMAT
                        Format of output [def: parquet] (parquet, db, tsv,
                        csv, rds)
  -i ID, --id ID        ID to use for this run.
  --dbname DBNAME       Database name.
  --dbuser DBUSER       Database user.
  --include INCLUDE     Include only these files (comma,sep).
  --exclude EXCLUDE     Exclude these files (comma,sep).
  -q, --quiet           Shush all the logs.

#-----------------------------------#
#------- List ----------------------#
$ tidywigits.R list --help
usage: tidywigits.R list [-h] -d IN_DIR [-f FORMAT] [-q]

options:
  -h, --help            show this help message and exit
  -d IN_DIR, --in_dir IN_DIR
                        Input directory.
  -f FORMAT, --format FORMAT
                        Format of list output [def: pretty] (tsv, pretty)
  -q, --quiet           Shush all the logs.
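
As a rough illustration of how the two sub-commands fit together, the sketch below first lists the parsable files in a results directory and then tidies them into Parquet. It uses only flags that appear in the help text above. The input path, output path, and run ID are placeholder assumptions, and the run_tw helper is a hypothetical wrapper (not part of tidywigits) that prints each command and runs it only when tidywigits.R is found on the PATH.

```shell
# Placeholder values (assumptions): override them via environment variables.
in_dir="${IN_DIR:-/path/to/wigits/results}"
out_dir="${OUT_DIR:-./tidy_out}"
run_id="${RUN_ID:-run1}"

# Hypothetical helper: print the command, then run it only if
# tidywigits.R is available on the PATH.
run_tw() {
  echo "+ tidywigits.R $*"
  if command -v tidywigits.R >/dev/null 2>&1; then
    tidywigits.R "$@"
  fi
}

# 1. Preview which files are recognised, as a TSV listing.
run_tw list -d "$in_dir" -f tsv

# 2. Parse, tidy, and write the recognised files as Parquet.
run_tw tidy -d "$in_dir" -o "$out_dir" -f parquet -i "$run_id"
```

Running list first is a cheap way to check that the input directory is the one intended before anything is written. According to the help text, writing to a database instead would use -f db together with --dbname and --dbuser in place of the output directory and Parquet format.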

License

MIT (see LICENSE.md)