
py-dcqc


Python package for performing quality control (QC) for data coordination (DC)

Purpose

This Python package provides a framework for performing quality control (QC) on data files. Quality control can range from low-level integrity checks (e.g. MD5 checksum, file extension) to high-level checks such as conformance to a format specification and consistency with associated metadata.

The tool is designed to be flexible and extensible, allowing for:

  • File integrity validation
  • Format specification conformance
  • Metadata consistency checks
  • Custom test suite creation
  • Integration with external QC tools
  • Batch processing of multiple files
  • Comprehensive reporting in JSON format

Core Concepts

Files and FileTypes

A File represents a local or remote file along with its metadata. Each file has an associated FileType that bundles information about:

  • Valid file extensions
  • EDAM format ontology identifiers
  • File type-specific validation rules

Built-in file types include TXT, JSON, JSON-LD, TIFF, OME-TIFF, TSV, CSV, BAM, FASTQ, and HDF5.
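The bundling described above can be pictured with a minimal sketch. The class and field names here (`FileType`, `file_extensions`, `edam_format`) are illustrative assumptions for explanation, not necessarily py-dcqc's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch only: names and fields are hypothetical,
# not py-dcqc's actual classes.
@dataclass(frozen=True)
class FileType:
    name: str                         # e.g., "CSV"
    file_extensions: tuple[str, ...]  # valid extensions for this type
    edam_format: str                  # EDAM format ontology identifier

CSV = FileType("CSV", (".csv",), "format_3752")  # format_3752 = CSV in EDAM

def matches_extension(file_type: FileType, filename: str) -> bool:
    """Check whether a filename carries one of the type's valid extensions."""
    return filename.lower().endswith(file_type.file_extensions)
```

A check like `matches_extension(CSV, "data.csv")` is the kind of file type-specific validation rule a file type can carry.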

Targets

A Target represents one or more files that should be validated together. There are two types of targets:

  • SingleTarget: For validating individual files
  • PairedTarget: For validating exactly two related files together (e.g., paired-end sequencing data)
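The distinction between the two target kinds can be sketched as follows. These classes are simplified illustrations, not py-dcqc's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketches of the two target kinds.
@dataclass
class SingleTarget:
    file: str  # path or URL of the one file to validate

@dataclass
class PairedTarget:
    files: list[str]  # exactly two related files validated together

    def __post_init__(self) -> None:
        # Paired-end sequencing data, for example, always comes in twos.
        if len(self.files) != 2:
            raise ValueError("PairedTarget requires exactly two files")
```

Constructing `PairedTarget(["reads_R1.fastq", "reads_R2.fastq"])` succeeds, while any other file count is rejected up front.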

Tests

Tests are individual validation checks that can be run on targets. There are two types of tests:

  1. Internal Tests: Tests written and executed in Python

    • File extension validation
    • Metadata consistency checks
    • Format validation
  2. External Tests: Tests that utilize external tools or processes

    • File integrity checks (MD5, checksums)
    • Format-specific validation tools
    • Custom validation scripts
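The internal/external split above can be illustrated with two toy checks. An internal test runs entirely in Python; an external test would normally shell out to another tool, so the MD5 check below (done in-process with `hashlib` for the sake of a self-contained example) only stands in for that idea:

```python
import hashlib
from pathlib import Path

# Internal test sketch: pure Python, no outside process needed.
def file_extension_test(path: str, allowed: tuple[str, ...]) -> bool:
    return path.lower().endswith(allowed)

# External tests typically invoke a separate tool or process; this stand-in
# computes the MD5 digest in-process purely for demonstration.
def md5_test(path: str, expected_md5: str) -> bool:
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    return digest == expected_md5
```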

Tests are further organized into tiers:

  • Tier 1 - File Integrity: Checking that the file is whole and "available". These tests verify basic file integrity and usually require additional information supplied alongside the file (e.g., an expected checksum):

    • MD5 checksum verification
    • Expected file extension checks
    • Format-specific checks (e.g., first/last bytes)
    • Decompression checks if applicable
  • Tier 2 - Internal Conformance: Checking that the file is internally consistent and compliant with its stated format. These tests only need the files themselves and their format specification:

    • File format validation using available tools
    • Internal metadata validation against schema (e.g., OME XML)
    • Additional checks on internal metadata
  • Tier 3 - External Conformance: Checking that file features are consistent with separately submitted metadata. These tests use additional information but remain objective/quantitative:

    • Channel count consistency
    • File/image size consistency
    • Antibody nomenclature conformance
    • Secondary file presence (e.g., CRAI file for CRAM)
  • Tier 4 - Subjective Conformance: Checking files against qualitative criteria that may need expert review. These tests often involve metrics, heuristics, or sophisticated models:

    • Sample swap detection
    • PHI detection in images and metadata
    • Outlier detection using metrics (e.g., file size)
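The four tiers, and the default rule (stated under Suites) that Tier 1 and Tier 2 tests are required, can be expressed in a small sketch. The enum name and property are illustrative, not py-dcqc's actual API:

```python
from enum import IntEnum

# Hypothetical sketch of the four test tiers.
class TestTier(IntEnum):
    FILE_INTEGRITY = 1
    INTERNAL_CONFORMANCE = 2
    EXTERNAL_CONFORMANCE = 3
    SUBJECTIVE_CONFORMANCE = 4

    @property
    def required_by_default(self) -> bool:
        # Tier 1 and Tier 2 tests are required unless a suite
        # explicitly overrides which tests are required.
        return self <= TestTier.INTERNAL_CONFORMANCE
```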

Suites

A Suite is a collection of tests that are specific to a particular file type (e.g., FASTQ, BAM, CSV). Each file type has its own suite of tests that are appropriate for that format. Suites:

  • Group tests together based on the target file type
  • Can specify required vs optional tests:
    • By default, Tier 1 (File Integrity) and Tier 2 (Internal Conformance) tests are required
    • Users can explicitly specify which tests are required by name
  • Allow tests to be skipped if specified in the suite
  • Provide overall validation status:
    • GREEN: All tests passed
    • RED: One or more required tests failed
    • AMBER: All required tests passed, but one or more optional tests failed
    • GREY: Error occurred during testing

Reports

Reports provide structured output of test results in various formats:

  • JSON reports for machine readability
  • CSV updates for batch processing
  • Detailed test status and error messages
  • Aggregated results across multiple suites
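To make the JSON output concrete, here is a hypothetical report shape rendered with the standard library; the field names and layout are assumptions for illustration and the real py-dcqc schema may differ:

```python
import json

# Hypothetical report shape; the actual py-dcqc JSON schema may differ.
report = {
    "target": {"file": "data.csv", "file_type": "CSV"},
    "suite_status": "AMBER",
    "tests": [
        {"name": "FileExtensionTest", "tier": 1, "status": "passed", "required": True},
        {"name": "ExampleOptionalTest", "tier": 3, "status": "failed", "required": False},
    ],
}
print(json.dumps(report, indent=2))
```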

Installation

You can install py-dcqc directly from PyPI:

pip install dcqc

For development installation from source:

git clone https://github.com/Sage-Bionetworks-Workflows/py-dcqc.git
cd py-dcqc
pip install -e .

Docker

You can also use the official Docker container:

docker pull ghcr.io/sage-bionetworks-workflows/py-dcqc:latest

To run commands using the Docker container:

docker run ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc --help

For processing local files, remember to mount your data directory:

docker run -v /path/to/your/data:/data ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc qc_file --input-file /data/myfile.csv --file-type csv

Command Line Interface

To see all available commands and their options:

dcqc --help

Main commands include:

  • create_targets: Create target JSON files from a targets CSV file
  • create_tests: Create test JSON files from a target JSON file
  • create_process: Create external process JSON file from a test JSON file
  • compute_test: Compute the test status from a test JSON file
  • create_suite: Create a suite from a set of test JSON files sharing the same target
  • combine_suites: Combine several suite JSON files into a single JSON report
  • list_tests: List the tests available for each file type
  • qc_file: Run QC tests on a single file (external tests are skipped)
  • update_csv: Update input CSV file with dcqc_status column

For detailed help on any command:

dcqc <command> --help

Example Usage

Basic File QC

Run QC on a single file:

dcqc qc-file --input-file data.csv --file-type csv --metadata '{"author": "John Doe"}'

Creating and Running Test Suites

  1. Create targets from a CSV file:
dcqc create-targets input_targets.csv output_dir/
  2. Create tests for a target:
dcqc create-tests target.json tests_dir/ --required-tests "ChecksumTest" "FormatTest"
  3. Run tests and create a suite:
dcqc create-suite --output-json results.json test1.json test2.json test3.json

Listing Available Tests

To see all available tests for different file types:

dcqc list-tests

Integration with nf-dcqc

Early versions of this package were developed to be used by its sibling, the nf-dcqc Nextflow workflow. The initial command-line interface was developed with nf-dcqc in mind, favoring smaller steps to enable parallelism in Nextflow.

PyScaffold

This project has been set up using PyScaffold 4.3. For details and usage information on PyScaffold see https://pyscaffold.org/.

putup --name dcqc --markdown --github-actions --pre-commit --license Apache-2.0 py-dcqc
