Python package for performing quality control (QC) for data coordination (DC)
This Python package provides a framework for performing quality control (QC) on data files. Quality control can range from low-level integrity checks (e.g. MD5 checksum, file extension) to high-level checks such as conformance to a format specification and consistency with associated metadata.
The tool is designed to be flexible and extensible, allowing for:
- File integrity validation
- Format specification conformance
- Metadata consistency checks
- Custom test suite creation
- Integration with external QC tools
- Batch processing of multiple files
- Comprehensive reporting in JSON format
A File represents a local or remote file along with its metadata. Each file has an associated FileType that bundles information about:
- Valid file extensions
- EDAM format ontology identifiers
- File type-specific validation rules
Built-in file types include TXT, JSON, JSON-LD, TIFF, OME-TIFF, TSV, CSV, BAM, FASTQ, and HDF5.
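As a rough illustration of how a file type can bundle extensions with an EDAM identifier, here is a minimal sketch. `FileTypeSketch`, `REGISTRY`, and `has_valid_extension` are hypothetical names used only for this example, not py-dcqc's actual classes, and the EDAM identifiers shown are assumptions that should be checked against the EDAM ontology.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileTypeSketch:
    """Hypothetical stand-in for a file type; not py-dcqc's actual class."""

    name: str
    extensions: Tuple[str, ...]  # valid extensions, including the leading dot
    edam_format: str  # EDAM format identifier (assumed values below)


# Illustrative registry with two of the built-in types named above.
REGISTRY = {
    "CSV": FileTypeSketch("CSV", (".csv",), "format_3752"),
    "JSON": FileTypeSketch("JSON", (".json",), "format_3464"),
}


def has_valid_extension(path: str, file_type: FileTypeSketch) -> bool:
    """Return True if the path ends with one of the type's extensions."""
    return path.lower().endswith(file_type.extensions)
```

For instance, `has_valid_extension("data/sample.csv", REGISTRY["CSV"])` returns `True`, while a `.tsv` path checked against the same type returns `False`.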
A Target represents one or more files that should be validated together. There are two types of targets:
- SingleTarget: For validating individual files
- PairedTarget: For validating exactly two related files together (e.g., paired-end sequencing data)
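The distinction between the two target types can be sketched as a constraint on the number of files. The classes below are hypothetical stand-ins written for this example (the real SingleTarget and PairedTarget may be implemented differently); they only assume that a single target holds one file and a paired target holds exactly two.

```python
class TargetSketch:
    """Hypothetical base class: a group of files validated together."""

    expected_files = None  # None means "any positive number of files"

    def __init__(self, files):
        files = list(files)
        if not files:
            raise ValueError("A target needs at least one file")
        if self.expected_files is not None and len(files) != self.expected_files:
            raise ValueError(
                f"{type(self).__name__} expects {self.expected_files} "
                f"file(s), got {len(files)}"
            )
        self.files = files


class SingleTargetSketch(TargetSketch):
    expected_files = 1  # assumption: one file per single target


class PairedTargetSketch(TargetSketch):
    expected_files = 2  # exactly two related files, e.g. paired-end reads
```

With this sketch, `PairedTargetSketch(["reads_R1.fastq", "reads_R2.fastq"])` succeeds, while passing two files to `SingleTargetSketch` raises a `ValueError`.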
Tests are individual validation checks that can be run on targets. There are two types of tests:
- Internal Tests: Tests written and executed in Python
  - File extension validation
  - Metadata consistency checks
  - Format validation
- External Tests: Tests that utilize external tools or processes
  - File integrity checks (MD5, checksums)
  - Format-specific validation tools
  - Custom validation scripts
Tests are further organized into tiers:
- Tier 1 - File Integrity: Checking that the file is whole and "available". These tests verify basic file integrity and usually require additional information, such as an expected checksum. They include:
  - MD5 checksum verification
  - Expected file extension checks
  - Format-specific checks (e.g., first/last bytes)
  - Decompression checks if applicable
- Tier 2 - Internal Conformance: Checking that the file is internally consistent and compliant with its stated format. These tests only need the files themselves and their format specification:
  - File format validation using available tools
  - Internal metadata validation against schema (e.g., OME XML)
  - Additional checks on internal metadata
- Tier 3 - External Conformance: Checking that file features are consistent with separately submitted metadata. These tests use additional information but remain objective/quantitative:
  - Channel count consistency
  - File/image size consistency
  - Antibody nomenclature conformance
  - Secondary file presence (e.g., CRAI file for CRAM)
- Tier 4 - Subjective Conformance: Checking files against qualitative criteria that may need expert review. These tests often involve metrics, heuristics, or sophisticated models:
  - Sample swap detection
  - PHI detection in images and metadata
  - Outlier detection using metrics (e.g., file size)
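To make the Tier 1 idea concrete, the following sketch shows one way an MD5 checksum verification could be written with Python's standard library. The function names `md5_of_file` and `md5_matches` are hypothetical and chosen for this example; this is not necessarily how py-dcqc implements its checksum test.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Return True if the file's MD5 equals the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat for large files, which matters for formats such as BAM or OME-TIFF. The expected checksum is the "additional information" a Tier 1 test of this kind needs.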
A Suite is a collection of tests that are specific to a particular file type (e.g., FASTQ, BAM, CSV). Each file type has its own suite of tests that are appropriate for that format. Suites:
- Group tests together based on the target file type
- Can specify required vs optional tests:
  - By default, Tier 1 (File Integrity) and Tier 2 (Internal Conformance) tests are required
  - Users can explicitly specify which tests are required by name
- Allow tests to be skipped if specified in the suite
- Provide overall validation status:
  - GREEN: All tests passed
  - RED: One or more required tests failed
  - AMBER: All required tests passed, but optional tests failed
  - GREY: Error occurred during testing
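The status rules above can be sketched as a small function. This is an illustration under stated assumptions, not the package's implementation: `suite_status` is a hypothetical name, test outcomes are represented as the strings "pass", "fail", and "error", and the precedence (a required failure outranks an error, which outranks an optional failure) is an assumption the text does not specify.

```python
def suite_status(results, required):
    """Derive an overall status from per-test outcomes.

    results: dict mapping test name -> "pass", "fail", or "error"
    required: set of test names that must pass
    """
    # RED: one or more required tests failed.
    if any(results[name] == "fail" for name in results if name in required):
        return "RED"
    # GREY: an error occurred during testing (assumed to rank below RED).
    if any(outcome == "error" for outcome in results.values()):
        return "GREY"
    # AMBER: required tests passed, but some optional test failed.
    if any(outcome == "fail" for outcome in results.values()):
        return "AMBER"
    # GREEN: every test passed.
    return "GREEN"
```

For example, with `required = {"ChecksumTest"}`, a failing optional test yields AMBER, while a failing `ChecksumTest` yields RED regardless of the other results.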
Reports provide structured output of test results, including:
- JSON reports for machine readability
- CSV updates for batch processing
- Detailed test status and error messages
- Aggregated results across multiple suites
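As an illustration of the CSV update idea, the sketch below appends a `dcqc_status` column to CSV text given a mapping from a key column to a suite status. The function name `add_status_column`, the `url` key column, and the `NONE` placeholder for rows without a result are all assumptions made for this example; the actual input schema and behavior of the package may differ.

```python
import csv
import io


def add_status_column(csv_text, statuses, key_column="url"):
    """Return CSV text with a dcqc_status column filled from `statuses`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if "dcqc_status" not in fieldnames:
        fieldnames.append("dcqc_status")

    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Rows with no known status get a placeholder (an assumption).
        row["dcqc_status"] = statuses.get(row[key_column], "NONE")
        writer.writerow(row)
    return output.getvalue()
```

Keeping the original columns in place and only adding one status column means the updated file can be fed back into the same batch process.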
You can install py-dcqc directly from PyPI:
pip install dcqc
For development installation from source:
git clone https://github.com/Sage-Bionetworks-Workflows/py-dcqc.git
cd py-dcqc
pip install -e .
You can also use the official Docker container:
docker pull ghcr.io/sage-bionetworks-workflows/py-dcqc:latest
To run commands using the Docker container:
docker run ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc --help
For processing local files, remember to mount your data directory:
docker run -v /path/to/your/data:/data ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc qc-file --input-file /data/myfile.csv --file-type csv
To see all available commands and their options:
dcqc --help
Main commands include:
- create-targets: Create target JSON files from a targets CSV file
- create-tests: Create test JSON files from a target JSON file
- create-process: Create external process JSON file from a test JSON file
- compute-test: Compute the test status from a test JSON file
- create-suite: Create a suite from a set of test JSON files sharing the same target
- combine-suites: Combine several suite JSON files into a single JSON report
- list-tests: List the tests available for each file type
- qc-file: Run QC tests on a single file (external tests are skipped)
- update-csv: Update input CSV file with dcqc_status column
For detailed help on any command:
dcqc <command> --help
- Run QC on a single file:
dcqc qc-file --input-file data.csv --file-type csv --metadata '{"author": "John Doe"}'
- Create targets from a CSV file:
dcqc create-targets input_targets.csv output_dir/
- Create tests for a target:
dcqc create-tests target.json tests_dir/ --required-tests "ChecksumTest" "FormatTest"
- Run tests and create a suite:
dcqc create-suite --output-json results.json test1.json test2.json test3.json
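When calling the CLI from Python, the JSON passed to `--metadata` is easy to get wrong with hand-written quoting. The sketch below builds the argument list for the single-file command shown above, using only the options that appear in that example. `build_qc_file_command` is a hypothetical helper written for this illustration, not part of the package.

```python
import json
import shlex


def build_qc_file_command(input_file, file_type, metadata=None):
    """Build the argument list for `dcqc qc-file` as shown in the example."""
    command = [
        "dcqc",
        "qc-file",
        "--input-file",
        input_file,
        "--file-type",
        file_type,
    ]
    if metadata is not None:
        # Serialize the metadata dict so it is passed as one JSON argument.
        command += ["--metadata", json.dumps(metadata)]
    return command


command = build_qc_file_command("data.csv", "csv", {"author": "John Doe"})
# shlex.join produces a shell-safe string for display or logging.
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` if dcqc is installed; passing a list rather than a string avoids shell quoting issues around the JSON.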
To see all available tests for different file types:
dcqc list-tests
Early versions of this package were developed to be used by its sibling, the nf-dcqc Nextflow workflow. The initial command-line interface was developed with nf-dcqc in mind, favoring smaller steps to enable parallelism in Nextflow.
This project has been set up using PyScaffold 4.3. For details and usage information on PyScaffold see https://pyscaffold.org/.
putup --name dcqc --markdown --github-actions --pre-commit --license Apache-2.0 py-dcqc