
Task: build

Simon Hackl edited this page May 19, 2025 · 5 revisions

The build task is the initial step in any MUSIAL analysis. It aggregates and harmonizes variant calls into a MUSIAL storage file, a compressed JSON file that serves as input for all subsequent tasks.

java -jar MUSIAL-v2.4.0.jar build -C <path-to-config.json>

Required argument:

  • -C, --configuration <arg>
    Path to a JSON file specifying the build configuration.
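When the build task is launched from a script, the command line above can be assembled programmatically. The following is a minimal sketch; the helper name `build_command` and its parameters are illustrative assumptions and not part of MUSIAL. The optional `tmpdir` argument maps to the standard JVM option `-Djava.io.tmpdir`, which has to precede `-jar`.

```python
def build_command(jar_path, config_path, tmpdir=None):
    """Assemble the argument list for the MUSIAL build task.

    Hypothetical helper: only the java call, the `build` task name and the
    -C flag are taken from the documentation; everything else is illustrative.
    """
    command = ["java"]
    if tmpdir is not None:
        # JVM options must come before -jar to be seen by the JVM.
        command.append(f"-Djava.io.tmpdir={tmpdir}")
    command += ["-jar", jar_path, "build", "-C", config_path]
    return command


# Example: print the command instead of running it.
print(" ".join(build_command("MUSIAL-v2.4.0.jar", "config.json")))
```

The returned list can be passed to `subprocess.run(..., check=True)` to execute the task.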

Configuration File

The build configuration is a JSON file that is validated against the build schema. A typical configuration looks like the following example:

{
  "reference": "reference.fasta",
  "annotation": "annotation.gff3",
  "features": "features.tsv",
  "vcfInput": ["sample1.vcf", "other_samples/"],
  "vcfMeta": "metadata.csv",
  "output": "musial_storage.json.gz",
  "minimalCoverage": 3.0,
  "minimalFrequency": 0.65,
  "storeFiltered": false,
  "skipSnpEff": false,
  "skipProteoformInference": true
}

The individual properties of the configuration file are described briefly below. The complete schema, including all matching patterns, is available in buildConfigurationSchema.json.

| Field | Description |
|---|---|
| reference | Path to a FASTA file that defines the reference sequences. |
| annotation | Path to a GFF3 file that defines features on the reference sequences. |
| features | Path to a .tsv/.csv file describing which GFF3 features to extract. |
| excludedPositions | BED/TSV/CSV file listing positions to exclude. |
| excludedVariants | TSV/CSV file listing variant calls to exclude. |
| vcfInput | List of VCF files or directories containing VCF files. |
| vcfMeta | CSV/TSV file with sample metadata. The first column must match the genotype/sample names in the VCF files. |
| output | Output path for the MUSIAL storage file. If a directory is provided, the file name defaults to musial_storage_<date>.json.gz. |
| minimalCoverage | Minimum total read depth for accepting a variant (default: 3). |
| minimalFrequency | Minimum fraction of variant-supporting reads at a position (default: 0.65). |
| storeFiltered | If true, filtered variants are retained with ambiguous content (N). |
| skipSnpEff | If true, skips variant effect prediction using SnpEff. |
| skipProteoformInference | If true, skips proteoform inference. |
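The interplay of minimalCoverage and minimalFrequency can be illustrated with a small sketch. It assumes that a call is accepted when the total read depth is at least minimalCoverage and the fraction of variant-supporting reads is at least minimalFrequency. Whether MUSIAL uses inclusive comparisons, and how it handles edge cases, is not stated here, so treat the function `passes_filters` as a hypothetical illustration rather than the actual implementation.

```python
def passes_filters(alt_reads, total_depth,
                   minimal_coverage=3.0, minimal_frequency=0.65):
    """Return True if a call meets both thresholds (assumed semantics)."""
    if total_depth <= 0:
        # No reads at this position: nothing can support the variant.
        return False
    if total_depth < minimal_coverage:
        return False
    # Fraction of reads supporting the alternative allele.
    return alt_reads / total_depth >= minimal_frequency


# Example calls with the documented default thresholds.
print(passes_filters(alt_reads=8, total_depth=10))  # depth and frequency sufficient
print(passes_filters(alt_reads=2, total_depth=2))   # depth below minimalCoverage
print(passes_filters(alt_reads=6, total_depth=10))  # frequency below minimalFrequency
```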

Data Inputs and Behavior

VCF Input (vcfInput)

  • Supports directories or VCF files. All VCF files contained in the specified directories are parsed.
  • All VCF input files should comply with the VCF v4.5 specification.
  • Each alternative genotype must have an AD info property. Each homozygous reference must have a DP info property.
  • Irrespective of the ploidy in the VCF files, only the most common alternative is retained as a variant.
  • Both single- and multi-sample (cohort) VCFs are supported.
    • If a variant call for a sample is present in several files, the read depths are totaled.
    • Sample names in a VCF file can be supplemented by a suffix $..., which merges listed variant calls or totals their read depths.
  • VCF files are temporarily copied into the system’s temporary directory for processing. This can be changed using: java -Djava.io.tmpdir=/path/to/tmpdir.
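The rules on totaling read depths and retaining only the most common alternative can be sketched as follows. The data layout (one dictionary of allele to supporting read count per VCF file) and the function names are assumptions made for illustration; MUSIAL's internal data model and tie-breaking behavior are not described on this page.

```python
from collections import Counter


def merge_calls(calls):
    """Total per-allele read counts for one sample and position.

    `calls` holds one dict per VCF file, mapping allele -> supporting reads
    (hypothetical layout, for illustration only).
    """
    totals = Counter()
    for allele_depths in calls:
        totals.update(allele_depths)
    return totals


def most_common_alternative(totals, reference_allele):
    """Return (allele, reads) of the best-supported non-reference allele."""
    alternatives = {a: n for a, n in totals.items() if a != reference_allele}
    if not alternatives:
        return None
    return max(alternatives.items(), key=lambda item: item[1])


# The same sample and position observed in two files.
totals = merge_calls([{"A": 2, "G": 5}, {"A": 1, "G": 4, "T": 1}])
print(sum(totals.values()))                  # total read depth across files
print(most_common_alternative(totals, "A"))  # best-supported alternative
```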

Metadata (vcfMeta)

  • Entries in the first column are matched against sample names in the VCF genotype fields.
  • Remaining columns can be arbitrary annotations and are stored as sample attributes.
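The matching described above can be sketched in a few lines. This example assumes that the metadata file has a header row whose column names become the attribute keys; that assumption, the function name `load_sample_attributes`, and the sample data are illustrative and not taken from MUSIAL.

```python
import csv
import io


def load_sample_attributes(handle, sample_names, delimiter=","):
    """Map each known sample name to its remaining metadata columns."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # assumed: first row holds the column names
    attributes = {}
    for row in reader:
        if not row:
            continue
        sample = row[0]
        # Only rows whose first column matches a VCF sample name are kept.
        if sample in sample_names:
            attributes[sample] = dict(zip(header[1:], row[1:]))
    return attributes


# Illustrative metadata; "unknown" does not occur in the VCF sample names.
metadata = io.StringIO("sample,country,year\nsample1,DE,2021\nunknown,FR,2020\n")
print(load_sample_attributes(metadata, {"sample1", "sample2"}))
```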

Reference and Annotation

  • GFF3 must comply with the GFF3 specification and contain correct parent-child hierarchy.
  • An index of the FASTA reference is created in memory.
  • GFF3 must not contain embedded FASTA sequence data.
  • FASTA files must not end with a double line break.
  • The following input rules apply:
| Condition | Behavior |
|---|---|
| No reference given | The CHROM information from VCF files is resolved into features, ranging from 1 to the largest position at which a variant was observed. |
| Only reference given | Each FASTA contig becomes a feature. |
| reference + annotation | All GFF3 entries are used. |
| reference + annotation + features | Only matched GFF3 entries are used. |
| annotation without reference | ❌ Not allowed. |
| features without annotation | ❌ Not allowed. |
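The two disallowed combinations can be checked before running the build task. The following is a minimal sketch that operates on a configuration loaded as a Python dictionary; the function `check_input_rules` is a hypothetical pre-flight check and does not replace MUSIAL's own schema validation.

```python
def check_input_rules(config):
    """Return a list of violated input rules (empty if none)."""
    errors = []
    if "annotation" in config and "reference" not in config:
        errors.append("annotation requires reference")
    if "features" in config and "annotation" not in config:
        errors.append("features requires annotation")
    return errors


# A configuration that names features but no annotation is rejected.
print(check_input_rules({"reference": "reference.fasta",
                         "features": "features.tsv"}))
```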

Feature Selection via GFF3

Features are defined using key-value pairs matched against the 9th column of the GFF3 annotation file.

Example GFF3:

Contig1	Genbank	gene	22085	24172	.	+	.	ID=gene-0230;Name=priA;gene=priA;gbkey=Gene;...
Contig1	Protein Homology	CDS	22085	24172	.	+	0	ID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS;product=primosomal protein N';...
Contig1	RefSeq	gene	38311	39945	.	-	.	ID=gene-0305;Name=groL;gene=groL;gbkey=Gene;...
Contig1	Protein Homology	CDS	38311	39945	.	-	0	ID=cds-0305;Parent=gene-0305;Name=P0305;gbkey=CDS;Ontology_term=GO:0006457,GO:0005515,GO:0016887;...

Matching features.tsv:

Name	priA
ID	gene-0305
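Key-value matching against the 9th GFF3 column can be sketched as follows. The code assumes exact string matching of a single key-value pair per selector line; the actual matching patterns are defined by the build schema, and the function names and selector values used here are illustrative only.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a key -> value dictionary."""
    attributes = {}
    for field in column9.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def select_features(gff3_lines, selectors):
    """Return the IDs of GFF3 entries matching any (key, value) selector."""
    selected = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a regular GFF3 feature line
        attributes = parse_attributes(columns[8])
        if any(attributes.get(key) == value for key, value in selectors):
            selected.append(attributes.get("ID"))
    return selected


# Shortened versions of the example entries, joined with tabs.
gff3 = [
    "\t".join(["Contig1", "Genbank", "gene", "22085", "24172", ".", "+", ".",
               "ID=gene-0230;Name=priA;gene=priA;gbkey=Gene"]),
    "\t".join(["Contig1", "Protein Homology", "CDS", "22085", "24172", ".", "+", "0",
               "ID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS"]),
    "\t".join(["Contig1", "RefSeq", "gene", "38311", "39945", ".", "-", ".",
               "ID=gene-0305;Name=groL;gene=groL;gbkey=Gene"]),
]
selectors = [("Name", "priA"), ("ID", "gene-0305")]
print(select_features(gff3, selectors))
```

With these selectors only the two gene entries are selected; the CDS entry matches neither pair.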

Output

  • A Gzip-compressed JSON file (.json.gz) is created at the output location.
  • Contains harmonized and filtered variant calls, annotations, and metadata.
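Because the storage file is ordinary Gzip-compressed JSON, it can be inspected with standard tooling. The sketch below loads such a file with Python's standard library. It makes no assumption about the internal structure of the storage and only lists the top-level keys; the file name is taken from the example configuration.

```python
import gzip
import json


def load_storage(path):
    """Read a Gzip-compressed JSON file and return the parsed content."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    storage = load_storage("musial_storage.json.gz")
    # Print the top-level keys to get a first overview of the content.
    print(sorted(storage))
```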