Task: build

Simon Hackl edited this page May 19, 2025 · 5 revisions

The build task is the initial step in any MUSIAL analysis. It aggregates and harmonizes variant calls into a MUSIAL storage file, a compressed JSON file that serves as input for subsequent tasks.

java -jar MUSIAL-v2.4.0.jar build -C <path-to-config.json>

Required argument:

  • -C, --configuration <arg>
    Path to a JSON file specifying the build configuration.
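
For scripted pipelines, the call above can be assembled programmatically. The following Python sketch only rebuilds the command line shown above; the helper names `musial_build_command` and `run_build` and the file paths are illustrative assumptions, not part of MUSIAL. The optional `-Djava.io.tmpdir` option is the JVM setting mentioned under VCF Input.

```python
import subprocess
from typing import List, Optional


def musial_build_command(jar: str, config: str, tmpdir: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `java -jar <jar> build -C <config>`."""
    command = ["java"]
    if tmpdir is not None:
        # JVM options have to precede -jar.
        command.append(f"-Djava.io.tmpdir={tmpdir}")
    command += ["-jar", jar, "build", "-C", config]
    return command


def run_build(jar: str, config: str, tmpdir: Optional[str] = None) -> None:
    """Run the build task; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(musial_build_command(jar, config, tmpdir), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_build(...) to execute it.
    print(" ".join(musial_build_command("MUSIAL-v2.4.0.jar", "config.json")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.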

Configuration File

The build configuration is a JSON file that is validated against the build schema. A typical configuration looks like this:

{
  "reference": "reference.fasta",
  "annotation": "annotation.gff3",
  "features": "features.tsv",
  "vcfInput": ["sample1.vcf", "other_samples/"],
  "vcfMeta": "metadata.csv",
  "output": "musial_storage.json.gz",
  "minimalCoverage": 3.0,
  "minimalFrequency": 0.65,
  "storeFiltered": false,
  "skipSnpEff": false,
  "skipProteoformInference": true
}

The individual properties of the configuration file are described briefly below. A comprehensive overview, including all matching patterns, is available in buildConfigurationSchema.json.

Field                     Description
reference                 Path to a FASTA file that defines the reference sequences.
annotation                Path to a GFF3 file that defines features on the reference sequences.
features                  Path to a TSV/CSV file describing which GFF3 features to extract.
excludedPositions         BED/TSV/CSV file listing positions to exclude.
excludedVariants          TSV/CSV file listing variant calls to exclude.
vcfInput*                 List of VCF files or directories containing VCF files.
vcfMeta                   CSV/TSV file with sample metadata. The first column must match the genotype/sample names in the VCF files.
output*                   Output path for the MUSIAL storage file. If a directory is provided, the file name defaults to musial_storage_<date>.json.gz.
minimalCoverage           Minimum total read depth for accepting a variant (default: 3).
minimalFrequency          Minimum fraction of variant-supporting reads at a position (default: 0.65).
storeFiltered             If true, filtered variants are retained with ambiguous content N (default: false).
skipSnpEff                If true, skips variant effect prediction using SnpEff (default: false).
skipProteoformInference   If true, skips proteoform inference (default: false).

* Required
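
When configurations are generated by a script, the required fields and documented defaults can be applied before the file is written. The sketch below is a minimal illustration, not a replacement for validation against buildConfigurationSchema.json; `make_build_config`, `DEFAULTS` and `REQUIRED` are hypothetical names, and the default values are copied from the table above.

```python
import json
from typing import Any, Dict

# Defaults as listed in the property table; not read from the real schema.
DEFAULTS: Dict[str, Any] = {
    "minimalCoverage": 3,
    "minimalFrequency": 0.65,
    "storeFiltered": False,
    "skipSnpEff": False,
    "skipProteoformInference": False,
}
REQUIRED = ("vcfInput", "output")


def make_build_config(**fields: Any) -> Dict[str, Any]:
    """Return a configuration dict with the documented defaults filled in."""
    missing = [key for key in REQUIRED if key not in fields]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    if not isinstance(fields["vcfInput"], list):
        raise ValueError("vcfInput must be a list of files or directories")
    # Explicitly passed fields override the defaults.
    return {**DEFAULTS, **fields}


if __name__ == "__main__":
    config = make_build_config(
        vcfInput=["sample1.vcf", "other_samples/"],
        output="musial_storage.json.gz",
        reference="reference.fasta",
    )
    print(json.dumps(config, indent=2))
```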

Data Inputs and Behavior

VCF Input (vcfInput)

  • Supports directories or VCF files. All VCF files contained in the specified directories are parsed.
  • All VCF input files should comply with the VCF v4.5 specification.
  • Each alternative genotype must have an AD info property. Each homozygous reference must have a DP info property.
  • Irrespective of the ploidy in the VCF files, only the most common alternative is retained as a variant.
  • Both single- and multi-sample (cohort) VCFs are supported.
    • If a variant call for a sample is present in several files, the read depths are totaled.
    • Sample names in a VCF file can be supplemented by a suffix $..., which merges listed variant calls or totals their read depths.
  • VCF files are temporarily copied into the system’s temporary directory for processing. To use a different location, start Java with -Djava.io.tmpdir=/path/to/tmpdir.
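
The interplay of depth totaling, the "most common alternative" rule and the two thresholds can be illustrated with a small Python sketch. This is not MUSIAL's implementation. It assumes that per-allele read depths (as in AD) are available as a mapping from allele to depth, that the total depth is the sum over all alleles, and that the frequency is the alternative's depth divided by that total. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional, Tuple


def total_depths(calls: Iterable[Mapping[str, int]]) -> Counter:
    """Total the per-allele read depths of one sample and site across several records."""
    totals: Counter = Counter()
    for allele_depths in calls:
        totals.update(allele_depths)  # Counter.update adds counts
    return totals


def select_variant(
    depths: Mapping[str, int],
    ref: str,
    minimal_coverage: float = 3,
    minimal_frequency: float = 0.65,
) -> Optional[Tuple[str, float]]:
    """Return (allele, frequency) of the most common alternative, or None if it is filtered."""
    alternatives = {allele: d for allele, d in depths.items() if allele != ref}
    total = sum(depths.values())
    if not alternatives or total == 0:
        return None
    allele, depth = max(alternatives.items(), key=lambda item: item[1])
    frequency = depth / total
    if total < minimal_coverage or frequency < minimal_frequency:
        return None
    return allele, frequency


if __name__ == "__main__":
    # Same sample and site reported in two files (assumed example values).
    merged = total_depths([{"A": 2, "T": 6}, {"A": 1, "T": 5, "G": 1}])
    print(select_variant(merged, ref="A"))
```

In this example the alternative T is supported by 11 of 15 reads, so it passes both default thresholds, while G is dropped because it is not the most common alternative.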

Metadata (vcfMeta)

  • Entries in the first column are matched against sample names in the VCF genotype fields.
  • Remaining columns can be arbitrary annotations and are stored as sample attributes.
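
A metadata file of this shape can be checked before the build with a few lines of Python. The sketch assumes that the file has a header row naming the attribute columns; the function name `read_sample_metadata` and the example columns are made up for illustration.

```python
import csv
import io
from typing import Dict, TextIO


def read_sample_metadata(handle: TextIO, delimiter: str = ",") -> Dict[str, Dict[str, str]]:
    """Map each sample name (first column) to its remaining columns."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # assumed header row, e.g. sample,country,year
    metadata: Dict[str, Dict[str, str]] = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        metadata[row[0]] = dict(zip(header[1:], row[1:]))
    return metadata


if __name__ == "__main__":
    example = io.StringIO("sample,country,year\nsample1,DE,2021\nsample2,FR,2023\n")
    for name, attributes in read_sample_metadata(example).items():
        print(name, attributes)
```

Comparing the returned keys with the sample names in the VCF files reveals metadata rows that would not be matched. For TSV files, pass `delimiter="\t"`.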

Reference and Annotation

  • GFF3 must comply with the GFF3 specification and contain a correct parent-child hierarchy.
  • An index of the FASTA reference is created in memory.
  • GFF3 must not contain embedded FASTA sequence data.
  • FASTA files must not end with a double line break.
  • The following input rules apply:
Condition                           Behavior
No reference given                  The CHROM information from VCF files is resolved into features, ranging from 1 to the largest position at which a variant was observed.
Only reference given                Each FASTA contig becomes a feature.
reference + annotation              All GFF3 entries are used.
reference + annotation + features   Only matched GFF3 entries are used.
annotation without reference        ❌ Not allowed.
features without annotation         ❌ Not allowed.
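
The rules in this table are simple enough to check before MUSIAL is started. The following Python sketch encodes them for a configuration that has already been loaded as a dict. It mirrors the table only and says nothing about how MUSIAL performs the check internally; both function names are hypothetical.

```python
from typing import Any, Dict, List


def check_input_rules(config: Dict[str, Any]) -> List[str]:
    """Return the violated 'not allowed' combinations, if any."""
    errors = []
    if "annotation" in config and "reference" not in config:
        errors.append("annotation given without reference")
    if "features" in config and "annotation" not in config:
        errors.append("features given without annotation")
    return errors


def feature_source(config: Dict[str, Any]) -> str:
    """Name the source of features for a valid configuration."""
    errors = check_input_rules(config)
    if errors:
        raise ValueError("; ".join(errors))
    if "reference" not in config:
        return "CHROM values of the VCF files"
    if "annotation" not in config:
        return "FASTA contigs"
    if "features" not in config:
        return "all GFF3 entries"
    return "matched GFF3 entries"


if __name__ == "__main__":
    print(feature_source({"reference": "reference.fasta", "annotation": "annotation.gff3"}))
```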

Feature Selection via GFF3

Features are defined using key-value pairs matched against the 9th column of the GFF3 annotation file.

Example GFF3:

Contig1	Genbank	gene	22085	24172	.	+	.	ID=gene-0230;Name=priA;gene=priA;gbkey=Gene;...
Contig1	Protein Homology	CDS	22085	24172	.	+	0	ID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS;product=primosomal protein N';...
Contig1	RefSeq	gene	38311	39945	.	-	.	ID=gene-0305;Name=groL;gene=groL;gbkey=Gene;...
Contig1	Protein Homology	CDS	38311	39945	.	+	0	ID=cds-0305;Parent=gene-0305;Name=P0305;gbkey=CDS;Ontology_term=GO:0006457,GO:0005515,GO:0016887;...

Matching features.tsv:

Name	priA
ID	gene-0455
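
To preview which GFF3 records a features file would select, the key-value matching can be imitated in Python. The sketch assumes exact, case-sensitive equality between a selector value and the attribute value; MUSIAL's actual matching patterns are defined in the build schema and may be more permissive. How child records of a matched feature are handled is not covered here, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split the 9th GFF3 column (key=value pairs separated by ';') into a dict."""
    attributes: Dict[str, str] = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return attributes


def match_features(gff3_lines: Iterable[str], selectors: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the GFF3 lines whose attributes contain at least one selector pair."""
    wanted = list(selectors)
    matched = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GFF3 record
        attributes = parse_attributes(columns[8])
        if any(attributes.get(key) == value for key, value in wanted):
            matched.append(line)
    return matched


if __name__ == "__main__":
    gene = "Contig1\tGenbank\tgene\t22085\t24172\t.\t+\t.\tID=gene-0230;Name=priA;gene=priA;gbkey=Gene"
    cds = "Contig1\tProtein Homology\tCDS\t22085\t24172\t.\t+\t0\tID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS"
    for record in match_features([gene, cds], [("Name", "priA")]):
        print(record)
```

With the selector Name priA, only the gene record is returned directly, because the CDS record carries Name=P0230.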

Output

  • A gzip-compressed JSON file (.json.gz) is created at the output location.
  • The file contains harmonized and filtered variant calls, annotations, and metadata.
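
Because the storage file is plain gzip-compressed JSON, it can be opened with standard tooling for a quick inspection. The Python sketch below only loads the file; it makes no assumption about the internal structure of the storage, and the path is a placeholder.

```python
import gzip
import json
from typing import Any


def load_storage(path: str) -> Any:
    """Decompress and parse a .json.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    storage = load_storage("musial_storage.json.gz")  # placeholder path
    if isinstance(storage, dict):
        # List the top-level keys without interpreting them.
        print(sorted(storage))
```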

Processing Details

How MUSIAL Parses and Processes Features

Matching and Unique Identification: MUSIAL extracts features from the 9th column of a GFF3 annotation file using attribute key-value matching (e.g., Name). After matching, MUSIAL assigns a unique identifier (UID) to each feature. The UID is taken from the first available of the following, in order of priority: the Parent attribute, the ID attribute, the locus_tag attribute, or the feature location (CONTIG:START..END). This UID is used to consolidate multiple entries and to ensure a consistent reference across child/parent relationships.
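
The priority order for the UID can be written down as a short fallback chain. The Python sketch below is an illustration of that order under the assumption that the attributes are available as a dict; `feature_uid` is a hypothetical name and not MUSIAL's code.

```python
from typing import Dict


def feature_uid(contig: str, start: int, end: int, attributes: Dict[str, str]) -> str:
    """Pick a UID: Parent, then ID, then locus_tag, then the location."""
    for key in ("Parent", "ID", "locus_tag"):
        value = attributes.get(key)
        if value:
            return value
    return f"{contig}:{start}..{end}"


if __name__ == "__main__":
    gene = {"ID": "gene-0230", "Name": "priA"}
    cds = {"ID": "cds-0230", "Parent": "gene-0230", "Name": "P0230"}
    print(feature_uid("Contig1", 22085, 24172, gene))
    print(feature_uid("Contig1", 22085, 24172, cds))
```

For the priA records from the example GFF3, both the gene (via ID) and its CDS (via Parent) resolve to gene-0230, which is what allows the two entries to be consolidated.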

Hierarchy Construction: MUSIAL reconstructs a Sequence Ontology (SO)-compliant hierarchy from the parsed features. Each feature's type is validated against supported ontology levels, currently comprising: region; gene, pseudogene; mRNA, tRNA, rRNA, tmRNA, ncRNA, SRP_RNA, RNase_P_RNA; CDS, exon. Parent–child relationships (e.g., gene → mRNA → CDS) are restored or inferred based on Parent attributes or logical inference rules, e.g., the gene of a CDS must include the position interval of the CDS.
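
The type validation and the interval rule can be sketched as follows. The grouping of the supported types into four levels follows the semicolons in the list above and is an interpretation, as is the choice to accept any higher level as a parent (for example, a CDS directly below a gene). The names `SO_LEVELS`, `level_of` and `may_be_parent` are hypothetical.

```python
from typing import Optional, Tuple

# Supported types, grouped by hierarchy level (top to bottom).
SO_LEVELS = (
    ("region",),
    ("gene", "pseudogene"),
    ("mRNA", "tRNA", "rRNA", "tmRNA", "ncRNA", "SRP_RNA", "RNase_P_RNA"),
    ("CDS", "exon"),
)


def level_of(feature_type: str) -> Optional[int]:
    """Return the hierarchy level of a type, or None if it is unsupported."""
    for level, types in enumerate(SO_LEVELS):
        if feature_type in types:
            return level
    return None


def may_be_parent(
    parent_type: str,
    parent_span: Tuple[int, int],
    child_type: str,
    child_span: Tuple[int, int],
) -> bool:
    """Check that the parent is on a higher level and fully contains the child interval."""
    parent_level = level_of(parent_type)
    child_level = level_of(child_type)
    if parent_level is None or child_level is None:
        return False
    contains = parent_span[0] <= child_span[0] and child_span[1] <= parent_span[1]
    return parent_level < child_level and contains


if __name__ == "__main__":
    print(may_be_parent("gene", (22085, 24172), "CDS", (22085, 24172)))
```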

Feature Correction and Imputation: Features with incomplete hierarchies are auto-corrected: if only a CDS is matched, MUSIAL infers the corresponding mRNA and gene features. Redundant or ambiguous entries (e.g., multiple conflicting mRNA/CDS entries) are cleaned up. Features with unsupported SO types or invalid locations are discarded with a warning.

The final feature set includes cleaned, uniquely identified, SO-validated elements ready for downstream use.