Task: build

The build task is the first step of any MUSIAL analysis. It aggregates and harmonizes variant calls into a MUSIAL storage file, a compressed JSON file that serves as input for subsequent tasks.

java -jar MUSIAL-v2.4.0.jar build -C <path-to-config.json>

Required argument:

-C, --configuration <arg>
Path to a JSON file specifying the build configuration.
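
For scripted runs, the invocation above can be assembled programmatically. The following is a minimal sketch, not part of MUSIAL: the helper name `build_command` is hypothetical, and the optional `tmpdir` argument maps to the `-Djava.io.tmpdir` JVM option mentioned in the VCF input notes further down. JVM options have to appear before `-jar`.

```python
def build_command(jar_path, config_path, tmpdir=None):
    """Return the argument list for a MUSIAL build run (hypothetical helper)."""
    command = ["java"]
    if tmpdir is not None:
        # JVM system properties must precede -jar to take effect.
        command.append(f"-Djava.io.tmpdir={tmpdir}")
    command += ["-jar", jar_path, "build", "-C", config_path]
    return command


print(" ".join(build_command("MUSIAL-v2.4.0.jar", "config.json")))
```

The returned list can be passed to `subprocess.run(..., check=True)` so that a failing build raises an exception in the calling script.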

Configuration File

The build configuration is a JSON file that is validated against the build schema. A typical configuration looks like this:

{
  "reference": "reference.fasta",
  "annotation": "annotation.gff3",
  "features": "features.tsv",
  "vcfInput": ["sample1.vcf", "other_samples/"],
  "vcfMeta": "metadata.csv",
  "output": "musial_storage.json.gz",
  "minimalCoverage": 3.0,
  "minimalFrequency": 0.65,
  "storeFiltered": false,
  "skipSnpEff": false,
  "skipProteoformInference": true
}

The properties of the configuration file are described briefly below. A comprehensive overview, including all matching patterns, is available in buildConfigurationSchema.json.

Field	Description
`reference`	Path to a FASTA file, defines reference sequences.
`annotation`	Path to a GFF3 file, defines features on the reference sequence.
`features`	Path to a `.tsv`/`.csv` file describing which GFF3 features to extract.
`excludedPositions`	BED/TSV/CSV file listing positions to exclude.
`excludedVariants`	TSV/CSV file listing variant calls to exclude.
`vcfInput`*	List of VCF files or directories containing VCFs.
`vcfMeta`	CSV/TSV file with sample metadata. The first column must match the sample (genotype) names in the VCF files.
`output`*	Output path for the MUSIAL storage file. If a directory is provided, the file name defaults to `musial_storage_<date>.json.gz`.
`minimalCoverage`	Minimum total read depth for accepting a variant (default: 3).
`minimalFrequency`	Minimum fraction of variant-supporting reads at a position (default: 0.65).
`storeFiltered`	If true, filtered variants are retained with ambiguous content N (default: false).
`skipSnpEff`	If true, skips variant effect prediction using SnpEff (default: false).
`skipProteoformInference`	If true, skips proteoform inference (default: false).

* Required
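
A configuration can also be generated from a script. The sketch below only checks the two fields marked as required in the table and writes the JSON file; it does not replace validation against the build schema, and the helper names `missing_required` and `write_build_config` are hypothetical.

```python
import json

# Fields marked as required (*) in the table above.
REQUIRED_FIELDS = ("vcfInput", "output")


def missing_required(config):
    """Return the required fields that are absent or empty in config."""
    return [key for key in REQUIRED_FIELDS if not config.get(key)]


def write_build_config(config, path):
    """Write config as JSON after a basic required-field check."""
    missing = missing_required(config)
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)


config = {
    "vcfInput": ["sample1.vcf", "other_samples/"],
    "output": "musial_storage.json.gz",
    "minimalCoverage": 3.0,
    "minimalFrequency": 0.65,
}
print(missing_required(config))
```

Calling `write_build_config(config, "config.json")` then produces a file that can be passed to `-C`.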

Data Inputs and Behavior

VCF Input (`vcfInput`)

- Accepts VCF files and directories. All VCF files contained in the specified directories are parsed.
- All VCF input files should comply with the VCF v4.5 specification.
- Each alternative genotype must have an AD info property. Each homozygous reference must have a DP info property.
- Irrespective of the ploidy in the VCF files, only the most common alternative is retained as a variant.
- Both single- and multi-sample (cohort) VCFs are supported.
  - If a variant call for a sample is present in several files, the read depths are totaled.
  - Sample names in a VCF file can be supplemented by a suffix $..., which merges listed variant calls or totals their read depths.
- VCF files are temporarily copied into the system's temporary directory for processing. This location can be changed with `java -Djava.io.tmpdir=/path/to/tmpdir`.
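
To make the interplay of totaled read depths with `minimalCoverage` and `minimalFrequency` concrete, here is a minimal sketch. It is an illustration under assumptions, not MUSIAL's implementation: the function name `passes_filters` is hypothetical, depths are totaled across files before filtering, and both thresholds are treated as inclusive.

```python
def passes_filters(observations, minimal_coverage=3.0, minimal_frequency=0.65):
    """Decide whether a variant call is accepted (illustrative only).

    observations: list of (alt_reads, total_reads) pairs for one sample and
    one position, one pair per VCF file in which the call appears.
    """
    # Assumption: depths from several files are summed before filtering.
    alt_reads = sum(alt for alt, _ in observations)
    total_reads = sum(total for _, total in observations)
    if total_reads == 0:
        return False
    frequency = alt_reads / total_reads
    # Assumption: both thresholds are inclusive lower bounds.
    return total_reads >= minimal_coverage and frequency >= minimal_frequency


# Two files each report the call with low depth; together they pass.
print(passes_filters([(2, 2), (2, 3)]))
```

In this example neither file alone reaches both thresholds comfortably, but the totaled depth of 5 with 4 supporting reads (frequency 0.8) does.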

Metadata (`vcfMeta`)

Entries in the first column are matched against sample names in the VCF genotype fields.
Remaining columns can be arbitrary annotations and are stored as sample attributes.
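
The matching rule can be checked before a build. The sketch below reads a metadata table and reports entries that have no counterpart among the VCF sample names. It assumes the file has a header row whose remaining columns name the attributes; the sample names and columns shown are made up for illustration.

```python
import csv
import io


def read_sample_metadata(handle, delimiter=","):
    """Map each first-column entry to a dict of its remaining columns."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # assumption: the first row is a header
    attributes = {}
    for row in reader:
        if not row:
            continue
        attributes[row[0]] = dict(zip(header[1:], row[1:]))
    return attributes


def unmatched_samples(metadata, vcf_samples):
    """Return metadata entries without a matching VCF sample name."""
    return sorted(set(metadata) - set(vcf_samples))


# Hypothetical metadata content; real files are read with open(path, newline="").
example = io.StringIO("sample,country,year\nsample1,DE,2019\nsample2,FR,2021\n")
metadata = read_sample_metadata(example)
print(unmatched_samples(metadata, ["sample1"]))
```

For a TSV file, pass `delimiter="\t"`.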

Reference and Annotation

- The GFF3 file must comply with the GFF3 specification and contain a correct parent-child hierarchy.
- An index of the FASTA reference is created in memory.
- The GFF3 file must not contain embedded FASTA sequence data.
- FASTA files must not end with a double line break.
The following input rules apply:

Condition	Behavior
No `reference` given	The `CHROM` information from VCF files is resolved into features, ranging from 1 to the largest position at which a variant was observed.
Only `reference` given	Each FASTA contig becomes a feature.
`reference` + `annotation`	All GFF3 entries are used.
`reference` + `annotation` + `features`	Only matched GFF3 entries are used.
`annotation` without `reference`	❌ Not allowed.
`features` without `annotation`	❌ Not allowed.
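
The rules in the table can be expressed as a small decision function. This is a sketch for checking a configuration before running the build; the function name and the returned descriptions are illustrative and not MUSIAL output.

```python
def describe_feature_source(config):
    """Return how features are derived for a given configuration dict."""
    has_reference = bool(config.get("reference"))
    has_annotation = bool(config.get("annotation"))
    has_features = bool(config.get("features"))

    # The two combinations that the table marks as not allowed.
    if has_annotation and not has_reference:
        raise ValueError("`annotation` requires `reference`")
    if has_features and not has_annotation:
        raise ValueError("`features` requires `annotation`")

    if not has_reference:
        return "features resolved from VCF CHROM values"
    if not has_annotation:
        return "one feature per FASTA contig"
    if not has_features:
        return "all GFF3 entries"
    return "matched GFF3 entries only"


print(describe_feature_source({"reference": "reference.fasta"}))
```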

Feature Selection via GFF3

Features are defined using key-value pairs matched against the 9th column of the GFF3 annotation file.

Example GFF3:

Contig1	Genbank	gene	22085	24172	.	+	.	ID=gene-0230;Name=priA;gene=priA;gbkey=Gene;...
Contig1	Protein Homology	CDS	22085	24172	.	+	0	ID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS;product=primosomal protein N';...
Contig1	RefSeq	gene	38311	39945	.	-	.	ID=gene-0305;Name=groL;gene=groL;gbkey=Gene;...
Contig1	Protein Homology	CDS	38311	39945	.	+	0	ID=cds-0305;Parent=gene-0305;Name=P0305;gbkey=CDS;Ontology_term=GO:0006457,GO:0005515,GO:0016887;...

Matching features.tsv:

Name	priA
ID	gene-0305
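
The key-value matching can be sketched as follows. The code parses the 9th GFF3 column into attributes and keeps the entries for which at least one key-value pair matches. Exact string equality is an assumption; the patterns MUSIAL actually accepts are defined by the build schema. The example lines are shortened versions of the GFF3 excerpt above, with the elided attributes left out.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a key -> value dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return attributes


def select_features(gff3_lines, criteria):
    """Return the column lists of entries matching any (key, value) pair."""
    selected = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        attributes = parse_attributes(columns[8])
        if any(attributes.get(key) == value for key, value in criteria):
            selected.append(columns)
    return selected


EXAMPLE_GFF3 = [
    "Contig1\tGenbank\tgene\t22085\t24172\t.\t+\t.\tID=gene-0230;Name=priA;gene=priA;gbkey=Gene",
    "Contig1\tProtein Homology\tCDS\t22085\t24172\t.\t+\t0\tID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS",
    "Contig1\tRefSeq\tgene\t38311\t39945\t.\t-\t.\tID=gene-0305;Name=groL;gene=groL;gbkey=Gene",
]

for columns in select_features(EXAMPLE_GFF3, [("Name", "priA")]):
    print(columns[2], columns[3], columns[4])
```

With the criterion `Name=priA`, only the first gene entry is selected; its CDS carries a different `Name` and is not matched directly.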

Output

A gzip-compressed JSON file (.json.gz) is created at the output location.
It contains the harmonized and filtered variant calls, annotations, and metadata.
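
Because the storage file is plain gzip-compressed JSON, it can be inspected with standard tooling. The sketch below loads such a file in Python; it makes no assumption about the internal structure of the storage, and the helper name `load_storage` is hypothetical.

```python
import gzip
import json


def load_storage(path):
    """Load a gzip-compressed JSON file and return the parsed content."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_storage("musial_storage.json.gz")` returns the parsed JSON document, whose top-level keys can then be listed to explore the content.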

Processing Details

How MUSIAL Parses and Processes Features

Matching and Unique Identification: MUSIAL extracts features from the 9th column of a GFF3 annotation file using attribute key-value matching (e.g., Name). After matching, MUSIAL assigns a unique identifier (UID) to each feature, taken from the Parent, ID, or locus_tag attribute or, failing those, from the location (CONTIG:START..END), in that order of priority. This UID is used to consolidate multiple entries and to ensure a consistent reference across child/parent relationships.
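
The priority order for the UID can be illustrated with a short sketch. It assumes that the attribute value is used verbatim as the UID, which is not confirmed here; the function name `feature_uid` is hypothetical.

```python
def feature_uid(attributes, contig, start, end):
    """Pick a UID following the priority Parent > ID > locus_tag > location."""
    for key in ("Parent", "ID", "locus_tag"):
        value = attributes.get(key)
        if value:
            return value
    # Fallback: location in the form CONTIG:START..END.
    return f"{contig}:{start}..{end}"


gene = {"ID": "gene-0230", "Name": "priA"}
cds = {"ID": "cds-0230", "Parent": "gene-0230", "Name": "P0230"}
print(feature_uid(gene, "Contig1", 22085, 24172))
print(feature_uid(cds, "Contig1", 22085, 24172))
```

Under this reading, the gene and its CDS resolve to the same UID, which is what allows child and parent entries to be consolidated.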

Hierarchy Construction: MUSIAL reconstructs a Sequence Ontology (SO)-compliant hierarchy from the parsed features. Each feature's type is validated against supported ontology levels, currently comprising: region; gene, pseudogene; mRNA, tRNA, rRNA, tmRNA, ncRNA, SRP_RNA, RNase_P_RNA; CDS, exon. Parent–child relationships (e.g., gene → mRNA → CDS) are restored or inferred based on Parent attributes or logical inference rules, e.g., the gene of a CDS must include the position interval of the CDS.
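
The ontology levels and the interval rule can be sketched like this. The level grouping follows the list above; the check itself is a simplified illustration that only compares levels and coordinates and ignores contig, strand, and Parent attributes, which a real implementation would likely consider.

```python
# Supported feature types, grouped by hierarchy level (top to bottom).
SO_LEVELS = [
    {"region"},
    {"gene", "pseudogene"},
    {"mRNA", "tRNA", "rRNA", "tmRNA", "ncRNA", "SRP_RNA", "RNase_P_RNA"},
    {"CDS", "exon"},
]


def level_of(feature_type):
    """Return the hierarchy level of a type, or None if unsupported."""
    for index, types in enumerate(SO_LEVELS):
        if feature_type in types:
            return index
    return None


def can_be_parent(parent, child):
    """Check level order and that the parent interval contains the child."""
    parent_level = level_of(parent["type"])
    child_level = level_of(child["type"])
    if parent_level is None or child_level is None:
        return False
    contains = parent["start"] <= child["start"] and child["end"] <= parent["end"]
    return parent_level < child_level and contains


gene = {"type": "gene", "start": 22085, "end": 24172}
cds = {"type": "CDS", "start": 22085, "end": 24172}
print(can_be_parent(gene, cds))
```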

Feature Correction and Imputation: Features with incomplete hierarchies are auto-corrected. For example, if only a CDS is matched, MUSIAL infers a corresponding mRNA and gene feature. Redundant or ambiguous entries (e.g., multiple conflicting mRNA/CDS entries) are cleaned up. Features with unsupported SO types or invalid locations are discarded with warnings.

The final feature set includes cleaned, uniquely identified, SO-validated elements ready for downstream use.
