Task: build

Simon Hackl edited this page May 19, 2025 · 5 revisions

The build task is the initial step in any MUSIAL analysis. It aggregates and harmonizes variant calls into a MUSIAL storage file, a compressed JSON file that serves as input for subsequent tasks.

java -jar MUSIAL-v2.4.0.jar build -C <path-to-config.json>

Required argument:

  • -C, --configuration <arg>
    Path to a JSON file specifying the build configuration.
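
When the build task is launched from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of MUSIAL: the helper name `build_command` and its parameters are hypothetical. It only reproduces the invocation shown above and, optionally, the standard JVM system property `java.io.tmpdir`, which has to be placed before `-jar` to be read by the JVM.

```python
from typing import List, Optional


def build_command(config: str,
                  jar: str = "MUSIAL-v2.4.0.jar",
                  tmpdir: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `java -jar <jar> build -C <config>`.

    Hypothetical helper for illustration; it does not run anything.
    """
    cmd = ["java"]
    if tmpdir is not None:
        # JVM options must precede -jar; anything after the jar name
        # is handed to the application as a program argument instead.
        cmd.append(f"-Djava.io.tmpdir={tmpdir}")
    cmd += ["-jar", jar, "build", "-C", config]
    return cmd


# Print the command line instead of executing it.
print(" ".join(build_command("config.json", tmpdir="/path/to/tmpdir")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` if the command should be executed.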

⚙️ Configuration File

The build configuration is a JSON file validated against the build schema and should look somewhat like the following example:

{
  "reference": "reference.fasta",
  "annotation": "annotation.gff3",
  "features": "features.tsv",
  "vcfInput": ["sample1.vcf", "other_samples/"],
  "vcfMeta": "metadata.csv",
  "output": "musial_storage.json.gz",
  "minimalCoverage": 3.0,
  "minimalFrequency": 0.65,
  "storeFiltered": false,
  "skipSnpEff": false,
  "skipProteoformInference": true
}

The individual properties of the configuration file are described briefly below. A comprehensive overview with all matching patterns is available in buildConfigurationSchema.json.

| Field | Description |
| --- | --- |
| reference | Path to a FASTA file, defines reference sequences. |
| annotation | Path to a GFF3 file, defines features on the reference sequence. |
| features | Path to a .tsv/.csv file describing which GFF3 features to extract. |
| excludedPositions | BED/TSV/CSV file listing positions to exclude. |
| excludedVariants | TSV/CSV file listing variant calls to exclude. |
| vcfInput | List of VCF files or directories containing VCFs. |
| vcfMeta | CSV/TSV file with sample metadata. The first column must match genotypes/sample names in VCF. |
| output | Output path for the MUSIAL storage file. Defaults to musial_storage_<date>.json.gz, if a directory is provided. |
| minimalCoverage | Minimum total read depth for accepting a variant (default: 3). |
| minimalFrequency | Minimum fraction of variant-supporting reads at a position (default: 0.65). |
| storeFiltered | If true, filtered variants are retained with ambiguous content (N). |
| skipSnpEff | If true, skips variant effect prediction using SnpEff. |
| skipProteoformInference | If true, skips proteoform inference. |
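
To make the interplay of `minimalCoverage` and `minimalFrequency` concrete, here is a rough sketch of how such thresholds could be applied to a single call. It is not MUSIAL's implementation: the function name `passes_filters` is hypothetical, and treating both thresholds as inclusive (`>=`) is an assumption.

```python
def passes_filters(total_depth: int,
                   variant_depth: int,
                   minimal_coverage: float = 3.0,
                   minimal_frequency: float = 0.65) -> bool:
    """Return True if a call meets both thresholds.

    Assumption: both comparisons are inclusive; the defaults mirror
    the documented defaults of the build configuration.
    """
    if total_depth <= 0:
        return False  # no reads, nothing to accept
    frequency = variant_depth / total_depth
    return total_depth >= minimal_coverage and frequency >= minimal_frequency


# 7 of 10 reads support the variant: depth and frequency are sufficient.
print(passes_filters(total_depth=10, variant_depth=7))
# Only 2 reads in total: rejected by the coverage threshold.
print(passes_filters(total_depth=2, variant_depth=2))
```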

📁 Data Inputs and Behavior

VCF Input (vcfInput):

  • Supports directories or VCF files. All VCF files contained in the specified directories are parsed.
  • Both single- and multi-sample (cohort) VCFs are supported.
    • If a variant call for a sample is present in several files, the read depths are totaled.
    • Sample names in a VCF file can be supplemented by a suffix $..., which merges listed variant calls or totals their read depths.
  • VCF files are temporarily copied into the system’s temporary directory for processing. This can be changed using: java -Djava.io.tmpdir=/path/to/tmpdir.
  • All specified VCF input files should comply with the VCF v4.5 specification.
  • Each alternative genotype must have an AD info property. Each homozygous reference must have a DP info property.
  • Irrespective of the ploidy in the VCF files, only the most common alternative is retained as a variant.
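
The totaling of read depths across files and the selection of the most common alternative can be pictured as follows. This is a sketch under stated assumptions, not MUSIAL's code: each call for one sample and position is represented as a hypothetical mapping from allele to read depth (as taken from AD), and the tie-breaking behavior is arbitrary.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def total_allele_depths(calls: Iterable[Dict[str, int]]) -> Counter:
    """Sum per-allele read depths of the same sample/position over several files."""
    total: Counter = Counter()
    for call in calls:
        total.update(call)  # Counter.update adds counts for mapping input
    return total


def most_common_alternative(depths: Dict[str, int],
                            reference: str) -> Optional[Tuple[str, int]]:
    """Return (allele, depth) of the best-supported non-reference allele, if any."""
    alternatives = [(a, d) for a, d in depths.items() if a != reference]
    if not alternatives:
        return None
    # Assumption: on ties the first allele encountered wins.
    return max(alternatives, key=lambda item: item[1])


# The same sample and position reported in two hypothetical VCF files.
calls = [{"A": 2, "G": 5}, {"A": 1, "G": 4, "T": 3}]
depths = total_allele_depths(calls)
print(most_common_alternative(depths, reference="A"))
```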

Metadata (vcfMeta):

  • Entries in the first column are matched against sample names in the VCF genotype fields.
  • Remaining columns can be arbitrary annotations and are stored as sample attributes.
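
A quick pre-check of a metadata file against the sample names in the VCF input might look like the sketch below. The function `load_sample_attributes` is hypothetical, and the sketch assumes that the metadata file has a header row naming the attribute columns, which is not stated above.

```python
import csv
import io
from typing import Dict, Set


def load_sample_attributes(meta_text: str,
                           vcf_samples: Set[str],
                           delimiter: str = ",") -> Dict[str, Dict[str, str]]:
    """Map each sample found in the VCFs to its metadata attributes.

    Assumption: the first row is a header; the first column holds sample names.
    """
    reader = csv.reader(io.StringIO(meta_text), delimiter=delimiter)
    header = next(reader)
    attributes: Dict[str, Dict[str, str]] = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        sample = row[0]
        if sample in vcf_samples:
            attributes[sample] = dict(zip(header[1:], row[1:]))
    return attributes


# Hypothetical metadata with two arbitrary annotation columns.
meta = "sample,country,year\nsample1,DE,2020\nsampleX,FR,2021\n"
matched = load_sample_attributes(meta, vcf_samples={"sample1", "sample2"})
print(matched)
```

Rows whose first column matches no VCF sample (here `sampleX`) are simply ignored in this sketch; how MUSIAL treats them is not specified here.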

Reference and Annotation:

  • GFF3 must comply with the GFF3 specification and contain correct parent-child hierarchy.
  • An index of the FASTA reference is created in memory.
  • GFF3 must not contain embedded FASTA sequence data.
  • FASTA files should not end with a double line break.
  • The following input rules apply:

| Condition | Behavior |
| --- | --- |
| No reference given | The CHROM information from VCF files is resolved into features, ranging from 1 to the largest position at which a variant was observed. |
| Only reference given | Each FASTA contig becomes a feature. |
| reference + annotation | All GFF3 entries are used. |
| reference + annotation + features | Only matched GFF3 entries are used. |
| annotation without reference | ❌ Not allowed. |
| features without annotation | ❌ Not allowed. |
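
The input rules can be checked before starting a build. The following sketch encodes the conditions listed above as a small function; the function name `resolve_feature_source` and the returned labels are hypothetical and only illustrate the decision logic.

```python
def resolve_feature_source(config: dict) -> str:
    """Classify a build configuration according to the input rules.

    Raises ValueError for the two combinations that are not allowed.
    The returned labels are illustrative, not MUSIAL terminology.
    """
    has_reference = bool(config.get("reference"))
    has_annotation = bool(config.get("annotation"))
    has_features = bool(config.get("features"))

    if has_annotation and not has_reference:
        raise ValueError("annotation without reference is not allowed")
    if has_features and not has_annotation:
        raise ValueError("features without annotation is not allowed")

    if not has_reference:
        return "vcf-chrom"            # features derived from VCF CHROM values
    if not has_annotation:
        return "fasta-contigs"        # each FASTA contig becomes a feature
    if not has_features:
        return "all-gff3-entries"     # all GFF3 entries are used
    return "matched-gff3-entries"     # only matched GFF3 entries are used


print(resolve_feature_source({"reference": "reference.fasta",
                              "annotation": "annotation.gff3",
                              "features": "features.tsv"}))
```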

🧬 Feature Selection via GFF3

Features are defined using key-value pairs matched against the 9th column of the GFF3 annotation file.

Example GFF3:

Contig1	Genbank	gene	22085	24172	.	+	.	ID=gene-0230;Name=priA;gene=priA;gbkey=Gene;...
Contig1	Protein Homology	CDS	22085	24172	.	+	0	ID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS;product=primosomal protein N';...
Contig1	RefSeq	gene	38311	39945	.	-	.	ID=gene-0305;Name=groL;gene=groL;gbkey=Gene;...
Contig1	Protein Homology	CDS	38311	39945	.	+	0	ID=cds-0305;Parent=gene-0305;Name=P0305;gbkey=CDS;Ontology_term=GO:0006457,GO:0005515,GO:0016887;...

Matching features.tsv:

Name	priA
ID	gene-0455
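
The matching of key-value pairs against the 9th GFF3 column can be sketched as below. This is an illustration, not MUSIAL's implementation: the helper names are hypothetical, exact string equality is assumed as the matching rule, and the GFF3 lines are shortened versions of the example above.

```python
from typing import Dict, Iterable, List, Tuple


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column (key=value;key=value;...) into a dict."""
    attributes: Dict[str, str] = {}
    for part in column9.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def select_features(gff_lines: Iterable[str],
                    selectors: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the IDs of GFF3 entries matching at least one (key, value) selector."""
    selectors = list(selectors)
    selected: List[str] = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GFF3 record
        attributes = parse_attributes(columns[8])
        if any(attributes.get(key) == value for key, value in selectors):
            selected.append(attributes.get("ID", ""))
    return selected


gff = [
    "\t".join(["Contig1", "Genbank", "gene", "22085", "24172", ".", "+", ".",
               "ID=gene-0230;Name=priA;gene=priA;gbkey=Gene"]),
    "\t".join(["Contig1", "Protein Homology", "CDS", "22085", "24172", ".", "+", "0",
               "ID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS"]),
    "\t".join(["Contig1", "RefSeq", "gene", "38311", "39945", ".", "-", ".",
               "ID=gene-0305;Name=groL;gene=groL;gbkey=Gene"]),
]
selected = select_features(gff, [("Name", "priA"), ("ID", "gene-0455")])
print(selected)
```

With the excerpt shown above, only the `Name priA` selector finds an entry (`gene-0230`); `gene-0455` does not occur in the lines shown.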

📦 Output

  • A Gzip-compressed JSON file (.json.gz) is created at the output location.
  • Contains harmonized and filtered variant calls, annotations, and metadata.
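
Because the output is Gzip-compressed JSON, it can be inspected with standard tooling. The sketch below uses only the Python standard library; the helper `load_storage` is hypothetical, and since the internal layout of the storage file is not described here, the demo writes and reads a stand-in file with placeholder content.

```python
import gzip
import json
import os
import tempfile


def load_storage(path: str) -> dict:
    """Read a Gzip-compressed JSON file into a Python dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Demo with a stand-in file; a real storage file is produced by the build task.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "musial_storage.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        json.dump({"placeholder": True}, handle)
    demo = load_storage(demo_path)

# List the top-level keys to get a first overview of the content.
print(sorted(demo))
```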