Task: build
The build task is the initial step in any MUSIAL analysis. It aggregates and harmonizes variant calls into a MUSIAL storage file, a compressed JSON file that serves as input for subsequent tasks.
java -jar MUSIAL-v2.4.0.jar build -C <path-to-config.json>
Required argument:
- -C, --configuration <arg>: Path to a JSON file specifying the build configuration.
The build configuration is a JSON file that is validated against the build schema. A typical configuration looks like this:
{
"reference": "reference.fasta",
"annotation": "annotation.gff3",
"features": "features.tsv",
"vcfInput": ["sample1.vcf", "other_samples/"],
"vcfMeta": "metadata.csv",
"output": "musial_storage.json.gz",
"minimalCoverage": 3.0,
"minimalFrequency": 0.65,
"storeFiltered": false,
"skipSnpEff": false,
"skipProteoformInference": true
}
The individual properties of the configuration file are described briefly below. A comprehensive overview, including all matching patterns, is available in buildConfigurationSchema.json.
Field | Description |
---|---|
reference | Path to a FASTA file that defines the reference sequences. |
annotation | Path to a GFF3 file that defines features on the reference sequence. |
features | Path to a .tsv/.csv file describing which GFF3 features to extract. |
excludedPositions | BED/TSV/CSV file listing positions to exclude. |
excludedVariants | TSV/CSV file listing variant calls to exclude. |
vcfInput * | List of VCF files or directories containing VCFs. |
vcfMeta | CSV/TSV file with sample metadata. The first column must match genotypes/sample names in VCF. |
output * | Output path for the MUSIAL storage file. If a directory is provided, the file name defaults to musial_storage_<date>.json.gz. |
minimalCoverage | Minimum total read depth for accepting a variant (default: 3). |
minimalFrequency | Minimum fraction of variant-supporting reads at a position (default: 0.65). |
storeFiltered | If true, filtered variants are retained with ambiguous content N (default: false). |
skipSnpEff | If true, skips variant effect prediction using SnpEff (default: false). |
skipProteoformInference | If true, skips proteoform inference (default: false). |
* Required
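As a minimal sketch (not part of MUSIAL), a configuration could be assembled in Python and checked for the two required fields before the task is run. The helper name `missing_required` is hypothetical, and the authoritative validation remains the one MUSIAL performs against buildConfigurationSchema.json.

```python
import json

# Fields marked with * in the table above.
REQUIRED_FIELDS = ("vcfInput", "output")


def missing_required(config: dict) -> list:
    """Return the required fields that are absent or empty (hypothetical helper)."""
    return [key for key in REQUIRED_FIELDS if not config.get(key)]


config = {
    "vcfInput": ["sample1.vcf", "other_samples/"],
    "output": "musial_storage.json.gz",
    "minimalCoverage": 3.0,
    "minimalFrequency": 0.65,
}

problems = missing_required(config)
if problems:
    raise ValueError(f"missing required fields: {problems}")

# Serialized form that could be written to the file passed via -C.
config_text = json.dumps(config, indent=2)
```

Only presence is checked here; value types and path patterns are left to the schema.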
Notes on vcfInput:
- Accepts VCF files and directories. All VCF files contained in the specified directories are parsed.
- All VCF input files should comply with the VCF v4.5 specification.
- Each alternative genotype must have an AD info property. Each homozygous reference must have a DP info property.
- Irrespective of the ploidy in the VCF files, only the most common alternative is retained as a variant.
- Both single- and multi-sample (cohort) VCFs are supported.
- If a variant call for a sample is present in several files, the read depths are totaled.
- Sample names in a VCF file can be supplemented by a suffix $..., which merges listed variant calls or totals their read depths.
- VCF files are temporarily copied into the system’s temporary directory for processing. This can be changed by setting the java.io.tmpdir JVM property before -jar:
java -Djava.io.tmpdir=/path/to/tmpdir -jar MUSIAL-v2.4.0.jar build -C <path-to-config.json>
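The totaling of read depths and the minimalCoverage and minimalFrequency thresholds can be illustrated with a short Python sketch. It is an interpretation of the descriptions above, not MUSIAL's implementation: the tuple layout, the function names, and the assumption that depths are totaled before the thresholds are applied are all hypothetical.

```python
from collections import defaultdict


def total_depths(calls):
    """Sum depths per (sample, position, alt) over calls from several files.

    Each call is (sample, position, alt, alt_depth, total_depth); this layout
    is an assumption made for the sketch.
    """
    totals = defaultdict(lambda: [0, 0])
    for sample, position, alt, alt_depth, total_depth in calls:
        entry = totals[(sample, position, alt)]
        entry[0] += alt_depth
        entry[1] += total_depth
    return {key: tuple(value) for key, value in totals.items()}


def passes_filters(alt_depth, total_depth, min_coverage=3.0, min_frequency=0.65):
    """Apply the coverage threshold first, then the frequency threshold."""
    if total_depth <= 0 or total_depth < min_coverage:
        return False
    return alt_depth / total_depth >= min_frequency


# The same call for sample "s1" reported in two files.
calls = [("s1", 100, "A", 2, 2), ("s1", 100, "A", 3, 4)]
totals = total_depths(calls)
accepted = {key: passes_filters(*depths) for key, depths in totals.items()}
```

In this example neither file alone reaches a combined picture: the first call has a depth of 2, below the default coverage of 3, while the totaled call has 5 of 6 supporting reads and passes both defaults.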
Notes on vcfMeta:
- Entries in the first column are matched against sample names in the VCF genotype fields.
- Remaining columns can be arbitrary annotations and are stored as sample attributes.
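A metadata file of this shape can be checked against the VCF sample names ahead of time. The sketch below assumes a header row and comma-separated values; the column names, the sample names, and the helper `read_sample_metadata` are hypothetical.

```python
import csv
import io


def read_sample_metadata(handle, delimiter=","):
    """Map first-column sample names to a dict of the remaining columns."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # assumes the first row holds column names
    attributes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        attributes[row[0]] = dict(zip(header[1:], row[1:]))
    return attributes


# In-memory stand-in for metadata.csv with made-up annotation columns.
example = io.StringIO("sample,country,year\nsample1,DE,2019\nsample2,FR,2021\n")
metadata = read_sample_metadata(example)

# Sample names as they would appear in the VCF genotype columns.
vcf_samples = ["sample1", "sample3"]
unmatched = [name for name in vcf_samples if name not in metadata]
```

For a TSV file, `delimiter="\t"` would be passed instead.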
Notes on reference, annotation, and features:
- GFF3 must comply with the GFF3 specification and contain a correct parent-child hierarchy.
- An index of the FASTA reference is created in memory.
- GFF3 must not contain embedded FASTA sequence data.
- FASTA files must not end with a double line break.
- The following input rules apply:
Condition | Behavior |
---|---|
No reference given | The CHROM information from VCF files is resolved into features, ranging from 1 to the largest position at which a variant was observed. |
Only reference given | Each FASTA contig becomes a feature. |
reference + annotation | All GFF3 entries are used. |
reference + annotation + features | Only matched GFF3 entries are used. |
annotation without reference | ❌ Not allowed. |
features without annotation | ❌ Not allowed. |
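The input rules in this table can be expressed as a small decision function. This is a sketch of the documented behavior only; the function name `resolve_feature_mode` and the returned strings are hypothetical and do not correspond to MUSIAL identifiers.

```python
def resolve_feature_mode(config):
    """Return how features would be derived, or raise for disallowed inputs."""
    has_reference = bool(config.get("reference"))
    has_annotation = bool(config.get("annotation"))
    has_features = bool(config.get("features"))

    # The two combinations marked as not allowed.
    if has_annotation and not has_reference:
        raise ValueError("annotation requires reference")
    if has_features and not has_annotation:
        raise ValueError("features requires annotation")

    if not has_reference:
        return "features derived from VCF CHROM values"
    if not has_annotation:
        return "one feature per FASTA contig"
    if not has_features:
        return "all GFF3 entries"
    return "only matched GFF3 entries"


mode = resolve_feature_mode(
    {"reference": "reference.fasta", "annotation": "annotation.gff3"}
)
```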
Features are defined using key-value pairs matched against the 9th column of the GFF3 annotation file.
Example GFF3:
Contig1 Genbank gene 22085 24172 . + . ID=gene-0230;Name=priA;gene=priA;gbkey=Gene;...
Contig1 Protein Homology CDS 22085 24172 . + 0 ID=cds-0230;Parent=gene-0230;Name=P0230;gbkey=CDS;product=primosomal protein N';...
Contig1 RefSeq gene 38311 39945 . - . ID=gene-0305;Name=groL;gene=groL;gbkey=Gene;...
Contig1 Protein Homology CDS 38311 39945 . + 0 ID=cds-0305;Parent=gene-0305;Name=P0305;gbkey=CDS;Ontology_term=GO:0006457,GO:0005515,GO:0016887;...
Matching features.tsv:
Name priA
ID gene-0455
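The key-value matching against the 9th GFF3 column can be sketched as follows. The sketch assumes tab-separated GFF3 lines, treats each row of features.tsv as an independent criterion, and ignores URL-escaping in attribute values; the function names are hypothetical, and the attribute strings are shortened versions of the example above.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column such as 'ID=x;Name=y' into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def matches(gff_line, criteria):
    """True if any (key, value) criterion equals an attribute of the line."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return False
    attributes = parse_attributes(columns[8])
    return any(attributes.get(key) == value for key, value in criteria)


# Rows of the features.tsv shown above.
criteria = [("Name", "priA"), ("ID", "gene-0455")]

pri_a = "Contig1\tGenbank\tgene\t22085\t24172\t.\t+\t.\tID=gene-0230;Name=priA;gene=priA;gbkey=Gene"
gro_l = "Contig1\tRefSeq\tgene\t38311\t39945\t.\t-\t.\tID=gene-0305;Name=groL;gene=groL;gbkey=Gene"

selected = [line for line in (pri_a, gro_l) if matches(line, criteria)]
```

With these criteria only the priA gene line is selected; neither excerpted line carries the ID gene-0455.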
Output:
- A Gzip-compressed JSON file (.json.gz) is created at the output location.
- It contains harmonized and filtered variant calls, annotations, and metadata.
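Because the storage file is Gzip-compressed JSON, it can be inspected with the Python standard library. The sketch below only decompresses and parses the file; the internal layout of the storage is not described here, so no particular keys are assumed, and the function name `load_storage` is hypothetical.

```python
import gzip
import json


def load_storage(path):
    """Decompress and parse a .json.gz file into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_storage("musial_storage.json.gz")` would return the parsed content of the file produced by the configuration shown earlier.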
Matching and Unique Identification: MUSIAL extracts features from the 9th column of a GFF3 annotation file using attribute key-value matching (e.g., Name). After matching, MUSIAL assigns a unique identifier (UID) to each feature, taken in order of priority from the Parent, ID, or locus_tag attribute, or otherwise from the location (CONTIG:START..END). This UID is used to consolidate multiple entries and to ensure a consistent reference across child/parent relationships.
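The stated priority order can be sketched as a short function. This is an illustration of the description above rather than MUSIAL's code; the function name `feature_uid` and the attribute-dict input are hypothetical.

```python
def feature_uid(attributes, contig, start, end):
    """Pick a UID from Parent, ID, locus_tag, or the location, in that order."""
    for key in ("Parent", "ID", "locus_tag"):
        value = attributes.get(key)
        if value:
            return value
    # Fall back to the location in CONTIG:START..END form.
    return f"{contig}:{start}..{end}"


gene_uid = feature_uid({"ID": "gene-0230", "Name": "priA"}, "Contig1", 22085, 24172)
cds_uid = feature_uid({"ID": "cds-0230", "Parent": "gene-0230"}, "Contig1", 22085, 24172)
location_uid = feature_uid({}, "Contig1", 22085, 24172)
```

Under this reading, the gene and its CDS from the example GFF3 both resolve to gene-0230, which is how a shared UID can consolidate related entries.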
Hierarchy Construction: MUSIAL reconstructs a Sequence Ontology (SO)-compliant hierarchy from the parsed features. Each feature's type is validated against supported ontology levels, currently comprising: region; gene, pseudogene; mRNA, tRNA, rRNA, tmRNA, ncRNA, SRP_RNA, RNase_P_RNA; CDS, exon. Parent–child relationships (e.g., gene → mRNA → CDS) are restored or inferred based on Parent attributes or logical inference rules, e.g., the gene of a CDS must include the position interval of the CDS.
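One of the inference rules mentioned above, that a gene must include the position interval of its CDS, can be sketched with the listed ontology levels. The `Feature` class and the `may_contain` check are hypothetical simplifications; in particular, requiring only that the parent sits at a higher level than the child is an assumption.

```python
from dataclasses import dataclass

# Ontology levels as listed above, from outermost to innermost.
LEVELS = (
    {"region"},
    {"gene", "pseudogene"},
    {"mRNA", "tRNA", "rRNA", "tmRNA", "ncRNA", "SRP_RNA", "RNase_P_RNA"},
    {"CDS", "exon"},
)


@dataclass(frozen=True)
class Feature:
    contig: str
    type: str
    start: int
    end: int


def level_of(feature_type):
    """Return the level index of a type, or None if it is unsupported."""
    for index, types in enumerate(LEVELS):
        if feature_type in types:
            return index
    return None


def may_contain(parent, child):
    """True if parent is at a higher level and spans the child's interval."""
    parent_level = level_of(parent.type)
    child_level = level_of(child.type)
    if parent_level is None or child_level is None:
        return False
    return (
        parent_level < child_level
        and parent.contig == child.contig
        and parent.start <= child.start
        and child.end <= parent.end
    )


gene = Feature("Contig1", "gene", 22085, 24172)
cds = Feature("Contig1", "CDS", 22085, 24172)
```

Here `may_contain(gene, cds)` holds, while the reverse direction or a CDS extending past the gene's end does not.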
Feature Correction and Imputation: Features with incomplete hierarchies are auto-corrected: If only a CDS is matched, MUSIAL will infer a corresponding mRNA and gene feature. Redundant or ambiguous entries (e.g., multiple conflicting mRNA/CDS entries) are cleaned up. Features with unsupported SO types or invalid locations are discarded with warnings.
The final feature set includes cleaned, uniquely identified, SO-validated elements ready for downstream use.