Data Format
GenomeFlow generates a text file in the medium file format: a whitespace-separated file that contains, on each line:
<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
- str = strand (0 for forward, anything else for reverse)
- chr = chromosome (must be a chromosome in the genome)
- pos = position
- frag = restriction site fragment
- mapq = mapping quality score
If the restriction site file option is not used, frag is ignored, but please see the above note on dummy values. If the mapping quality filter is not used, mapq is ignored. In addition, readname and strand are not currently stored within .hic files.
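The field layout above can be read with a few lines of Python. This is a minimal sketch, not GenomeFlow's own parser: the names `MediumRecord`, `parse_medium_line`, and `is_forward` are hypothetical, and it assumes that pos, frag, and mapq are integers. The strand is kept as the raw token because the format only says that 0 means forward and anything else means reverse.

```python
from typing import NamedTuple


class MediumRecord(NamedTuple):
    """One line of a medium-format file (hypothetical container type)."""
    readname: str
    str1: str   # raw strand token; "0" means forward
    chr1: str
    pos1: int
    frag1: int
    str2: str
    chr2: str
    pos2: int
    frag2: int
    mapq1: int
    mapq2: int


def parse_medium_line(line: str) -> MediumRecord:
    """Split one whitespace-separated line into its 11 fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    name, s1, c1, p1, f1, s2, c2, p2, f2, q1, q2 = fields
    # Assumption: positions, fragments, and mapping qualities are integers.
    return MediumRecord(name, s1, c1, int(p1), int(f1),
                        s2, c2, int(p2), int(f2), int(q1), int(q2))


def is_forward(strand: str) -> bool:
    """0 is forward; anything else is treated as reverse."""
    return strand == "0"
```

For example, `parse_medium_line("read1 0 chr1 10000 0 16 chr2 25000 1 30 42")` (an invented line) yields a record whose first end is on the forward strand of chr1 and whose second end is on the reverse strand of chr2.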
- Bowtie2: http://sysbio.rnet.missouri.edu/bdm_download/GenomeFlow/GM06990/GenomeFlow_formatted.bowtie2.input
- Bwa: http://sysbio.rnet.missouri.edu/bdm_download/GenomeFlow/GM06990/GenomeFlow_formatted.bwa.input
More details about other file formats can be found here
The .hic file is a binary file containing compressed contact matrices at many resolutions, normalized by different methods, which facilitates visualization and analysis at multiple scales. The .hic file format is described extensively in Durand and Shamim et al., 2016.
To create a .hic file, see Convert mapped Hi-C reads to .hic file.
A contact matrix in sparse matrix format contains three columns of data. Each line represents a contact by three numbers: position1, position2, and interaction frequency.
The full matrix format is an N by N contact matrix derived from Hi-C data, where N is the number of equal-sized regions of a chromosome. It is a full matrix representing all the contact regions.
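The relationship between the sparse and full representations can be sketched in Python. The function name `sparse_to_full` is hypothetical, and the sketch rests on two assumptions that the text does not state: positions are genomic coordinates that are converted to bin indices by integer division with the resolution, and the matrix is symmetric, so each contact is written to both (i, j) and (j, i).

```python
def sparse_to_full(lines, resolution, n_bins):
    """Build an n_bins x n_bins full matrix from sparse triples.

    Each input line holds: position1 position2 interaction_frequency.
    Assumption: positions are genomic coordinates, binned by `resolution`.
    """
    matrix = [[0.0] * n_bins for _ in range(n_bins)]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        pos1, pos2, freq = line.split()
        i = int(pos1) // resolution
        j = int(pos2) // resolution
        # Assumption: contacts are symmetric, so fill both triangles.
        matrix[i][j] = float(freq)
        matrix[j][i] = float(freq)
    return matrix
```

Pairs of regions that never appear in the sparse file stay at 0.0 in the full matrix, which is why the sparse form is much smaller for high-resolution data.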
A FASTA file contains a read name, on a line starting with ">", followed by the sequence of the read (a string of A, C, T, G, or N). An example of one of these reads for RNA-Seq might be:
>Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
Read more about FASTA files in the article.
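As an illustration of the layout described above, the following Python sketch reads FASTA records from an iterable of lines. The function name `read_fasta` is hypothetical, and the sketch assumes that a sequence may be wrapped over several lines, which are joined together.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be used directly on an open file object, for example `for name, seq in read_fasta(open(path))`, where `path` is a placeholder for a FASTA file.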
FASTQ is another DNA sequence file format that extends the FASTA format with the ability to store the sequence quality. The quality scores are often represented as ASCII characters, with '!' being the lowest and '~' being the highest, in increasing ASCII value. It would look something like this:
@Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTG
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Read more about FASTQ files in the article.
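The four-line record structure and the ASCII quality encoding can be sketched as follows. The names `phred_scores` and `read_fastq` are hypothetical. The offset of 33 is an assumption (the common Phred+33 encoding, in which '!' maps to 0); files produced with a different encoding would need a different offset.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores.

    Assumption: Phred+33 encoding, so '!' (ASCII 33) is score 0.
    """
    return [ord(ch) - offset for ch in quality]


def read_fastq(lines):
    """Yield (name, sequence, scores) from an iterable of FASTQ lines."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Each base must have exactly one quality character.
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, phred_scores(qual)
```

With the default offset, `phred_scores("!5I~")` returns `[0, 20, 40, 93]`, which matches the ordering described above: '!' is the lowest score and '~' the highest.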
SAM is a format for representing sequence alignment information from a read aligner. SAM can also store unaligned sequence information. It represents sequence information with respect to a given reference sequence. The information is stored in a series of tab-delimited ASCII columns. The full SAM format specification is available at SAM samtools.
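To make the tab-delimited columns concrete, the sketch below splits one SAM alignment line into the 11 mandatory fields defined by the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), followed by any optional tags. The function names are hypothetical, and this is not a replacement for samtools or a full SAM library; it only illustrates the layout.

```python
def parse_sam_line(line):
    """Parse one SAM line; return None for header lines (starting with '@')."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),    # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "rnext": fields[6],
        "pnext": int(fields[7]),
        "tlen": int(fields[8]),
        "seq": fields[9],
        "qual": fields[10],
        "tags": fields[11:],      # optional TAG:TYPE:VALUE fields
    }


def is_unmapped(record):
    """FLAG bit 0x4 marks an unmapped read."""
    return bool(record["flag"] & 0x4)


def is_reverse(record):
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record["flag"] & 0x10)
```

The FLAG and MAPQ columns are the ones most relevant to the medium format described earlier, since they carry the strand and the mapping quality of each read.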
BAM files are compressed binary versions of SAM files. A file in the BAM binary format (.bam) is obtained by converting a SAM file into a BAM file. Check samtools for the BAM format specification and the tools for post-processing the alignment.
The Protein Data Bank (PDB) file format is a textual file format describing the three-dimensional structures of molecules. The .pdb format is a standard for files containing atomic coordinates. Read more about PDB in this article.
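Atomic coordinates in a PDB file sit in fixed character columns of ATOM and HETATM records, so they can be extracted by slicing. The sketch below uses the standard column positions from the PDB format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-38, 39-46, 47-54). The function name `parse_atom_line` is hypothetical, and the sketch assumes that the files strictly follow these columns.

```python
def parse_atom_line(line):
    """Extract identity and coordinates from one ATOM/HETATM record.

    Returns None for any other record type.
    Assumption: the line follows the fixed-column PDB layout.
    """
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21].strip(),
        "res_seq": int(line[22:26]),
        # Python slices are 0-based, so columns 31-38 become [30:38], etc.
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def read_coordinates(lines):
    """Collect (x, y, z) tuples for every atom record in the input."""
    atoms = (parse_atom_line(line) for line in lines)
    return [(a["x"], a["y"], a["z"]) for a in atoms if a is not None]
```

Splitting on whitespace instead of slicing is tempting but unreliable, because adjacent fixed-width fields can touch when values are large or negative.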
This format was introduced to allow for the multi-scale system in GMOL. The gss file format allows for the visualization of a genome at multiple scales instead of having to view each PDB structure file individually. A description of the gss file is given in this article.
- Create reference genome index
- Mapping raw FASTQ files
- Filter a BAM alignment file
- Convert a BAM file to Medium file format
- HiC-Express