Data Format
GenomeFlow generates a text file in the medium file format.
This is a whitespace-separated file that contains, on each line:
<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
- str = strand (0 for forward, anything else for reverse)
- chr = chromosome (must be a chromosome in the genome)
- pos = position
- frag = restriction site fragment
- mapq = mapping quality score
If you are not using the restriction site file option, frag is ignored, but please see the above note on dummy values. If you are not using the mapping quality filter, mapq is ignored. readname and strand are also not currently stored within .hic files.
- Bowtie2: http://sysbio.rnet.missouri.edu/bdm_download/GenomeFlow/GM06990/GenomeFlow_formatted.bowtie2.input
- Bwa: http://sysbio.rnet.missouri.edu/bdm_download/GenomeFlow/GM06990/GenomeFlow_formatted.bwa.input
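As an illustration of the field layout described above, the following is a minimal sketch of reading one medium-format line in Python. The names `MediumRecord`, `parse_medium_line`, and `is_forward` are hypothetical and not part of GenomeFlow, the sample line is made up, and the sketch assumes that pos, frag, and mapq are integers.

```python
from typing import NamedTuple


class MediumRecord(NamedTuple):
    """One line of a medium-format file (hypothetical helper type)."""
    readname: str
    str1: str   # strand is kept as text: "0" is forward, anything else is reverse
    chr1: str
    pos1: int
    frag1: int
    str2: str
    chr2: str
    pos2: int
    frag2: int
    mapq1: int
    mapq2: int


def parse_medium_line(line: str) -> MediumRecord:
    """Split a whitespace-separated line into the 11 documented fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    readname, str1, chr1, pos1, frag1, str2, chr2, pos2, frag2, mapq1, mapq2 = fields
    # Assumption: pos, frag, and mapq are integers.
    return MediumRecord(readname, str1, chr1, int(pos1), int(frag1),
                        str2, chr2, int(pos2), int(frag2),
                        int(mapq1), int(mapq2))


def is_forward(strand: str) -> bool:
    """Per the format description, 0 means forward and anything else reverse."""
    return strand == "0"


# Made-up example line, for illustration only.
record = parse_medium_line("read1 0 chr1 10500 0 16 chr2 20300 1 30 42")
print(record.chr1, record.pos1, is_forward(record.str1))
print(record.chr2, record.pos2, is_forward(record.str2))
```

Keeping the strand as text avoids guessing what "anything else" may look like in a given file.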
More details about other file formats can be found here
The .hic file is a binary file containing compressed contact matrices at many resolutions, normalized by different methods, which facilitates visualization and analysis at multiple scales. The .hic file format is described extensively in Durand and Shamim et al., 2016.
To create a .hic file, use the GenomeFlow 2D function "Convert mapped Hi-C reads to hic format file".
A FASTA file contains a read name followed by the sequence or read (a string of A, C, T, G, or N). An example of one of these reads for RNASeq might be:
>Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
Read more about FASTA files in the article.
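The layout described above (a name line starting with ">" followed by the sequence) can be read with a few lines of Python. This is only a sketch: `read_fasta` is a hypothetical helper, and it assumes that a sequence may be wrapped over several lines, which is common in FASTA files but is not stated above.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from the lines of a FASTA file."""
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new name line closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Assumption: sequence lines are concatenated in order.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


example = [">read1", "ATTGGCTA", "ATTGGCTA", ">read2", "NNACGT"]
for read_name, sequence in read_fasta(example):
    print(read_name, len(sequence))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.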
FASTQ is another DNA sequence file format that extends the FASTA format with the ability to store sequence quality. Quality scores are usually represented as ASCII characters, with '!' the lowest and '~' the highest, in order of increasing ASCII value. A record looks something like this:
@Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Read more about FASTQ files in the article
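To make the quality encoding concrete, here is a small sketch that turns a quality string into numeric scores. It assumes the widely used Phred+33 encoding, in which '!' (ASCII 33) stands for a score of 0. Files written with a different offset would need a different `offset` value, and `phred_scores` is a hypothetical helper name.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string into integer quality scores.

    Assumption: the file uses the Phred+33 encoding by default.
    """
    return [ord(ch) - offset for ch in quality]


# '!' is the lowest score and '~' the highest under this encoding.
print(phred_scores("!5C~"))  # [0, 20, 34, 93]
```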
SAM is a format for representing sequence alignment information from a read aligner. SAM can also store unaligned sequence information. It represents sequence information with respect to a given reference sequence. The information is stored in a series of tab-delimited ASCII columns. The full SAM format specification is available at SAM samtools
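As a rough illustration of the tab-delimited layout, the sketch below splits one SAM alignment line into the eleven mandatory columns named in the SAM specification. The helper names and the sample line are made up for illustration. Header lines, which start with "@", are assumed to be skipped by the caller.

```python
from typing import Dict

# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Parse one alignment line (not a header line) into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    record: Dict[str, object] = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Any remaining columns are optional TAG:TYPE:VALUE fields.
    record["TAGS"] = cols[len(SAM_FIELDS):]
    return record


def is_unmapped(flag: int) -> bool:
    """FLAG bit 0x4 marks a read that did not align."""
    return bool(flag & 0x4)


def is_reverse(flag: int) -> bool:
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(flag & 0x10)


# Made-up alignment line, for illustration only.
sample = "read1\t16\tchr1\t100\t30\t8M\t*\t0\t0\tATTGGCTA\tIIIIIIII"
rec = parse_sam_line(sample)
print(rec["RNAME"], rec["POS"], rec["MAPQ"], is_reverse(rec["FLAG"]))
```

For real data, a dedicated tool such as samtools is usually the better choice. The sketch is only meant to show how the columns line up.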
BAM files are compressed binary versions of SAM files. A BAM file (.bam) is obtained by converting a SAM file into the binary format. Check samtools for the BAM format specification and the tools for post-processing the alignment.
- Create reference genome index
- Mapping raw FASTQ files
- Filter a BAM alignment file
- Convert a BAM file to Medium file format
- HiC-Express