Data Format
GenomeFlow generates a text file in the medium file format.
The medium format is a whitespace-separated file that contains, on each line:
<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
- str = strand (0 for forward, anything else for reverse)
- chr = chromosome (must be a chromosome in the genome)
- pos = position
- frag = restriction site fragment
- mapq = mapping quality score
If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. If not using mapping quality filter, mapq will be ignored. readname and strand are also not currently stored within .hic files.
- Bowtie2: http://sysbio.rnet.missouri.edu/bdm_download/GenomeFlow/GM06990/GenomeFlow_formatted.bowtie2.input
- Bwa: http://sysbio.rnet.missouri.edu/bdm_download/GenomeFlow/GM06990/GenomeFlow_formatted.bwa.input
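As a minimal sketch of how one line of the medium format could be read, the snippet below splits a line into the eleven fields listed above. The names `MediumRecord`, `parse_medium_line`, and `is_forward` are hypothetical, the sample line is made up rather than taken from the linked files, and the numeric fields are assumed to be integers.

```python
from typing import NamedTuple


class MediumRecord(NamedTuple):
    """One read pair from a medium-format line (illustrative names)."""
    readname: str
    str1: int
    chr1: str
    pos1: int
    frag1: int
    str2: int
    chr2: str
    pos2: int
    frag2: int
    mapq1: int
    mapq2: int


def parse_medium_line(line: str) -> MediumRecord:
    """Split one whitespace-separated line into the 11 medium-format fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    name, s1, c1, p1, f1, s2, c2, p2, f2, q1, q2 = fields
    return MediumRecord(name, int(s1), c1, int(p1), int(f1),
                        int(s2), c2, int(p2), int(f2), int(q1), int(q2))


def is_forward(strand: int) -> bool:
    """Strand 0 means forward; any other value means reverse."""
    return strand == 0


# Made-up example line (assumption, not real GenomeFlow output).
record = parse_medium_line("read_1 0 chr1 10500 0 16 chr1 250300 1 42 37")
print(record.chr1, record.pos1, is_forward(record.str1), is_forward(record.str2))
```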
Other input file formats allowed by GenomeFlow are:
A whitespace-separated file that contains, on each line:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
- str = strand (0 for forward, anything else for reverse)
- chr = chromosome (must be a chromosome in the genome)
- pos = position
- frag = restriction site fragment
If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. readname and strand are also not currently stored within .hic files.
The next format adds a score column and is useful for reading in already processed files, e.g. those that have already been binned and/or normalized; it can easily be used in conjunction with the -r flag to create a .hic file that contains a single resolution.
A whitespace-separated file that contains, on each line:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <score>
- str = strand (0 for forward, anything else for reverse)
- chr = chromosome (must be a chromosome in the genome)
- pos = position
- frag = restriction site fragment
- score = the score imputed to this read
If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. readname and strand are also not currently stored within .hic files.
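For already binned data, each contact can be written as one nine-column line with a trailing score. The sketch below is one possible way to do that; the function name `score_line` is hypothetical, both strands are set to 0 as placeholders, and the dummy frag values 0 and 1 are an assumption rather than something this page specifies.

```python
def score_line(chrom1, pos1, chrom2, pos2, score, frag1=0, frag2=1):
    """Format one binned contact as '<str1> <chr1> <pos1> <frag1> ... <score>'.

    Strands are written as 0 (forward) because binned data has no strand;
    frag1/frag2 default to dummy values (an assumption of this sketch).
    """
    return f"0 {chrom1} {pos1} {frag1} 0 {chrom2} {pos2} {frag2} {score}"


# Made-up binned contacts: (chr1, bin start 1, chr2, bin start 2, score).
contacts = [("chr1", 0, "chr1", 1000000, 12.5),
            ("chr1", 1000000, "chr1", 2000000, 3.0)]
lines = [score_line(*contact) for contact in contacts]
print("\n".join(lines))
```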
The long format is used by Juicer and takes the merged_nodups.txt file directly.
A whitespace-separated file that contains, on each line:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>
- str = strand (0 for forward, anything else for reverse)
- chr = chromosome (must be a chromosome in the genome)
- pos = position
- frag = restriction site fragment
- mapq = mapping quality score
- cigar = cigar string as reported by aligner
- sequence = DNA sequence
If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. If not using mapping quality filter, mapq will be ignored. readname, strand, cigar, and sequence are also not currently stored within .hic files.
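Because the long format carries every field of the medium format plus cigar strings and sequences, a long-format line can be reduced to a medium-format line by reordering columns. The following is a sketch under that reading of the two column lists; `long_to_medium` is a hypothetical helper, the sample line is made up, and the choice of `readname1` as the read name is an assumption.

```python
def long_to_medium(line: str) -> str:
    """Reorder one 16-column long-format line into the 11 medium-format columns."""
    fields = line.split()
    if len(fields) != 16:
        raise ValueError(f"expected 16 fields, got {len(fields)}")
    (str1, chr1, pos1, frag1, str2, chr2, pos2, frag2,
     mapq1, _cigar1, _seq1, mapq2, _cigar2, _seq2,
     readname1, _readname2) = fields
    # Medium order: readname, then both read ends, then both mapq values.
    return " ".join([readname1, str1, chr1, pos1, frag1,
                     str2, chr2, pos2, frag2, mapq1, mapq2])


# Made-up long-format line (assumption, not real Juicer output).
long_line = ("0 chr1 10500 0 16 chr1 250300 1 "
             "42 8M ATTGGCTA 37 8M TAGCCAAT read_1/1 read_1/2")
medium_line = long_to_medium(long_line)
print(medium_line)
```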
A file that follows the 4DN DCIC format specification.
See the specification for more information. Briefly, there should be a header with the first seven columns reserved:
## pairs format v1.0
#columns: readID chr1 position1 chr2 position2 strand1 strand2
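A quick way to check such a header before further processing could look like the sketch below. It only tests the two header lines shown above; the function name `has_pairs_header` is hypothetical, and the reserved column names are copied from this page rather than from the specification itself.

```python
RESERVED_COLUMNS = ["readID", "chr1", "position1", "chr2",
                    "position2", "strand1", "strand2"]


def has_pairs_header(lines) -> bool:
    """Return True if the lines start with the pairs header shown above."""
    lines = [line.rstrip("\n") for line in lines]
    if not lines or not lines[0].startswith("## pairs format v1.0"):
        return False
    for line in lines:
        if not line.startswith("#"):
            break  # header is over, data lines begin
        if line.startswith("#columns:"):
            columns = line[len("#columns:"):].split()
            return columns[:7] == RESERVED_COLUMNS
    return False


good = ["## pairs format v1.0",
        "#columns: readID chr1 position1 chr2 position2 strand1 strand2"]
bad = ["#columns: chr1 position1 chr2 position2"]
print(has_pairs_header(good), has_pairs_header(bad))
```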
Please refer here for more details about all the file formats.
The .hic file is a binary file containing compressed contact matrices at many resolutions, normalized by different methods, which facilitates visualization and analysis at multiple scales. The .hic file format is described extensively in Durand and Shamim et al., 2016.
To create a .hic file, see Convert mapped Hi-C reads to .hic file.
A contact matrix in sparse matrix format contains three columns of data. Each line represents a contact by three numbers separated by whitespace: <position1> <position2> <interaction frequency>
A full matrix is an N by N contact matrix derived from Hi-C data, where N is the number of equal-sized regions of a chromosome. It covers all contact regions.
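The relation between the two matrix formats can be illustrated with a small conversion. This sketch assumes that positions are bin start coordinates at a fixed resolution, that the matrix is symmetric, and that N is taken from the largest bin seen; `sparse_to_full` and the sample data are hypothetical.

```python
def sparse_to_full(rows, resolution):
    """Expand (position1, position2, frequency) rows into an N by N list of lists."""
    triples = [(int(p1) // resolution, int(p2) // resolution, float(freq))
               for p1, p2, freq in rows]
    if not triples:
        return []
    n = max(max(i, j) for i, j, _ in triples) + 1
    matrix = [[0.0] * n for _ in range(n)]
    for i, j, freq in triples:
        matrix[i][j] = freq
        matrix[j][i] = freq  # assume a symmetric contact map
    return matrix


# Made-up sparse matrix at 1 Mb resolution.
sparse_text = """0 0 10
0 1000000 4
1000000 2000000 7"""
rows = [line.split() for line in sparse_text.splitlines()]
full = sparse_to_full(rows, 1_000_000)
for row in full:
    print(" ".join(str(value) for value in row))
```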
A FASTA file contains a read name followed by the sequence or read (a string of A, C, T, G, or N). An example of one of these reads for RNA-Seq might be:
>Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
Read more about FASTA files in the article.
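As a small illustration of the layout, the sketch below reads FASTA records from a list of lines, joining sequences that span several lines. `read_fasta` is a hypothetical helper and the records are made up; established libraries handle more edge cases.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Made-up records; the second sequence contains an unknown base (N).
fasta_lines = [">read_1", "ATTGGCTA", "ATTGGCTA", ">read_2", "NNACGT"]
records = list(read_fasta(fasta_lines))
print(records)
```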
FASTQ is another DNA sequence file format that extends the FASTA format with the ability to store the sequence quality. The quality scores are often represented as ASCII characters, '!' being the lowest and '~' being the highest, in increasing ASCII value. It would look something like this:
@Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Read more about FASTQ files in the article.
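To make the quality encoding concrete, the sketch below converts quality characters to numeric scores and checks one four-line record. It assumes the common offset of 33, under which '!' maps to 0 and '~' to 93; `phred_scores`, `parse_fastq_record`, and the sample record are hypothetical.

```python
def phred_scores(quality: str, offset: int = 33):
    """Convert quality characters to integer scores (offset 33 assumed)."""
    return [ord(char) - offset for char in quality]


def parse_fastq_record(four_lines):
    """Return (name, sequence, scores) for one four-line FASTQ record."""
    header, sequence, separator, quality = (l.rstrip("\n") for l in four_lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, phred_scores(quality)


# Made-up record with one quality character per base.
name, sequence, scores = parse_fastq_record(["@read_1", "ACGT", "+", "!5I~"])
print(name, sequence, scores)
```

The length check reflects that each base needs exactly one quality character.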
SAM is a format for representing sequence alignment information from a read aligner. SAM can also store unaligned sequence information. It represents sequence information with respect to a given reference sequence. The information is stored in a series of tab-delimited ASCII columns. The full SAM format specification is available from samtools.
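A sketch of reading one SAM alignment line follows. It relies on the eleven mandatory tab-delimited columns and on two FLAG bits (0x4 for unmapped, 0x10 for reverse strand); `parse_sam_line` and the sample line are hypothetical, and optional tag columns are ignored.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam_line(line: str):
    """Parse one SAM alignment line; return None for header lines."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("SAM lines need at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["unmapped"] = bool(record["flag"] & 0x4)   # read did not align
    record["reverse"] = bool(record["flag"] & 0x10)   # aligned to reverse strand
    return record


# Made-up alignment line with FLAG 16 (reverse strand).
sam_line = "read_1\t16\tchr1\t10500\t42\t8M\t*\t0\t0\tATTGGCTA\tIIIIIIII"
alignment = parse_sam_line(sam_line)
print(alignment["rname"], alignment["pos"], alignment["mapq"], alignment["reverse"])
```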
BAM files are compressed binary versions of SAM files. A file in the BAM binary format (.bam) is obtained by converting a SAM file into a BAM file. Check samtools for the BAM format specification and the tools for post-processing the alignment.
The Protein Data Bank (PDB) file format is a textual file format describing the three-dimensional structures of molecules. The .pdb format is a standard for files containing atomic coordinates. Read more about PDB in this article.
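Since PDB is a fixed-column text format, coordinates can be pulled out by slicing each ATOM or HETATM line: x, y, and z occupy columns 31-38, 39-46, and 47-54. The sketch below assumes those standard columns; `atom_coordinates` is a hypothetical helper and the sample record is built in code so the columns line up.

```python
def atom_coordinates(pdb_lines):
    """Return (x, y, z) tuples from ATOM/HETATM records of a PDB file."""
    coordinates = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            x = float(line[30:38])  # columns 31-38
            y = float(line[38:46])  # columns 39-46
            z = float(line[46:54])  # columns 47-54
            coordinates.append((x, y, z))
    return coordinates


# Made-up ATOM record, formatted so every field lands in its column.
sample = (f"ATOM  {1:5d} {'CA':<4s} {'MET':3s} {'A':1s}{1:4d}    "
          f"{1.5:8.3f}{-2.25:8.3f}{10.0:8.3f}")
points = atom_coordinates(["HEADER    EXAMPLE", sample, "END"])
print(points)
```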
The gss format was introduced to allow for the multi-scale system in GMOL. The gss file format allows for the visualization of a genome at multiple scales instead of having to view each PDB structure file individually. A description of the gss file is given in this article.
- Create reference genome index
- Mapping raw FASTQ files
- Filter a BAM alignment file
- Convert a BAM file to Medium file format
- HiC-Express