
Data Format

oluwatosin oluwadare edited this page Mar 1, 2018 · 21 revisions

GenomeFlow generates a text file in the medium file format.

Medium file format

A whitespace-separated file that contains, on each line:

<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>

  • str = strand (0 for forward, anything else for reverse)
  • chr = chromosome (must be a chromosome in the genome)
  • pos = position
  • frag = restriction site fragment
  • mapq = mapping quality score

If the restriction site file option is not used, frag is ignored, but please see the note above on dummy values. If the mapping quality filter is not used, mapq is ignored. The readname and strand fields are also not currently stored within .hic files.
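As a minimal sketch of how one line of this format could be read, the snippet below splits a line into the 11 fields listed above. The `Contact` container, the function names, and the example line are hypothetical and not part of GenomeFlow; the sketch also assumes that strand, position, fragment, and mapq values are integers.

```python
from typing import NamedTuple


class Contact(NamedTuple):
    """One read pair from a medium-format line (hypothetical container)."""
    readname: str
    str1: int
    chr1: str
    pos1: int
    frag1: int
    str2: int
    chr2: str
    pos2: int
    frag2: int
    mapq1: int
    mapq2: int


def parse_medium_line(line: str) -> Contact:
    """Split one whitespace-separated line into its 11 fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    name, s1, c1, p1, f1, s2, c2, p2, f2, q1, q2 = fields
    # Assumption: every field except readname and chr is an integer.
    return Contact(name, int(s1), c1, int(p1), int(f1),
                   int(s2), c2, int(p2), int(f2), int(q1), int(q2))


def is_forward(strand: int) -> bool:
    """Strand 0 is forward; any other value is reverse."""
    return strand == 0


# Made-up example line, for illustration only.
example = "read1 0 chr1 10000 0 16 chr1 250000 1 30 42"
contact = parse_medium_line(example)
print(contact.chr1, contact.pos1, is_forward(contact.str1))
```

A named tuple keeps the field order identical to the line layout, so a record can be written back out by joining its values with spaces.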

The test data is the GM06990 cell line data, which can be downloaded from the link below:

More details about other file formats can be found here


hic file

The .hic file is a binary file containing compressed contact matrices at many resolutions, normalized by different methods, which facilitates visualization and analysis at multiple scales. The .hic file format is described extensively in Durand and Shamim et al., 2016.

To create a .hic file, see Convert mapped Hi-C reads to .hic file.

Sparse contact matrix

A contact matrix in sparse matrix format contains three columns of data. Each line represents a contact by three numbers: position1, position2, and interaction frequency.

Square contact matrix

An N by N contact matrix derived from Hi-C data, where N is the number of equal-sized regions of a chromosome. It is a full matrix in which entry (i, j) holds the interaction frequency between regions i and j.
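The relationship between the sparse and square forms can be sketched as follows. This is an illustration under stated assumptions, not GenomeFlow's own conversion: it assumes positions are base-pair coordinates that are binned by a fixed `resolution`, that the matrix is symmetric, and that N can be taken from the largest bin observed. The function names and the sample data are hypothetical.

```python
def parse_sparse(text):
    """Yield (position1, position2, frequency) from whitespace-separated lines."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        yield int(fields[0]), int(fields[1]), float(fields[2])


def sparse_to_square(triples, resolution):
    """Build a symmetric N x N list-of-lists matrix from sparse triples."""
    # Assumption: dividing a position by the resolution gives its bin index.
    bins = [(p1 // resolution, p2 // resolution, freq) for p1, p2, freq in triples]
    if not bins:
        return []
    # Assumption: N is inferred from the largest bin index seen in the data.
    n = max(max(i, j) for i, j, _ in bins) + 1
    matrix = [[0.0] * n for _ in range(n)]
    for i, j, freq in bins:
        matrix[i][j] = freq
        matrix[j][i] = freq  # assumed symmetric
    return matrix


# Made-up sparse matrix at 1 Mb resolution, for illustration only.
sparse_text = """0 0 5
0 1000000 2
1000000 2000000 3"""
matrix = sparse_to_square(parse_sparse(sparse_text), resolution=1_000_000)
for row in matrix:
    print(row)
```

In practice N would normally come from the chromosome length rather than from the data, since trailing regions without contacts would otherwise be dropped.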

FASTA file

A FASTA file contains a read name, on a line starting with '>', followed by the sequence of the read (a string of A, C, T, G, or N). An example of one of these reads for RNA-Seq might be:

>Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA

Read more about FASTA files in the article.
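A minimal sketch of reading records in this layout is shown below. It assumes each record starts with a '>' line and that the sequence may span several lines; the function name and the sample records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence lines are concatenated
    if name is not None:
        yield name, "".join(chunks)


# Made-up records, for illustration only.
fasta_lines = [">read1", "ATTGGCTA", "ATTG", ">read2", "ACGTN"]
records = list(read_fasta(fasta_lines))
print(records)
```

The same function works on an open file handle, since a file object is also an iterable of lines.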

FASTQ file

FASTQ is another DNA sequence file format that extends the FASTA format with the ability to store the sequence quality. The quality scores are often represented as ASCII characters, one per base, with '!' being the lowest and '~' being the highest, in increasing ASCII value. It would look something like this:

@Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTG
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Read more about FASTQ files in the article.
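The quality encoding can be illustrated with a short sketch. It assumes the common four-line record layout and the Phred+33 convention, in which the score of a character is its ASCII code minus 33, so '!' maps to 0 and '~' to 93. Other offsets exist, so the `offset` parameter is an assumption to check against the data source. The function names and the sample record are hypothetical.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines]
    stripped = [line for line in stripped if line]  # drop blank lines
    for k in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[k:k + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {k + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Made-up record, for illustration only.
fastq_lines = ["@read1", "ATTG", "+", "!'5~"]
name, seq, qual = next(read_fastq(fastq_lines))
scores = phred_scores(qual)
print(name, seq, scores)
```

The length check reflects the requirement that every base has exactly one quality character.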

SAM file

SAM is a format for representing sequence alignment information from a read aligner. SAM can also store unaligned sequence information. It represents sequence information with respect to a given reference sequence. The information is stored in a series of tab-delimited ASCII columns. The full SAM format specification is available at SAM samtools.
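As a sketch of what the tab-delimited columns look like in practice, the snippet below splits one alignment line into the 11 mandatory fields named in the SAM specification and checks two FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The helper names and the example line are made up for illustration; for real work a dedicated library or samtools itself is the safer choice.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Return a dict of the mandatory fields, or None for a header line."""
    if line.startswith("@"):
        return None  # header lines start with '@'
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("alignment line has fewer than 11 columns")
    record = {
        field: int(value) if field in INT_FIELDS else value
        for field, value in zip(SAM_FIELDS, columns)
    }
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, kept as text
    return record


def is_unmapped(flag):
    return bool(flag & 0x4)


def is_reverse(flag):
    return bool(flag & 0x10)


# Made-up alignment line, for illustration only.
sam_line = "read1\t16\tchr1\t10000\t30\t8M\t*\t0\t0\tATTGGCTA\t!''*((((\tNM:i:0"
record = parse_sam_line(sam_line)
print(record["RNAME"], record["POS"], is_reverse(record["FLAG"]))
```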

BAM file

BAM files are compressed binary versions of SAM files. A file in the BAM binary format (.bam) is obtained by converting a SAM file into a BAM file. Check samtools for the BAM format specification and the tools for post-processing the alignment.

Protein Data Bank (pdb) file format

The Protein Data Bank (pdb) file format is a textual file format describing the three-dimensional structures of molecules. The .pdb format is a standard for files containing atomic coordinates. Read more about pdb in this article.
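To show how atomic coordinates are laid out, here is a small sketch that reads x, y, and z from an ATOM or HETATM record. It relies on the fixed-width columns of the PDB format, where the coordinates occupy columns 31-38, 39-46, and 47-54. The function name and the example record are made up for illustration, and the record is built with a format string so that its columns line up.

```python
def parse_atom_coordinates(line):
    """Return (x, y, z) from an ATOM/HETATM record, or None for other records."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    # Fixed-width columns 31-38, 39-46, 47-54 (1-based) hold x, y, z.
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])
    return x, y, z


# Made-up record, assembled column by column for illustration only.
atom_line = "ATOM  {:>5d} {:<4s} {:>3s} {}{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", "MET", "A", 1, 27.340, 24.430, 2.614
)
coords = parse_atom_coordinates(atom_line)
print(atom_line)
print(coords)
```

Slicing by column rather than splitting on whitespace matters here, because adjacent numeric fields in a PDB record can touch without a separating space.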

Genome Scale System (gss) file format

This format was introduced to allow for the multi-scale system in GMOL. The gss file format allows for the visualization of a genome at multiple scales instead of having to view each PDB structure file individually. A description of the gss file is given in this article.
