Data Format

GenomeFlow generates a text file in the medium file format.

Medium file format

A whitespace-separated file that contains, on each line:

<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>

str = strand (0 for forward, anything else for reverse)
chr = chromosome (must be a chromosome in the genome)
pos = position
frag = restriction site fragment
mapq = mapping quality score

If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. If not using mapping quality filter, mapq will be ignored. readname and strand are also not currently stored within .hic files.
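As a minimal sketch of how one line in this format could be read, the Python snippet below splits a record into named fields. The `MediumRecord` type and the `parse_medium_line` function are hypothetical names introduced for illustration only; they are not part of GenomeFlow, and the example values are made up.

```python
from typing import NamedTuple


class MediumRecord(NamedTuple):
    """One contact from a medium format file (hypothetical helper type)."""
    readname: str
    str1: int
    chr1: str
    pos1: int
    frag1: int
    str2: int
    chr2: str
    pos2: int
    frag2: int
    mapq1: int
    mapq2: int


def parse_medium_line(line: str) -> MediumRecord:
    """Split one whitespace-separated medium format line into typed fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    readname, str1, chr1, pos1, frag1, str2, chr2, pos2, frag2, mapq1, mapq2 = fields
    return MediumRecord(readname, int(str1), chr1, int(pos1), int(frag1),
                        int(str2), chr2, int(pos2), int(frag2),
                        int(mapq1), int(mapq2))


# Example record with made-up values.
record = parse_medium_line("read1 0 chr1 10000 0 16 chr1 250000 1 30 42")
is_reverse_2 = record.str2 != 0  # any non-zero strand value means reverse
```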

The test data is from the GM06990 cell line and can be downloaded from the link below:

Other input file formats accepted by GenomeFlow are:

Short format

A whitespace-separated file that contains, on each line:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>

str = strand (0 for forward, anything else for reverse)
chr = chromosome (must be a chromosome in the genome)
pos = position
frag = restriction site fragment

If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. readname and strand are also not currently stored within .hic files.

Short with score format

This format is useful for reading in already processed files, e.g. those that have already been binned and/or normalized. It can easily be used in conjunction with the -r flag to create a .hic file that contains a single resolution.

A whitespace-separated file that contains, on each line:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <score>

str = strand (0 for forward, anything else for reverse)
chr = chromosome (must be a chromosome in the genome)
pos = position
frag = restriction site fragment
score = the score imputed to this read

If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. readname and strand are also not currently stored within .hic files.
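The following Python sketch shows one way already binned contacts could be written out as short with score lines. The function name `to_short_with_score` is hypothetical, and setting strand to 0 and frag to the placeholder values 0 and 1 is an assumption made for this illustration, not something the format description above prescribes.

```python
def to_short_with_score(chrom1, bin_start1, chrom2, bin_start2, score):
    """Format one binned contact as a short with score line.

    Strand is set to 0 and frag to the placeholder values 0 and 1;
    both choices are assumptions made for this illustration.
    """
    return f"0 {chrom1} {bin_start1} 0 0 {chrom2} {bin_start2} 1 {score}"


# Made-up binned contacts: (chr1, bin start 1, chr2, bin start 2, score).
binned = [("chr1", 0, "chr1", 1000000, 12.5),
          ("chr1", 1000000, "chr1", 2000000, 3.0)]
lines = [to_short_with_score(*contact) for contact in binned]
```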

Long format

The long format is used by Juicer, so the merged_nodups.txt file can be used directly as input.

A whitespace-separated file that contains, on each line:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>

str = strand (0 for forward, anything else for reverse)
chr = chromosome (must be a chromosome in the genome)
pos = position
frag = restriction site fragment
mapq = mapping quality score
cigar = cigar string as reported by aligner
sequence = DNA sequence

If not using the restriction site file option, frag will be ignored, but please see above note on dummy values. If not using mapping quality filter, mapq will be ignored. readname, strand, cigar, and sequence are also not currently stored within .hic files.
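Because the long format carries every field of the medium format, a long record can be reduced to a medium one by reordering columns. The sketch below assumes the 16-column layout listed above and uses readname1 as the read name; `long_to_medium` is a hypothetical helper, not a GenomeFlow function, and the sample values are made up.

```python
def long_to_medium(line: str) -> str:
    """Reorder one long format line into the medium format column order."""
    fields = line.split()
    if len(fields) != 16:
        raise ValueError(f"expected 16 fields, got {len(fields)}")
    readname = fields[14]                  # readname1
    mapq1, mapq2 = fields[8], fields[11]   # cigar and sequence are dropped
    return " ".join([readname] + fields[0:8] + [mapq1, mapq2])


long_line = ("0 chr1 10000 0 16 chr1 250000 1 "
             "30 50M ACGT 42 50M TTGA read1/1 read1/2")
medium_line = long_to_medium(long_line)
```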

4DN DCIC format

A file that follows the 4DN DCIC format specification. See the specification for more information. Briefly, there should be a header with the first seven columns reserved:
## pairs format v1.0
#columns: readID chr1 position1 chr2 position2 strand1 strand2
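A reader for this layout can pick the column names up from the `#columns:` header line and skip the other header lines. The snippet below is a sketch under that assumption; `read_pairs` is a hypothetical helper, the data row is made up, and splitting on any whitespace is a simplification chosen here.

```python
import io


def read_pairs(handle):
    """Return (column names, list of row dicts) from a pairs-style text stream."""
    columns = []
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#columns:"):
            columns = line[len("#columns:"):].split()
        elif line.startswith("#") or not line:
            continue  # other header lines and blank lines are skipped
        else:
            rows.append(dict(zip(columns, line.split())))
    return columns, rows


example = io.StringIO(
    "## pairs format v1.0\n"
    "#columns: readID chr1 position1 chr2 position2 strand1 strand2\n"
    "read1\tchr1\t10000\tchr1\t250000\t+\t-\n"
)
columns, rows = read_pairs(example)
```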

Please refer here for more details about all the file formats.

hic file

The .hic file is a binary file containing compressed contact matrices at many resolutions and normalized by different methods, facilitating visualization and analysis at multiple scales. The .hic file format is described extensively in Durand and Shamim et al., 2016.

To create a .hic file, see Convert mapped Hi-C reads to .hic file.

Sparse contact matrix

A contact matrix in sparse matrix format contains three columns of data. Each line represents a contact by three numbers separated by whitespace: <position1> <position2> <interaction frequency>
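As a small illustration, the three columns can be loaded into a dictionary keyed by the position pair. `read_sparse_matrix` is a hypothetical helper, the values are made up, and treating positions as integers is an assumption.

```python
import io


def read_sparse_matrix(handle):
    """Map (position1, position2) to interaction frequency."""
    contacts = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pos1, pos2, freq = int(fields[0]), int(fields[1]), float(fields[2])
        contacts[(pos1, pos2)] = freq
    return contacts


example = io.StringIO("0 0 5\n0 1000000 2\n\n1000000 2000000 1.5\n")
contacts = read_sparse_matrix(example)
```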

Square contact matrix

An N by N contact matrix derived from Hi-C data, where N is the number of equal-sized regions of a chromosome. It is the full matrix, with one entry for every pair of regions.
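The relationship between the sparse and square representations can be sketched as follows. The snippet assumes that sparse positions are bin start coordinates at a fixed resolution and that the matrix is symmetric; `sparse_to_square` is a hypothetical name and the contacts are made up.

```python
def sparse_to_square(contacts, resolution, n_bins):
    """Expand {(pos1, pos2): frequency} into a symmetric n_bins x n_bins matrix."""
    matrix = [[0.0] * n_bins for _ in range(n_bins)]
    for (pos1, pos2), freq in contacts.items():
        i, j = pos1 // resolution, pos2 // resolution
        matrix[i][j] = freq
        matrix[j][i] = freq  # mirror the entry across the diagonal
    return matrix


contacts = {(0, 0): 5.0, (0, 1000000): 2.0, (1000000, 2000000): 1.0}
square = sparse_to_square(contacts, resolution=1000000, n_bins=3)
```

Pairs that do not appear in the sparse file are left at zero in the square matrix.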

FASTA file

A FASTA file contains, for each read, a line beginning with '>' that gives the read name, followed by the sequence of the read (a string of A, C, T, G, or N). An example of one of these reads for RNA-Seq might be:

>Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA

Read more about FASTA files in the article.
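A minimal FASTA reader only needs to tell header lines (starting with `>`) from sequence lines. The sketch below joins multi-line sequences under their read name; `read_fasta` is a hypothetical helper and the records are made up.

```python
import io


def read_fasta(handle):
    """Return {read name: sequence}, joining sequences split over several lines."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = io.StringIO(">read1\nATTGGCTA\nATTGGCTA\n>read2\nNNACGT\n")
sequences = read_fasta(example)
```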

FASTQ file

FASTQ is another DNA sequence file format that extends the FASTA format with the ability to store sequence quality. The quality scores are represented as ASCII characters, with '!' being the lowest and '~' the highest, in increasing ASCII value. A record looks something like this:

@Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Read more about FASTQ files in the article.
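To make the quality encoding concrete, the sketch below converts quality characters to numeric scores. It assumes the common Phred+33 offset, under which '!' maps to 0 and '~' to 93; other offsets exist, and `phred33_scores` is a hypothetical helper name.

```python
def phred33_scores(quality: str):
    """Convert a quality string to Phred scores, assuming an ASCII offset of 33."""
    return [ord(ch) - 33 for ch in quality]


scores = phred33_scores("!5C~")
# A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
error_probabilities = [10 ** (-q / 10) for q in scores]
```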

SAM file

SAM is a format for representing sequence alignment information from a read aligner. SAM can also store unaligned sequence information. It represents sequence information with respect to a given reference sequence. The information is stored in a series of tab-delimited ASCII columns. The full SAM format specification is available at SAM samtools.
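As an illustration of the tab-delimited layout, the sketch below names the eleven mandatory SAM columns of one alignment line and decodes two bits of the FLAG field (0x4 for unmapped, 0x10 for reverse strand). `parse_sam_line` is a hypothetical helper and the alignment is made up; for real data a dedicated library or samtools is the safer choice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields and optional tags."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(parts)}")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    record["tags"] = parts[11:]
    return record


sam_line = "read1\t16\tchr1\t10000\t30\t8M\t*\t0\t0\tATTGGCTA\tIIIIIIII\tNM:i:0"
alignment = parse_sam_line(sam_line)
```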

BAM file

BAM files are compressed binary versions of SAM files. A file in the BAM binary format (.bam) is obtained by converting a SAM file into a BAM file. Check samtools for the BAM format specification and the tools for post-processing the alignment.

Protein Data Bank (pdb) file format

The Protein Data Bank (pdb) file format is a textual file format describing the three-dimensional structures of molecules. The .pdb format is a standard for files containing atomic coordinates. Read more about pdb in this article.
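Coordinates in a pdb file sit in fixed character columns of each ATOM record: x in columns 31-38, y in 39-46, and z in 47-54. The sketch below builds one made-up record with a format string, so that the column positions are exact, and reads the coordinates back. `atom_coordinates` is a hypothetical helper, and the atom and residue names are placeholders.

```python
def atom_coordinates(line: str):
    """Return (x, y, z) from the fixed columns of an ATOM or HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return float(line[30:38]), float(line[38:46]), float(line[46:54])


# Build a record column by column: record name, serial, atom name, altLoc,
# residue name, chain, residue number, insertion code, then x, y, z.
atom_line = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}".format(
    "ATOM", 1, " CA", "", "ALA", "A", 1, "", 1.0, 2.5, -3.25)
x, y, z = atom_coordinates(atom_line)
```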

Genome Scale System (gss) file format

This format was introduced to allow for the multi-scale system in GMOL. The gss file format allows for the visualization of a genome at multiple scales instead of having to view each PDB structure file individually. A description of the gss file is given in this article.
