Noble-Lab/hic-helper

HiC Helper

Welcome to the HiC Helper repository. This repository contains a collection of tools and scripts for processing Hi-C (High-throughput chromosome conformation capture) data.

HiC Helper provides a set of utilities to preprocess, analyze, and visualize Hi-C data. Whether you are working with raw Hi-C data or processed matrices, this toolkit aims to simplify the analysis pipeline and provide useful functions for exploring chromatin interactions.

Please refer to the documentation and code examples in this repository to learn more about the available tools and how to use them effectively.

Happy HiC processing!

Copyright (C) 2024 Xiao Wang, Anupama Jha, Tangqi Fang and the University of Washington.

License: GPL v3.

Contact: William Stafford Noble (wnoble@uw.edu) and Sheng Wang (swang@cs.washington.edu).

For technical problems or questions, please reach out to Xiao Wang (wang3702@uw.edu) and Anupama Jha (anupamaj@uw.edu).

Add a pointer to readthedocs here, and then move the text in the subsequent section into readthedocs. ---Bill

Installation

Configure software dependency

conda env create -f environment.yml

Configure data dependency (optional)

bash set_up.sh

Scope

This repo aims to serve as a library to allow lab members to easily process, analyze and visualize Hi-C and other associated data. Each script provides one simple function, accessible via the command line.

Organization

pre_processing
This directory includes pre-processing steps for Hi-C and associated data. Specifically, it includes processing for raw data (.fastq, .bam, .pairs), file format conversion (.hic, .cool, .pkl, .bw), normalization, and merging.

downstream
This directory includes scripts for downstream analysis, including peak calling, loop detection, TAD detection, loop enrichment, and loop/peak overlap comparison.

analysis
This directory includes analysis scripts for several data types, covering quality control, peak/loop distribution analysis, coverage analysis, correlation analysis, total read counts, peak/loop overlap analysis, and peak/loop strength analysis.

visualization
This directory includes visualization scripts, covering Hi-C submatrix visualization, loop visualization, loop APA, 1D signal visualization, and contact statistics visualization.

Delete everything below this line. Instead, use Sphinx to automatically generate the descriptions below from the comments in the Python code. ---Bill


HiC Data Format Conversion

1. cool2array.py

cool2array.py
This script converts the .cool format to a dict of arrays format.

python3 cool2array.py [input.cool] [output.pkl] [mode]

This is the full cool2array script; it converts both intra- and inter-chromosome regions to array format.
The output is saved in a pickle file as a dict in [chrom1_chrom2]:[array] format.
Four modes are supported:

0: scipy coo_array format output; 
1: numpy array format output;
2: normed scipy coo_array format output; 
3: normed numpy array format output.

To select a resolution, pass the cool path as [xx.cool::resolutions/5000]; here the resolution is 5 kb. Change the number to select the resolution you want to work with.
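The resolution suffix and the mode argument can be assembled programmatically. The sketch below is only an illustration: the helper names cool_uri and cool2array_argv are hypothetical and not part of this repository; only the positional order and the mode numbers come from the description above.

```python
# Mode numbers as listed above for cool2array.py.
COOL2ARRAY_MODES = {
    0: "scipy coo_array",
    1: "numpy array",
    2: "normed scipy coo_array",
    3: "normed numpy array",
}


def cool_uri(path, resolution=None):
    """Return the cool path, with the resolution group appended if given."""
    if resolution is None:
        return path
    return f"{path}::resolutions/{int(resolution)}"


def cool2array_argv(input_cool, output_pkl, mode, resolution=None):
    """Assemble the command line in the documented positional order.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    if mode not in COOL2ARRAY_MODES:
        raise ValueError(f"mode must be one of {sorted(COOL2ARRAY_MODES)}")
    return [
        "python3",
        "cool2array.py",
        cool_uri(input_cool, resolution),
        output_pkl,
        str(mode),
    ]


print(" ".join(cool2array_argv("xx.cool", "output.pkl", 1, resolution=5000)))
```

The resulting list can be passed to subprocess.run if you prefer to launch the script from Python.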

2. hic2array.py

hic2array.py

python3 hic2array.py [input.hic] [output.pkl] [resolution] [normalization_type] [mode]

This is the full hic2array script; it converts both intra- and inter-chromosome regions to array format.
The output is saved in a pickle file as a dict in [chrom1_chrom2][norm_type]:[array] format.
[resolution] specifies the resolution of the arrays stored in the output.
[normalization_type] supports the following types:

0: NONE normalization applied, save the raw data to array.
1: VC normalization; 
2: VC_SQRT normalization; 
3: KR normalization; 
4: SCALE normalization;
or a combination of types as a comma-separated list. (e.g. 0,3)

Four modes are supported, which differ in the saved format:

0: scipy coo_array format output; 
1: numpy array format output;
2: scipy csr_array format output (only includes intra-chromosome regions).
3: numpy array format output (only includes intra-chromosome regions).

If you do not need to keep norm_type in the dict, use hic2array_simple.py, which saves the dict in [chrom1_chrom2]:[array] format.
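As a rough illustration of the [normalization_type] argument and the nested output dict, the sketch below maps the numeric codes to their names and looks up one matrix. The helper names are hypothetical, and the exact key spelling in the pickle (for example "chr1_chr1" and "KR") is an assumption, so check it against a real output file.

```python
# Codes as listed above for hic2array.py.
NORM_TYPES = {0: "NONE", 1: "VC", 2: "VC_SQRT", 3: "KR", 4: "SCALE"}


def parse_norm_types(arg):
    """Turn a comma-separated code list such as "0,3" into normalization names."""
    names = []
    for token in arg.split(","):
        code = int(token.strip())
        if code not in NORM_TYPES:
            raise ValueError(f"unknown normalization code: {code}")
        name = NORM_TYPES[code]
        if name not in names:  # ignore repeated codes
            names.append(name)
    return names


def get_matrix(data, chrom1, chrom2, norm_type):
    """Look up one matrix in a dict shaped like [chrom1_chrom2][norm_type]:[array].

    The key spelling used here is an assumption, not confirmed by the scripts.
    """
    return data[f"{chrom1}_{chrom2}"][norm_type]


# Toy stand-in for a loaded pickle; nested lists replace real arrays.
toy = {"chr1_chr1": {"NONE": [[2, 1], [1, 3]], "KR": [[0.9, 0.4], [0.4, 1.1]]}}

print(parse_norm_types("0,3"))
print(get_matrix(toy, "chr1", "chr1", "KR"))
```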

3. array2hic.py

array2hic.py
This script is used to convert dict of arrays format to .hic format.

python3 array2hic.py [input.pkl] [output.hic] [resolution] [refer_genome_name] [mode]

The input should be a pickle file containing a dict in [chrom1_chrom2]:[array] format for common mode. Here the array should be a scipy sparse array or a numpy array.
For intra-chromosome only, the dict format can be [chrom]:[array] in pickle files.
[output.hic] is the name of the output hic file.
[resolution] specifies the resolution of the data stored in the arrays.
[refer_genome_name] is used to specify the reference genome name. For example, "hg38","hg19","mm10" are valid inputs.
[mode]: 0: all chromosome mode (scipy sparse array); 1: intra-chromosome mode (scipy sparse array); 2: all chromosome mode (numpy array); 3: intra-chromosome mode (numpy array).

4. array2png.py

array2png.py
This script is used to visualize Hi-C images from the dict of arrays format.

python3 array2png.py [input.pkl] [output.png] [chrom1] [start_index1] [end_index1] [chrom2] [start_index2] [end_index2] [resolution] [max_value] [mode]

This is the full array2png script.
[input.pkl] is the path to the pickle file containing the array.
[input.pkl] format: [chrom1_chrom2]:[array] format for common mode. Here the array should be a scipy sparse array.
For intra-chromosome only, the dict format can be [chrom]:[array] in pickle files.
[output.png] is the name of the output png file.
[chrom1] is the name of the first chromosome.
[start_index1] is the start index of the first chromosome.
[end_index1] is the end index of the first chromosome.
[chrom2] is the name of the second chromosome.
[start_index2] is the start index of the second chromosome.
[end_index2] is the end index of the second chromosome.
[resolution] is the resolution of the input array.
[max_value] is the maximum threshold of the input array for figures.
[mode] is 0 for raw visualization, 1 for log visualization.
All index inputs should be absolute genomic positions counted in bases.
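Because the indices are given in bases while the array is binned at [resolution], it can help to check which bins a region covers before plotting. The following is a minimal sketch; the rounding rule (floor for the start, ceiling for the end) is an assumption and may differ from what array2png.py does internally.

```python
def bp_to_bin_range(start_bp, end_bp, resolution):
    """Convert a base-pair interval to a half-open range of bin indices.

    Assumed rounding: the start is rounded down and the end is rounded up,
    so every bin that touches the interval is included.
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    if start_bp < 0 or end_bp <= start_bp:
        raise ValueError("expected 0 <= start_bp < end_bp")
    start_bin = start_bp // resolution
    end_bin = -(-end_bp // resolution)  # ceiling division on integers
    return start_bin, end_bin


start_bin, end_bin = bp_to_bin_range(1_000_000, 2_000_000, 10_000)
print(start_bin, end_bin, end_bin - start_bin)
```

For a 1 Mb region at 10 kb resolution this gives 100 bins per axis, which is a quick sanity check on the expected image size.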

5. hiccups_loop.py

hiccups_loop.py
Use HiCCUPS to detect loops from Hi-C input.

python3 hiccups_loop.py [hicFile] [output_dir] [resolution]

[hicFile]: the path to the input hic file [String].
[output_dir]: the directory to the output loops [String].
[resolution]: the resolution of the input hic file [Integer].
Currently only resolutions of 5000, 10000, and 25000 are supported.
The output loop bedpe file will be saved in [output_dir]/merged_loops.bedpe.

6. loop_f1.py

loop_f1.py
Compute F1 metrics between predicted loops and ground-truth loops.

python3 loop_f1.py [true.bed] [pred.bed] [resolution]

[true.bed]: the true loops, in bed format
[pred.bed]: the predicted loops, in bed format
[resolution]: the resolution of the Hi-C data
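To make the metric concrete, here is a small sketch of an F1 computation over two loop lists. The matching rule is an assumption (same chromosomes, both anchors within one [resolution] of each other, each true loop matched at most once); loop_f1.py may use a different rule. Loops are represented as (chrom1, pos1, chrom2, pos2) tuples for illustration.

```python
def loop_f1(true_loops, pred_loops, resolution):
    """Return (precision, recall, f1) under an assumed anchor-distance match."""
    unmatched = list(true_loops)
    matched = 0
    for pred in pred_loops:
        for true in unmatched:
            same_chroms = pred[0] == true[0] and pred[2] == true[2]
            close = (
                abs(pred[1] - true[1]) <= resolution
                and abs(pred[3] - true[3]) <= resolution
            )
            if same_chroms and close:
                unmatched.remove(true)  # each true loop is used only once
                matched += 1
                break
    precision = matched / len(pred_loops) if pred_loops else 0.0
    recall = matched / len(true_loops) if true_loops else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


true_loops = [
    ("chr1", 100_000, "chr1", 500_000),
    ("chr1", 200_000, "chr1", 900_000),
    ("chr2", 50_000, "chr2", 300_000),
]
pred_loops = [
    ("chr1", 105_000, "chr1", 495_000),
    ("chr2", 400_000, "chr2", 800_000),
]
print(loop_f1(true_loops, pred_loops, 5_000))
```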

7. fastq2bam.py

fastq2bam.py
This script is used to convert fastq file to bam file.

python3 fastq2bam.py [fastq_file1] [fastq_file2] [refer.fa] [output_dir]

[fastq_file1]: the first fastq file.
[fastq_file2]: the second fastq file.
[refer.fa]: the reference genome file. You can download the reference genome file from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/ for human.
You can also run set_up.sh to download the reference genome files for human and mouse.
[output_dir]: the output directory. The output file will be named 4DN.sorted.bam under this directory.

8. fastq_4dn.py

fastq_4dn.py
This script is used to convert fastq files to cool or hic files following 4DN's pipeline.

python3 fastq_4dn.py [fastq_file1] [fastq_file2] [refer.fa] [chrom_size_file] [output_dir] [mode] [number_cpu] [max_memory] [resume_flag]

[fastq_file1]: the first fastq file.
[fastq_file2]: the second fastq file.
[refer.fa]: the reference genome file. You can download the reference genome file from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/ for human.
You can also run set_up.sh to download the reference genome files for human and mouse.
[chrom_size_file]: the chromosome size file. Example file: hg38.chrom.sizes for human genome build GRCh38
[output_dir]: the output directory. The output file will be named 4DN.cool under this directory.
[mode]: the mode of the conversion. 0: convert to cool file; 1: convert to hic file
[number_cpu]: the number of CPUs to use
[max_memory]: the maximum memory to use (GB)
[resume_flag]: 0: do not resume; 1: resume from the previously generated files. Use 0 by default.
We recommend running with 8 cores and 64 GB of memory.

Example Command:

After you run set_up.sh to download the reference genome files, you can run the following command to convert example fastq files to hic/cool files.

  1. Convert fastq files to cool files:
python3 fastq_4dn.py ../reference_data/sample_data/GM12878_SRR1658581_1pc_1_R1.h10000.fastq.gz ../reference_data/sample_data/GM12878_SRR1658581_1pc_1_R2.h10000.fastq.gz ../reference_data/hg19.fa ../reference_data/hg19.chrom.sizes output_test 0 32 128 0

The output file will be named 4DN.cool under the "output_test" directory.

  2. Convert fastq files to hic files:
python3 fastq_4dn.py ../reference_data/sample_data/GM12878_SRR1658581_1pc_1_R1.h10000.fastq.gz ../reference_data/sample_data/GM12878_SRR1658581_1pc_1_R2.h10000.fastq.gz ../reference_data/hg19.fa ../reference_data/hg19.chrom.sizes output_test 1 32 128 0

The output file will be named 4DN.hic under the "output_test" directory.

9. bam_4dn.py

bam_4dn.py
This script is used to convert bam file to cool or hic files following 4DN's pipeline.

python3 bam_4dn.py [input.bam] [chrom_size_file] [output_dir] [mode] [number_cpu] [max_memory] [resume_flag]

[input.bam]: the input bam file.
[chrom_size_file]: the chromosome size file. Example file: hg38.chrom.sizes for human genome build GRCh38
[output_dir]: the output directory. The output file will be named 4DN.cool under this directory.
[mode]: the mode of the conversion. 0: convert to cool file; 1: convert to hic file
[number_cpu]: the number of CPUs to use
[max_memory]: the maximum memory to use (GB)
[resume_flag]: 0: do not resume; 1: resume from the previously generated files. Use 0 by default.
We recommend running with 8 cores and 64 GB of memory.

10. pairs_4dn.py

pairs_4dn.py
This script is used to convert pairs file to cool or hic files following 4DN's pipeline.

python3 pairs_4dn.py [input.pairs.gz] [chrom_size_file] [output_dir] [mode] [number_cpu] [max_memory] [resume_flag]

[input.pairs.gz]: the input pairs.gz file.
[chrom_size_file]: the chromosome size file. Example file: hg38.chrom.sizes for human genome build GRCh38
[output_dir]: the output directory. The output file will be named 4DN.cool under this directory.
[mode]: the mode of the conversion. 0: convert to cool file; 1: convert to hic file
[number_cpu]: the number of CPUs to use
[max_memory]: the maximum memory to use (GB)
[resume_flag]: 0: do not resume; 1: resume from the previously generated files. Use 0 by default.
We recommend running with 8 cores and 64 GB of memory.

11. run-fastqc.sh

run-fastqc.sh
This script is used for fastq file's quality analysis.

./run-fastqc.sh [input_fastq] [num_threads] [output_dir]

[input_fastq]: an input fastq file, either gzipped or not.
[num_threads]: number of threads to use.
[output_dir]: the output directory; it is created automatically if it does not exist.

12. pairs_qc.py

pairs_qc.py
This script is used to perform quality control on the pairs file.
You must successfully run ./set_up.sh before you can use this script.

python3 pairs_qc.py [input.pairs.gz] [chrom_size_file] [output_dir] [enzyme]
  • input.pairs.gz: input pairs file.
  • chrom_size_file: chrom size file.
  • output_dir: output directory.
  • enzyme: the type of restriction enzyme used for the Hi-C experiment, either 4 or 6 (4-cutter or 6-cutter).

13. bamqc

bamqc
This script is used to perform quality control on the bam files.
You must successfully run ./set_up.sh before you can use this script.

./bin/bamqc --outdir=[output_dir] --noextract -t 8 [bam_file/bam_file_dir]

[output_dir]: the output directory; create it before running this command.
[bam_file/bam_file_dir]: the file or directory that includes bam files.

14. array2cool.py

array2cool.py
This script is used to convert dict of arrays format to .cool format.

python3 array2cool.py [input.pkl] [output.cool] [resolution] [refer_genome_name] [mode]

The input should be a pickle file containing a dict in [chrom1_chrom2]:[array] format for common mode. Here the array should be a scipy sparse array.
For intra-chromosome only, the dict format can be [chrom]:[array] in pickle files.
[output.cool] is the name of the output cool file.
[resolution] specifies the resolution of the data stored in the arrays.
[refer_genome_name] is used to specify the reference genome name. For example, "hg38","hg19","mm10" are valid inputs.
[mode]: 0: all chromosome mode (scipy sparse array); 1: intra-chromosome mode (scipy sparse array); 2: all chromosome mode (numpy array); 3: intra-chromosome mode (numpy array).

15. bam_align_quality.py

bam_align_quality.py
This script calculates the alignment quality of a given bam file.

python3 bam_align_quality.py [input.bam] [output_dir] [number_cpu] [mode]

[input.bam]: the input bam file.
[output_dir]: the output directory.
[number_cpu]: the number of cpu used.
[mode]: 0 for unsorted bam file, 1 for sorted bam file.
The output includes stats of Unmapped, Low quality (mapq), Singleton, Multimapped, Duplicate, Other, Unique, and Total.

16. pairs_quality.py

pairs_quality.py
This script is used to analyze the quality of the pairs file.

python3 pairs_quality.py [input.pairs.gz] [output_dir] [number_cpu]

[input.pairs.gz]: the input pairs gz file from 4DN pipeline (*.marked.sam.pairs.gz).
[output_dir]: the output directory
[number_cpu]: the number of cpu used
The script will generate a report file in the output directory.
The report file will contain the following information:

  1. Unmapped sequences: the number of unmapped sequences and the percentage.
  2. Singleton sequences: the number of singleton sequences and the percentage.
  3. Multimapped sequences: the number of multimapped sequences and the percentage.
  4. Duplicate sequences: the number of duplicate sequences and the percentage.
  5. Unique sequences: the number of unique sequences and the percentage.
  6. Total sequences: the total number of sequences.
  7. Detailed Unique Sequences Information:
    RU sequences: the number of RU sequences and the percentage.
    UR sequences: the number of UR sequences and the percentage.
    UU sequences: the number of UU sequences and the percentage.
    For detailed definition of RU/UR/UU, please see https://pairtools.readthedocs.io/en/latest/formats.html#pair-types

17. addnorm2hic.py

addnorm2hic.py
This script is used to add normalization vectors to a hic file.

 python3 addnorm2hic.py [input.hic] [resolution] [num_cpu] [memory]

[input.hic]: input hic path.
[resolution]: the minimum resolution at which normalization is applied.
[num_cpu]: number of cpus used to normalize.
[memory]: maximum memory (GB) that this script is allowed to use.
The calculated norm vectors will be automatically saved in the input.hic file.

18. count_hic_read.py

count_hic_read.py
This script counts the total reads and the total non-diagonal reads, for cis contacts and for all contacts.

python3 count_hic_read.py [input.hic] [resolution] [normalization_type] 

[input.hic]: input hic path.
[resolution]: the resolution at which reads are counted.
[normalization_type]: should be an integer 0-4, corresponding to one of the following types:

0: NONE normalization applied, save the raw data to array.
1: VC normalization; 
2: VC_SQRT normalization; 
3: KR normalization; 
4: SCALE normalization.

19. extract_hicnorms.py

extract_hicnorms.py
This script is to extract the normalization vectors from a .hic file.

python3 extract_hicnorms.py [input.hic] [resolution] [normalization_type] [output_pkl]

[input.hic]: input hic path.
[resolution]: resolution to extract the normalization vector, [Integer].
[normalization_type]: should be one of the following: NONE, VC, VC_SQRT, KR, SCALE, [string].
[output_pkl]: output pickle file path.
The normalization vector is saved in dict format, where the key is the chromosome name and the value is the normalization vector.

20. loop_cleaner.py

loop_cleaner.py
This script filters out loops in low-mappability regions.

python3 loop_cleaner.py [input.bed] [mappability.bw] [output.bed] [threshold]
  • input.bed: the input bed file
  • mappability.bw: the mappability bigwig file
  • output.bed: the output bed file
  • threshold: the mappability threshold used to clean loops

21. merge_bigwig.py

merge_bigwig.py
This script is used to merge bigwig files into one bigwig file.

python3 merge_bigwig.py [input_dir] [output_bw] [refer_genome.sizes]

[input_dir]: the directory containing all the bigwig files.
[output_bw]: the output bigwig file.
[refer_genome.sizes]: the chromosome sizes of the reference genome.

22. bigwig2array.py

bigwig2array.py
This script converts bigwig file to array format specified by resolution.

python3 bigwig2array.py [input_bw] [output_pkl] [resolution]

[input_bw]: the input bigwig file.
[output_pkl]: the output pkl file with [chrom]:[signal] format.
[resolution]: the output resolution of the signal.

23. merge_bam.py

merge_bam.py
This script is used to merge bam files into one bam file.

python3 merge_bam.py [input_dir] [output_bam]

[input_dir]: the directory containing all the bam files.
[output_bam]: the output merged bam file.

24. bed_cleaner.py

bed_cleaner.py
This script merges overlapping regions from bed file.

python3 bed_cleaner.py [input_bed] [output_bed]

[input_bed]: the input bed file.
[output_bed]: the output bed file without overlapping regions.
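Merging overlapping regions is typically done with a sort followed by a single sweep. The sketch below shows that standard approach on (chrom, start, end) tuples; it is not taken from bed_cleaner.py, and treating adjacent (book-ended) regions as mergeable is an assumption.

```python
def merge_bed_intervals(intervals):
    """Merge overlapping or adjacent (chrom, start, end) intervals.

    Intervals are sorted by chromosome name and start, then swept once.
    Book-ended intervals such as [0, 10) and [10, 20) are merged here,
    which is an assumption about the desired behavior.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)  # extend the current merged region
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


regions = [
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr2", 100, 200),
    ("chr1", 400, 500),
]
for region in merge_bed_intervals(regions):
    print(*region, sep="\t")
```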

25. pkl_contact_stat.py

pkl_contact_stat.py
This script is used to plot contact frequency vs. genomic distance.

python3 pkl_contact_stat.py [input.pkl] [output.png] [genomic_dist]

[input.pkl]: the path to the pickle file containing the contact matrix [String].
[output.png]: the name of the output png file [String].
[genomic_dist]: the genomic distance for the plot [Integer].
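The quantity being plotted can be sketched as the mean contact count along each diagonal of an intra-chromosome matrix. The example below uses nested lists in place of a real array and reads [genomic_dist] as the maximum distance to report; both choices are assumptions made for illustration, not a description of pkl_contact_stat.py.

```python
def contact_by_distance(matrix, resolution, max_dist):
    """Return (distance_bp, mean_contact) pairs for each diagonal offset.

    matrix is a square, dense, intra-chromosome contact matrix.
    Offsets beyond the matrix size are skipped.
    """
    n = len(matrix)
    max_offset = min(n - 1, max_dist // resolution)
    curve = []
    for offset in range(max_offset + 1):
        diagonal = [matrix[i][i + offset] for i in range(n - offset)]
        curve.append((offset * resolution, sum(diagonal) / len(diagonal)))
    return curve


toy = [
    [10, 4, 1],
    [4, 12, 5],
    [1, 5, 8],
]
for distance, mean_contact in contact_by_distance(toy, 10_000, 20_000):
    print(distance, mean_contact)
```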

26. merge_hic.py

merge_hic.py
This script is to merge two hic files to a new merged hic file with specified resolution.

python3 merge_hic.py [hic_file1] [hic_file2] [output_hic] [resolution] [refer_genome]

[hic_file1]: the first hic file to be merged.
[hic_file2]: the second hic file to be merged.
[output_hic]: the output merged hic file.
[resolution]: the resolution of the output merged hic file.
[refer_genome]: the name of the reference genome. For example, hg38, hg19, mm10.

26. merge_pkl.py

merge_pkl.py
This script is to merge two pkl files to a new merged pkl file.

python3 merge_pkl.py [pkl_file1] [pkl_file2] [output_pkl]

[pkl_file1]: the first pkl file to be merged.
[pkl_file2]: the second pkl file to be merged.
[output_pkl]: the output merged pkl file.

27. loop_apa.py

loop_apa.py
This script plots the loop aggregate peak analysis (APA) on the hic matrix.

python3 loop_apa.py [hic.pkl] [input.bed] [output.png] [resolution] [window_size]
  • hic.pkl: the hic matrix file
  • input.bed: the input bed file including the loop regions
  • output.png: the output loop APA png file
  • resolution: the resolution of the hic matrix
  • window_size: the window size of the loop region

28. annotate_loop_gene.py

annotate_loop_gene.py
This script annotates loops with gene information.

python3 annotate_loop_gene.py [input.bed] [gene_annotation] [output.bed]

input.bed: the input bed file that contains the loop information.
gene_annotation: the gene annotation file format:.gtf, like hg38.ncbiRefSeq.gtf.
output.bed: the output bed file that contains the annotated loop information.
The last two columns in the output.bed file are the closest gene and the distance to the loop (corresponds to x and y).

29. hic_coverage.py

hic_coverage.py
This script calculates the coverage of the Hi-C data.

python3 hic_coverage.py [input.pkl]

[input.pkl]: the input pkl file containing the Hi-C data

30. count_1d_read.py

count_1d_read.py
This script is to calculate the average read count per base and total read count in the bigwig file.

python3 count_1d_read.py [input_bw]

[input_bw]: the input bigwig file.

31. peak_f1.py

peak_f1.py
This script is to compare two peak files to calculate the F1 score.

python3 peak_f1.py [true.bed] [pred.bed] [max_dist]

[true.bed]: the true peaks, in bed format
[pred.bed]: the predicted peaks, in bed format
[max_dist]: the maximum distance to match the peaks

32. bigwig_scan_distribution.py

bigwig_scan_distribution.py
This script plots the peak distribution of the bigwig file.

python3 bigwig_scan_distribution.py [input.bw] [window_size] [stride] [output_fig] [mode]

[input.bw]: the input bigwig file.
[window_size]: the window size for analyzing peak distribution.
[stride]: the stride for analyzing peak distribution.
[output_fig]: the output figure path to show the peak distribution.
[mode]: 0:raw_value, 1:log10_value.

33. peak_call_bigWig.py

peak_call_bigWig.py
This script is to call peaks from the bigWig file. The bigWig file is generated from the bam file by using the deepTools bamCoverage.

python3 peak_call_bigWig.py -t input.bw -c control.bw -o output_dir -q [qval] --min_length [min_length] --thread [thread] [--broad] [--broad_cutoff=[broad_cutoff]]
  • t: path to the treatment bigWig file.
  • c: path to the control bigWig file.
  • o: output directory.
  • q: q-value cutoff (minimum FDR) for peak calling.
  • p: p-value cutoff for peak calling, if provided, q-value will be ignored.
  • min_length: minimum length of the peak, can be set with fragment size.
  • thread: number of threads to use.
  • broad: use broad peak calling.
  • broad_cutoff: cutoff for broad peak calling (interpreted as a p-value or a q-value, depending on whether you chose -p or -q).
    The output file is a bed file with the peak information.

34. peak_overlap.py

peak_overlap.py
This script is to compare two bed files and find the overlapping peaks.

python3 peak_overlap.py [input1.bed] [input2.bed] [overlap_ratio] [output_dir]
  • input1.bed: the first input bed file
  • input2.bed: the second input bed file
  • overlap_ratio: the ratio of overlap to consider as overlap
  • output_dir: the output directory
  • The output files are overlap1.bed, overlap2.bed, independent1.bed, independent2.bed, indicating the overlap peaks in input1.bed, overlap peaks in input2.bed, independent peaks in input1.bed, independent peaks in input2.bed.
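The role of [overlap_ratio] can be illustrated with a small sketch. Here the ratio is defined as the overlap length divided by the length of the shorter peak; that definition is an assumption, and peak_overlap.py may normalize differently. Peaks are (chrom, start, end) tuples.

```python
def overlap_ratio(peak_a, peak_b):
    """Overlap length divided by the shorter peak length (assumed definition)."""
    chrom_a, start_a, end_a = peak_a
    chrom_b, start_b, end_b = peak_b
    if chrom_a != chrom_b:
        return 0.0
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return overlap / min(end_a - start_a, end_b - start_b)


def split_by_overlap(peaks1, peaks2, min_ratio):
    """Split peaks1 into peaks that overlap something in peaks2 and the rest."""
    overlapping, independent = [], []
    for peak in peaks1:
        if any(overlap_ratio(peak, other) >= min_ratio for other in peaks2):
            overlapping.append(peak)
        else:
            independent.append(peak)
    return overlapping, independent


peaks1 = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks2 = [("chr1", 150, 250), ("chr1", 590, 700)]
overlap1, independent1 = split_by_overlap(peaks1, peaks2, 0.5)
print("overlap1:", overlap1)
print("independent1:", independent1)
```

Calling split_by_overlap(peaks2, peaks1, 0.5) gives the corresponding split for the second file.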

35. report_peak_strength.py

report_peak_strength.py
This script is used to report the peak strength of the input bed file.

python3 report_peak_strength.py [input.bw] [input.bed] [output.bed]

[input.bw]: the input bigwig file.
[input.bed]: the input bed file.
[output.bed]: the output bed file, whose last column gives the peak strength.

36. downsample_pkl.py

downsample_pkl.py
This script is used to downsample the input pickle file.

python3 downsample_pkl.py [input.pkl] [output.pkl] [downsample_rate]

[input.pkl]: the input pickle file.
[output.pkl]: the output pickle file.
[downsample_rate]: the downsample rate [Integer]. 16 means 1/16 of the original data.
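One common way to downsample contact counts is to keep each read independently with probability 1/[downsample_rate]. The sketch below applies that idea to a toy dict of counts; it is an assumed approach shown for illustration and is not necessarily how downsample_pkl.py samples reads.

```python
import random


def downsample_counts(contacts, rate, seed=0):
    """Keep each read with probability 1/rate.

    contacts maps (row_bin, col_bin) to an integer read count.
    A fixed seed makes the result reproducible.
    """
    if rate < 1:
        raise ValueError("rate must be at least 1")
    rng = random.Random(seed)
    keep_prob = 1.0 / rate
    sampled = {}
    for key, count in contacts.items():
        kept = sum(1 for _ in range(count) if rng.random() < keep_prob)
        if kept:  # drop bins that end up empty, as in a sparse matrix
            sampled[key] = kept
    return sampled


toy = {(0, 0): 160, (0, 1): 48, (1, 1): 320, (1, 2): 3}
sampled = downsample_counts(toy, 16)
print(sum(toy.values()), "reads before,", sum(sampled.values()), "reads after")
```

With a rate of 16, the total after sampling should be close to 1/16 of the original on average, with some random variation per bin.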

37. TAD_detection.py

TAD_detection.py
This script is used to detect TADs from the input pickle file.

python3 TAD_detection.py --input [input.pkl] --output [output_dir] 

[input.pkl]: the input pickle file.
[output_dir]: the output directory.
In the output directory, the TAD boundary file is saved as TAD.bed, and the insulation score and the normalized insulation score are saved as insulation_score.pkl and norm_insulation_score.pkl.

38. plot_TAD_score_report.py

plot_TAD_score_report.py
This script is used to plot the TAD score report.

python3 plot_TAD_score_report.py [input.bed] [insulation_score.pkl] [output.pdf] [window_size]

[input.bed]: the input bed file.
[insulation_score.pkl]: the input pickle file containing the insulation score.
[output.pdf]: the output pdf/png file.
[window_size]: the window size for the plot.
The first two inputs are generated by TAD_detection.py.

39. plot_loop_length.py

plot_loop_length.py
This script is used to plot the loop strength report.

python3 plot_loop_length.py [input.bed] [output.pdf]

[input.bed]: the input bed file.
[output.pdf]: the output pdf/png file.

40. cmp_loop_report.py

cmp_loop_report.py
This script is used to compare the loop change between two bed files.

python3 cmp_loop_report.py [control.bed] [input.bed] [resolution] [output.pdf]

[control.bed]: the control bed file.
[input.bed]: the input bed file.
[resolution]: the resolution of the data.
[output.pdf]: the output pdf/png file. It is a pie chart showing the loop change.

41. hiccups_enrichment.py

hiccups_enrichment.py
This script is to calculate loop enrichment and output to BEDPE file.

python3 hiccups_enrichment.py --input_bedpe [input.bed] --input_pkl [hic.pkl] \
--output_bedpe [output.bed] --norm [norm_type] --num_cpus [int] --wobble_scope [int] \
--zeros_thresh [int] --donut_size [int] --peak_size [int] --resolution [int] \
--drop_chroms [chrY chrM] --savememory

[input.bed]: Path to the input .bedpe file, which records the loci information.
[hic.pkl]: Path to the input pickle file that stores the Hi-C data.
[output.bed]: Path to save the output .bedpe file.
[norm_type]: Normalization type for Hi-C data (e.g., 'VC', 'KR'). Required for peak enrichment calculation.
[num_cpus]: Number of CPUs for parallelization.
[wobble_scope]: Maximum allowed wobble around the initial bin coordinates. Default is 4(5Kb),2(10Kb),2(25Kb).
[zeros_thresh]: Maximum number of zeros allowed in the observed submatrix. Should be set to (2*donut_size+1)^2.
[donut_size]: Radius of the donut kernel. Default is 7(5Kb),5(10Kb),3(25Kb).
[peak_size]: Radius of the peak region. Default is 4(5Kb),2(10Kb),1(25Kb).
[resolution]: Resolution for binning genomic coordinates. Default is 5000.
[drop_chroms]: Chromosomes to exclude from analysis. Default is ['chrY', 'chrM']. For example, --drop_chroms chrY chrM
[savememory]: Use memory-efficient calculations if set.
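The per-resolution defaults and the zeros_thresh rule listed above can be collected in one place. The sketch below only restates those documented values and computes (2*donut_size+1)^2; the helper names are hypothetical and not part of hiccups_enrichment.py.

```python
# Defaults copied from the parameter descriptions above, keyed by resolution.
HICCUPS_DEFAULTS = {
    5000: {"wobble_scope": 4, "donut_size": 7, "peak_size": 4},
    10000: {"wobble_scope": 2, "donut_size": 5, "peak_size": 2},
    25000: {"wobble_scope": 2, "donut_size": 3, "peak_size": 1},
}


def enrichment_params(resolution):
    """Return the documented defaults plus the derived zeros_thresh."""
    if resolution not in HICCUPS_DEFAULTS:
        raise ValueError(f"no documented defaults for resolution {resolution}")
    params = dict(HICCUPS_DEFAULTS[resolution])
    params["zeros_thresh"] = (2 * params["donut_size"] + 1) ** 2
    params["resolution"] = resolution
    return params


def enrichment_flags(resolution):
    """Format the parameters as command-line flags (hypothetical helper)."""
    params = enrichment_params(resolution)
    flags = []
    for name in ("wobble_scope", "zeros_thresh", "donut_size", "peak_size", "resolution"):
        flags += [f"--{name}", str(params[name])]
    return flags


print(" ".join(enrichment_flags(10000)))
```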

42. loop_overlap.py

loop_overlap.py
This script is used to compare the loop change between two bed files and outputs independent/overlap loops.

python3 loop_overlap.py [control.bed] [input.bed] [resolution] [output_dir]

[control.bed]: the control bed file recording the control loop location.
[input.bed]: the input bed file recording the input loop location.
[resolution]: the resolution of the Hi-C data.
[output_dir]: the output directory.
The output files are overlap.bed, independent1.bed, independent2.bed, indicating the overlap loops, independent loops in control, independent loops in input.

43. array2bigwig.py

array2bigwig.py
This script is used to convert a dict of arrays (.pkl) to a bigwig file.

python3 array2bigwig.py [input_file] [output_bigwig] [resolution]

[input_file]: the pkl file to be converted, should be a dict in format of [chr]:[array].
[output_bigwig]: the output bigwig file.
[resolution]: the resolution stored in the pkl file.

44. annotate_array_loop.py

annotate_array_loop.py
This script is to annotate and visualize the input array with loop information.

python3 annotate_array_loop.py [input.pkl] [loop.bed] [output.png] [chrom1] [start_index1] [end_index1] [chrom2] [start_index2] [end_index2] [resolution] [max_value] [mode]

[input.pkl] is the path to the pickle file containing the Hi-C array.
[input.pkl] format: [chrom1_chrom2]:[array] format for common mode. Here the array should be a scipy sparse array.
For intra-chromosome only, the dict format can be [chrom]:[array] in pickle files.
[loop.bed] is the path to the bed file containing the loop information.
[output.png] is the name of the output png file.
[chrom1] is the name of the first chromosome.
[start_index1] is the start index of the first chromosome.
[end_index1] is the end index of the first chromosome.
[chrom2] is the name of the second chromosome.
[start_index2] is the start index of the second chromosome.
[end_index2] is the end index of the second chromosome.
[resolution] is the resolution of the input array.
[max_value] is the maximum threshold of the input array for figures.
[mode]: 0:raw visualization; 1: log visualization.

45. looptrack_visualization.py

looptrack_visualization.py
This script is to annotate and visualize the input array with loop information and related to epigenomic assays.

python3 looptrack_visualization.py [input.pkl] [loop.bed] [track.bigWig] [output.png] [chrom1] [start_index1] [end_index1] [chrom2] [start_index2] [end_index2] [resolution] [max_value] [mode]

[input.pkl] is the path to the pickle file containing the Hi-C array.
[input.pkl] format: [chrom1_chrom2]:[array] format for common mode. Here the array should be a scipy sparse array.
For intra-chromosome only, the dict format can be [chrom]:[array] in pickle files.
[loop.bed] is the path to the bed file containing the loop information.
[track.bigWig] is the path to the bigWig file containing the epigenomic track information.
[output.png] is the name of the output png file.
[chrom1] is the name of the first chromosome.
[start_index1] is the start index of the first chromosome.
[end_index1] is the end index of the first chromosome.
[chrom2] is the name of the second chromosome.
[start_index2] is the start index of the second chromosome.
[end_index2] is the end index of the second chromosome.
[resolution] is the resolution of the input array.
[max_value] is the maximum threshold of the input array for figures.
[mode]: 0:raw visualization; 1: log visualization.

46. extract_peak_sequence.py

extract_peak_sequence.py
This script is used to extract peak sequences from genome fasta file.

python3 extract_peak_sequence.py [peak_bed] [genome_fasta] [output_fasta] [window_region]

[peak_bed]: the bed file containing peak regions.
[genome_fasta]: the genome fasta file.
[output_fasta]: the output fasta file containing peak sequences.
[window_region]: the window region to extract sequence around peak regions.
If set to 0, then use the peak region itself.

47. cmp_bigwig_correlation.py

cmp_bigwig_correlation.py
This script is used to calculate the correlation between two bigwig files.

python3 cmp_bigwig_correlation.py [input1.bigWig] [input2.bigWig] [resolution]

[input1.bigWig]: the first bigwig file.
[input2.bigWig]: the second bigwig file.
[resolution]: the resolution to calculate the correlation.
This script will output pearson correlation, spearman correlation, and cosine similarity between the two bigwig files.
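For reference, the three reported metrics can be computed on two binned signal vectors as in the sketch below. It uses plain Python lists and textbook definitions; how cmp_bigwig_correlation.py bins the signal at [resolution] is not shown here, and the toy vectors are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(x, y):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


signal1 = [0.0, 1.0, 2.0, 5.0]
signal2 = [0.5, 1.5, 2.0, 9.0]
print(pearson(signal1, signal2), spearman(signal1, signal2), cosine(signal1, signal2))
```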

48. cmp_bigwig_peak_correlation.py

cmp_bigwig_peak_correlation.py
This script is used to calculate the correlation between two bigwig files on the locus specified in .bed file.

python3 cmp_bigwig_peak_correlation.py [input1.bigWig] [input2.bigWig] [reference.bed] [output_png]

[input1.bigWig]: the first bigwig file.
[input2.bigWig]: the second bigwig file.
[reference.bed]: the bed file containing the locus to calculate the correlation.
[output_png]: the output png file to save the peak total reads comparison.
This script will output pearson correlation, spearman correlation, and cosine similarity between the two bigwig files.

49. bigwig_peak_distribution.py

bigwig_peak_distribution.py
This script plots the peak distribution of the bigwig file according to the peak region specified in .bed file.

python3 bigwig_peak_distribution.py [input.bw] [input.peak] [output_fig] [mode]

[input.bw]: the input bigwig file.
[input.peak]: the input peak file, specifying the peak region.
[output_fig]: the output figure path to show the peak distribution.
[mode]: 0:raw_value, 1:log10_value.

50. bigwig_2peak_distribution.py

bigwig_2peak_distribution.py
This script compares the positive and negative peak distribution of the bigwig file according to the peak region specified in .bed files.

python3 bigwig_2peak_distribution.py [input.bw] [positive.bed] [negative.bed] [output_fig] [mode]

[input.bw]: the input bigwig file.
[positive.bed]: the input positive peak file, specifying the positive peak region.
[negative.bed]: the input negative peak file, specifying the negative peak region.
[output_fig]: the output figure path to show the peak distribution.
[mode]: 0:raw_value, 1:log10_value.

51. loop_ctcfpeak_ratio.py

loop_ctcfpeak_ratio.py
This script is used to calculate the ratio of chromatin loops that overlap with CTCF ChIP peaks.

python3 loop_ctcfpeak_ratio.py [loop.bed] [ctcf_peak.bed] [resolution]

[loop.bed]: the chromatin loop coordinate, in bed format
[ctcf_peak.bed]: the CTCF peak coordinate, in bed format
[resolution]: the resolution of the Hi-C data

52. loop_ctcf_enrichment.py

loop_ctcf_enrichment.py
This script is used to calculate the CTCF enrichment in the loop regions.

python3 loop_ctcf_enrichment.py [loop.bed] [ctcf_chip.bw] [resolution] [output.bed]

[loop.bed]: the chromatin loop coordinate, in bed format
[ctcf_chip.bw]: the CTCF ChIP-seq signal in bigwig format
[resolution]: the resolution of the Hi-C data
[output.bed]: the output bed file. Format: chr1 start1 end1 start2 end2 enrichment1 enrichment2

53. diff2png.py

diff2png.py
This script is to visualize the difference comparison array in png format.

python3 diff2png.py [input.pkl] [output.png] [chrom1] [start_index1] [end_index1] [chrom2] [start_index2] [end_index2] [resolution] [vmin] [vmax]

input.pkl: the path to the pickle file containing the difference comparison array [String].
input.pkl format: [chrom1_chrom2]:[array] format for common mode. Here the array should be a scipy sparse array. For intra-chromosome only, the dict format can be [chrom]:[array] in pickle files.
output.png: the name of the output png file [String].
chrom1: the name of the first chromosome [String].
start_index1: the start index of the first chromosome [Integer].
end_index1: the end index of the first chromosome [Integer].
chrom2: the name of the second chromosome [String].
start_index2: the start index of the second chromosome [Integer].
end_index2: the end index of the second chromosome [Integer].
resolution: resolution of the input array [Integer].
vmin: the minimum threshold of the input array for figures[Float].
vmax: the maximum threshold of the input array for figures[Float].

54. bigwig_coverage.py

bigwig_coverage.py
This script calculates the total coverage of a bigwig file.

python3 bigwig_coverage.py [input_bw]

[input_bw]: the input bigwig file.

55. plot_bigwig_signal.py

plot_bigwig_signal.py
This script is used to plot the region of signal of a bigwig file.

python3 plot_bigwig_signal.py [input.bw] [output.png] [chromosome] [start] [end]

input.bw: the path to the bigwig file [String].
output.png: the name of the output png file [String].
chromosome: the chromosome name [String].
start: the start position of the region [Integer].
end: the end position of the region [Integer].
