HiCAT: Hierarchical Centromere structure AnnoTation Tool

Advances in long-read sequencing have revolutionized genome assembly, resolving complex regions such as centromeres and opening a new stage in genomics research. These newly accessible regions raise new computational problems, such as centromere annotation, that require novel bioinformatics methods. Here, we propose HiCAT, a generalized computational tool that automates centromere annotation using a hierarchical tandem repeat mining (HTRM) method and a tandem repeat coverage maximization strategy.

Dependencies

Python 3.9.7

Packages	Version used in Research
numpy	1.22.3
pandas	1.4.0
networkx	3.0.3
numpy	2.7.1
joblib	1.1.0
Levenshtein	0.12.2
matplotlib	3.5.1

StringDecomposer (https://github.com/ablab/stringdecomposer), version 1.1.2.
Development environment: Linux
Development tool: PyCharm

Usage

Run HiCAT

sh start.sh input_file output_path python_path sd_path monomer_template thread

input_file is the input FASTA file, e.g. ./testdata/cen21.fa.

output_path is the output directory.

python_path is the path of the python environment.

sd_path is the path of StringDecomposer.

monomer_template is the monomer template that StringDecomposer uses to obtain blocks, e.g. ./testdata/AlphaSat.fa.

thread is the number of threads.

Other optional parameters have defaults that can be changed in start.sh: min_similarity (0.94), step (0.005), and max_hor_len (40).
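For scripted runs, the call above can be assembled from Python. This is a minimal sketch, not part of HiCAT: the helper name `build_hicat_command` is hypothetical, and the output directory, interpreter path, StringDecomposer path, and thread count are placeholder values that must be replaced with your own. Only the argument order is taken from the usage line above.

```python
import shlex
from typing import List


def build_hicat_command(
    input_file: str,
    output_path: str,
    python_path: str,
    sd_path: str,
    monomer_template: str,
    thread: int,
    start_script: str = "start.sh",
) -> List[str]:
    """Return the argv list for `sh start.sh ...` in the documented order."""
    if thread < 1:
        raise ValueError("thread must be a positive integer")
    return [
        "sh", start_script,
        input_file, output_path, python_path, sd_path, monomer_template,
        str(thread),
    ]


cmd = build_hicat_command(
    input_file="./testdata/cen21.fa",
    output_path="./hicat_out",                 # placeholder output directory
    python_path="/path/to/python",             # placeholder, use your own
    sd_path="/path/to/stringdecomposer",       # placeholder, use your own
    monomer_template="./testdata/AlphaSat.fa",
    thread=4,                                  # placeholder thread count
)
print(shlex.join(cmd))
# To launch the run instead of printing it:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.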

Visualization

By default, HiCAT visualizes the top five HORs with repeat numbers greater than 10 at the similarity that gives the maximum HOR coverage.

For custom visualization, use visualization.py with the following options:
-r is the HiCAT result directory.
-s is the similarity to visualize.
-sp is the number of top HORs to show.
-sn is the minimum repeat number of a HOR.
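The options above can be combined into one call. The sketch below only builds the argument list. It assumes that visualization.py is run with a Python interpreter and that -s accepts a numeric similarity such as 0.94; neither is stated in this README. The helper name and the result directory are hypothetical, while the values 5 and 10 mirror the defaults described above.

```python
import sys
from typing import List


def build_visualization_command(
    result_dir: str,
    similarity: float,
    top_hors: int = 5,
    min_repeats: int = 10,
    script: str = "visualization.py",
) -> List[str]:
    """Assemble a visualization.py call from the four documented flags."""
    return [
        sys.executable, script,
        "-r", result_dir,        # HiCAT result directory
        "-s", str(similarity),   # similarity to visualize (format assumed)
        "-sp", str(top_hors),    # number of top HORs
        "-sn", str(min_repeats), # minimum HOR repeat number
    ]


vis_cmd = build_visualization_command("./hicat_out", 0.94)  # placeholder inputs
print(" ".join(vis_cmd))
```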

Output

The results are written to the out directory.

out_hor.raw.fa contains each HOR DNA sequence.

out_hor.normal.fa contains each normalized HOR DNA sequence. Each raw HOR is normalized to a single representative HOR. For example, in CEN21, 10_4_6_1_2_6_1_2_7_8_5_3_7_9 is normalized to 6_1_2_7_8_5_3_7_9_10_4.
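The CEN21 example can be checked by hand: removing the duplicated 6_1_2 block from the raw pattern leaves 11 monomers, and that sequence is a cyclic rotation of the normalized pattern. The sketch below verifies this relation only. It is not HiCAT's normalization procedure, and this README does not say how the starting monomer of the normalized form is chosen. The function names are hypothetical.

```python
from typing import List


def collapse_adjacent_repeats(units: List[str]) -> List[str]:
    """Drop one copy of any block that is immediately repeated, until none remain."""
    units = list(units)
    changed = True
    while changed:
        changed = False
        n = len(units)
        for size in range(1, n // 2 + 1):
            for start in range(n - 2 * size + 1):
                first = units[start:start + size]
                second = units[start + size:start + 2 * size]
                if first == second:
                    del units[start + size:start + 2 * size]
                    changed = True
                    break
            if changed:
                break
    return units


def is_rotation(a: List[str], b: List[str]) -> bool:
    """True if b can be obtained by cyclically shifting a."""
    if len(a) != len(b):
        return False
    return not a or any(a[i:] + a[:i] == b for i in range(len(a)))


raw = "10_4_6_1_2_6_1_2_7_8_5_3_7_9".split("_")
normal = "6_1_2_7_8_5_3_7_9_10_4".split("_")

collapsed = collapse_adjacent_repeats(raw)
print("_".join(collapsed))             # 10_4_6_1_2_7_8_5_3_7_9
print(is_rotation(collapsed, normal))  # True
```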

out_monomer.fa contains each monomer DNA sequence.

pattern_static.xls lists the repeat number of each HOR.

pattern_static.pdf shows the HOR repeat numbers as a bar plot.

plot_pattern.pdf is the plot of the hierarchical centromere structure annotation.

out_top_layer.xls is the top-layer annotation.

out_all_layer.xls is the annotation across all layers. The label "top" means the region is in the top layer. The label "cover" means the region is covered by a top-layer region.
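The .fa outputs listed above can be inspected without extra dependencies. The following is a minimal FASTA reader, assuming standard FASTA layout (a ">" header line followed by one or more sequence lines). The record names in the example are made up, since this README does not describe the header format HiCAT writes.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up records for illustration; real headers may look different.
example = [">monomer_1", "ACGT", "AC", ">monomer_2", "GGTA"]
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))

# With a real result file (path is a placeholder):
#   with open("out/out_monomer.fa") as handle:
#       records = parse_fasta(handle)
```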

Contact

If you have any questions, please feel free to contact: gaoxian15002970749@163.com, xfyang@xjtu.edu.cn, kaiye@xjtu.edu.cn
