HiCAT: Hierarchical Centromere structure AnnoTation Tool

Advances in long-read sequencing have transformed genome assembly, making complex centromeric regions accessible and opening a new stage in genomics research. These newly resolved regions raise new computational problems, such as centromere annotation, that require novel bioinformatics methods. HiCAT is a generalized computational tool that annotates centromeres automatically, based on a hierarchical tandem repeat mining (HTRM) method and a tandem repeat coverage maximization strategy.

Dependencies

Python 3.9.7

Packages Version used in Research
numpy 1.22.3
pandas 1.4.0
networkx 2.7.1
joblib 1.1.0
Levenshtein 0.12.2
matplotlib 3.5.1

StringDecomposer (https://github.com/ablab/stringdecomposer), version 1.1.2.
Development environment: Linux
Development tool: PyCharm

Usage

Run HiCAT

sh start.sh input_file output_path python_path sd_path monomer_template thread

input_file is the input FASTA file, e.g. ./testdata/cen21.fa.

output_path is the output directory.

python_path is the path of the Python environment.

sd_path is the path of StringDecomposer.

monomer_template is the monomer template that StringDecomposer uses to obtain blocks, e.g. ./testdata/AlphaSat.fa.

thread is the number of threads.

Other optional parameters have defaults that can be modified in start.sh, including min_similarity (0.94), step (0.005), and max_hor_len (40).
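
The positional arguments can also be assembled from a script. The following is a minimal sketch and not part of HiCAT: the helper name build_hicat_command, the thread count, and every path other than ./testdata/cen21.fa and ./testdata/AlphaSat.fa are hypothetical placeholders. It only builds and prints the command; it does not run it.

```python
import shlex


def build_hicat_command(input_file, output_path, python_path, sd_path,
                        monomer_template, thread):
    """Return the start.sh invocation as an argument list.

    The order follows the usage line:
    sh start.sh input_file output_path python_path sd_path monomer_template thread
    """
    return [
        "sh", "start.sh",
        input_file, output_path, python_path, sd_path,
        monomer_template, str(thread),
    ]


# Hypothetical paths; replace them with the locations on your system.
command = build_hicat_command(
    input_file="./testdata/cen21.fa",
    output_path="./out",
    python_path="/path/to/python",
    sd_path="/path/to/stringdecomposer",
    monomer_template="./testdata/AlphaSat.fa",
    thread=4,
)
print(shlex.join(command))
# To actually launch HiCAT: subprocess.run(command, check=True)
```

Keeping the arguments in a list avoids quoting problems when a path contains spaces.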

Visualization

By default, HiCAT visualizes the top five HORs with repeat numbers greater than 10, at the similarity that gives maximum HOR coverage.

For custom visualization, use visualization.py:
-r is the HiCAT result directory.
-s is the similarity to visualize.
-sp is the number of top HORs to show.
-sn is the minimum repeat number of an HOR.
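
A custom visualization call could be put together as follows. This is a sketch under assumptions: the flags -r, -s, -sp, and -sn come from the list above, but the helper name build_visualization_command, launching the script as "python visualization.py", the result directory ./out, and the similarity value 0.94 are illustrative guesses. The values 5 and 10 mirror the defaults described above.

```python
import shlex


def build_visualization_command(result_dir, similarity, top_hors, min_repeats,
                                python="python"):
    """Return a visualization.py invocation as an argument list."""
    return [
        python, "visualization.py",
        "-r", result_dir,          # HiCAT result directory
        "-s", str(similarity),     # similarity to visualize
        "-sp", str(top_hors),      # number of top HORs
        "-sn", str(min_repeats),   # minimum repeat number of an HOR
    ]


# Hypothetical values; adjust them to your own run.
vis_command = build_visualization_command("./out", 0.94, 5, 10)
print(shlex.join(vis_command))
# To actually run it: subprocess.run(vis_command, check=True)
```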

Output

Results are written to the out directory.

out_hor.raw.fa contains the DNA sequence of each HOR.

out_hor.normal.fa contains the normalized DNA sequence of each HOR. Each raw sequence is normalized to a single representative HOR. For example, in CEN21, 10_4_6_1_2_6_1_2_7_8_5_3_7_9 is normalized to 6_1_2_7_8_5_3_7_9_10_4.
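
One possible reading of the CEN21 example is that an immediately repeated sub-pattern (6_1_2 occurs twice in a row in the raw pattern) is collapsed, and the result is then rotated cyclically. This is an interpretation, not HiCAT's documented procedure, and it does not say how the starting monomer is chosen. The sketch below only checks that the two patterns in the example are consistent with that reading; the function names are hypothetical.

```python
def collapse_tandem(units):
    """Collapse immediately repeated sub-patterns, e.g. A B A B C -> A B C."""
    units = list(units)
    changed = True
    while changed:
        changed = False
        n = len(units)
        for period in range(1, n // 2 + 1):
            for i in range(n - 2 * period + 1):
                if units[i:i + period] == units[i + period:i + 2 * period]:
                    # Drop the second copy of the repeated sub-pattern.
                    del units[i + period:i + 2 * period]
                    changed = True
                    break
            if changed:
                break
    return units


def is_rotation(a, b):
    """Return True if list b is a cyclic rotation of list a."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    doubled = a + a
    return any(doubled[i:i + len(b)] == b for i in range(len(a)))


raw = "10_4_6_1_2_6_1_2_7_8_5_3_7_9".split("_")
normalized = "6_1_2_7_8_5_3_7_9_10_4".split("_")

collapsed = collapse_tandem(raw)
print("_".join(collapsed))                 # 10_4_6_1_2_7_8_5_3_7_9
print(is_rotation(collapsed, normalized))  # True
```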

out_monomer.fa contains each monomer DNA sequence.

pattern_static.xls lists the repeat number of each HOR.

pattern_static.pdf shows the HOR repeat numbers as a bar plot.

plot_pattern.pdf is the plot of the hierarchical centromere structure annotation.

out_top_layer.xls is the annotation of the top layer.

out_all_layer.xls is the annotation of all layers. The label "top" means the region is in the top layer. The label "cover" means the region is covered by a top-layer region.
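
The FASTA outputs (out_hor.raw.fa, out_hor.normal.fa, out_monomer.fa) can be inspected with a few lines of Python. The reader below is a minimal sketch and not part of HiCAT. It makes no assumption about the header format HiCAT writes, and the two records are toy data used only to demonstrate the function.

```python
import io


def read_fasta(handle):
    """Return {header: sequence} from an open FASTA text handle."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy records standing in for a HiCAT FASTA output file.
toy = io.StringIO(">hor_a\nACGT\nACG\n>hor_b\nTTGACA\n")
lengths = {name: len(seq) for name, seq in read_fasta(toy).items()}
print(lengths)
```

For a real run, replace the toy handle with an opened output file, for example open("out/out_hor.normal.fa"), assuming the out directory described above.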

Contact

If you have any questions, please feel free to contact: gaoxian15002970749@163.com, xfyang@xjtu.edu.cn, kaiye@xjtu.edu.cn
