Advances in long-read sequencing have transformed genome assembly, making complex regions such as centromeres accessible and opening a new stage in genomics research. These newly resolved regions raise new computational problems, such as centromere annotation, that require novel bioinformatics methods. HiCAT is a generalized computational tool that automates centromere annotation, based on a hierarchical tandem repeat mining (HTRM) method and a tandem repeat coverage maximization strategy.
Python 3.9.7
Packages | Version used in Research |
---|---|
numpy | 1.22.3 |
pandas | 1.4.0 |
networkx | 3.0.3 |
numpy | 2.7.1 |
joblib | 1.1.0 |
Levenshtein | 0.12.2 |
matplotlib | 3.5.1 |
StringDecomposer (https://github.com/ablab/stringdecomposer), version 1.1.2.
Development environment: Linux
Development tool: PyCharm
sh start.sh input_file output_path python_path sd_path monomer_template thread
input_file is the input FASTA file, e.g. ./testdata/cen21.fa.
output_path is the output directory.
python_path is the path of the Python environment.
sd_path is the path of StringDecomposer.
monomer_template is the monomer template that StringDecomposer uses to obtain blocks, e.g. ./testdata/AlphaSat.fa.
thread is the number of threads.
Other optional parameters have defaults that can be modified in start.sh, including min_similarity (0.94), step (0.005), and max_hor_len (40).
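As a convenience, the positional arguments above can be assembled from Python before calling the script. The following is a minimal sketch, not part of HiCAT: `build_hicat_command` is a hypothetical helper, and `./out`, `/path/to/python`, and `/path/to/stringdecomposer` are placeholder paths that must be replaced with real ones.

```python
import shlex


def build_hicat_command(input_file, output_path, python_path, sd_path,
                        monomer_template, thread, script="start.sh"):
    """Return the argument list for `sh start.sh ...` in the documented order.

    Hypothetical helper: it only arranges the six positional arguments
    described above and does not check that the paths exist.
    """
    if int(thread) < 1:
        raise ValueError("thread must be a positive integer")
    return ["sh", script, input_file, output_path, python_path, sd_path,
            monomer_template, str(thread)]


# Placeholder paths (assumptions); the two ./testdata files come from the text.
cmd = build_hicat_command(
    "./testdata/cen21.fa",
    "./out",
    "/path/to/python",
    "/path/to/stringdecomposer",
    "./testdata/AlphaSat.fa",
    4,
)
print(shlex.join(cmd))
# To actually launch HiCAT: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems when a path contains spaces.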
By default, HiCAT visualizes the top five HORs with repeat numbers greater than 10 at the similarity that gives the maximum HOR coverage.
For custom visualization, use visualization.py:
-r is the HiCAT result directory.
-s is the similarity to visualize.
-sp is the number of top HORs to show.
-sn is the minimum repeat number of an HOR.
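The selection rule behind the default view (keep HORs above a repeat-number threshold, then take the most frequent ones) can be illustrated as follows. This is only a sketch of the idea, not the code in visualization.py: `select_hors` is a hypothetical function, the counts are made-up toy values, and the strict `>` comparison is an assumption taken from the phrase "greater than 10".

```python
def select_hors(repeat_counts, top_n=5, min_repeat=10):
    """Keep HORs whose repeat number exceeds min_repeat, most frequent first.

    repeat_counts maps an HOR name to its repeat number.
    Assumption: the threshold is strict (n > min_repeat).
    """
    kept = [(hor, n) for hor, n in repeat_counts.items() if n > min_repeat]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


# Toy repeat numbers, invented purely for illustration.
toy_counts = {
    "HOR_A": 120, "HOR_B": 45, "HOR_C": 10, "HOR_D": 11,
    "HOR_E": 3, "HOR_F": 30, "HOR_G": 25,
}
selected = select_hors(toy_counts)
for hor, n in selected:
    print(hor, n)
```

Here `top_n` and `min_repeat` play the roles described for -sp and -sn.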
The results are in the out directory.
out_hor.raw.fa contains the raw DNA sequence of each HOR.
out_hor.normal.fa contains the normalized DNA sequence of each HOR. Each raw sequence is normalized to a single representative HOR. For example, in CEN21, 10_4_6_1_2_6_1_2_7_8_5_3_7_9 is normalized to 6_1_2_7_8_5_3_7_9_10_4.
out_monomer.fa contains the DNA sequence of each monomer.
pattern_static.xls lists the repeat number of each HOR.
pattern_static.pdf shows the HOR repeat numbers as a bar plot.
plot_pattern.pdf is the plot of the hierarchical centromere structure annotation.
out_top_layer.xls contains the top-layer annotation.
out_all_layer.xls contains the annotation of all layers. The label "top" means the region is in the top layer; the label "cover" means the region is covered by a top-layer region.
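The CEN21 normalization example given for out_hor.normal.fa is consistent with two steps: collapsing the nested tandem repeat (6_1_2_6_1_2 becomes 6_1_2) and rotating the cyclic pattern. The sketch below checks that reading on the documented example. It is an assumption about what normalization does, not HiCAT's implementation; `parse_pattern`, `collapse_tandem`, and `is_rotation` are hypothetical helpers, and the rule HiCAT uses to pick the starting monomer is not reproduced here.

```python
def parse_pattern(text):
    """Turn '10_4_6' into [10, 4, 6]."""
    return [int(x) for x in text.split("_")]


def collapse_tandem(units):
    """Remove one copy of any adjacent duplicated sub-pattern until none remain."""
    units = list(units)
    changed = True
    while changed:
        changed = False
        for k in range(1, len(units) // 2 + 1):        # sub-pattern length
            for i in range(len(units) - 2 * k + 1):    # start position
                if units[i:i + k] == units[i + k:i + 2 * k]:
                    del units[i + k:i + 2 * k]
                    changed = True
                    break
            if changed:
                break
    return units


def is_rotation(a, b):
    """True if list b is a cyclic rotation of list a."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    return any(a[i:] + a[:i] == b for i in range(len(a)))


raw = parse_pattern("10_4_6_1_2_6_1_2_7_8_5_3_7_9")
normalized = parse_pattern("6_1_2_7_8_5_3_7_9_10_4")

collapsed = collapse_tandem(raw)
print(collapsed)                           # [10, 4, 6, 1, 2, 7, 8, 5, 3, 7, 9]
print(is_rotation(collapsed, normalized))  # True
```

The check only shows that the collapsed raw pattern and the normalized pattern describe the same cyclic HOR.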
If you have any questions, please feel free to contact: gaoxian15002970749@163.com, xfyang@xjtu.edu.cn, kaiye@xjtu.edu.cn