Genome assembly data (an example, Arabidopsis thaliana chromosome 1, coordinates 1-20,000,000, is provided in /test_data/ddAraThal4.1_chr1_20Mb.fa).
Third-party software, installed according to its documentation:
- R version 4.1.3 or newer with the following libraries and their dependencies (each script lists the libraries it requires):
  - seqinr
  - msa
  - Biostrings
  - dplyr
  - Matrix
  - pheatmap
  - GenomicRanges
- TRASH2 which also requires:
- EDTA 2.0.0
- TEsorter 1.4.6
- Helixer 0.3.4
- Run annotation software:
- TRASH2
- EDTA
- TEsorter
- Filter annotations with the scripts found in /repeat_te_gene_annotations_parsing, in the order they appear. Modify the paths at the beginning of each script to match the outputs of the step above
- Calculate initial predictor scores for individual repeat families with the /centromere_identification/6_find_centromeric_repeat.R script
- Create a genomic landscape plot with /centromere_identification/10_global_plot.R to visualise the locations and predictor scores of the top-scoring families, and use it to decide on the putative centromeric repeat
Most downstream analyses require a metadata .csv table with information on the previously identified centromeric families and a classification of every genome as A. satellite-based monocentric, B. satellite-based holocentric, C. transposon-based monocentric, or D. unknown. A second metadata file, containing information on each identified centromeric family (source genome name, the repeat's TRASH2 name and its custom name), is also required.
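As an illustration of the classification table described above, the sketch below reads a metadata .csv and checks that every genome carries one of the four classes (A to D). The column names `genome`, `centromere_class` and `centromeric_family` are assumptions made for this example and are not taken from the repository; adjust them to match the actual metadata file.

```python
import csv
import io

# The four classes named above: A, B, C and D.
ALLOWED_CLASSES = {"A", "B", "C", "D"}


def load_genome_classes(handle):
    """Return {genome: class}, raising ValueError on a class outside A-D.

    Column names ("genome", "centromere_class") are hypothetical.
    """
    classes = {}
    for row in csv.DictReader(handle):
        genome = row["genome"].strip()
        cls = row["centromere_class"].strip().upper()
        if cls not in ALLOWED_CLASSES:
            raise ValueError(f"{genome}: unknown centromere class {cls!r}")
        classes[genome] = cls
    return classes


# Small in-memory table standing in for the real metadata file.
example = io.StringIO(
    "genome,centromere_class,centromeric_family\n"
    "genome_1,A,family_x\n"
    "genome_2,D,\n"
)
print(load_genome_classes(example))
```

Checking the table once up front makes a mistyped class fail early instead of surfacing in a later analysis step.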
- /centromere_identification/11_array_cen_pericen_coordinates.R Identifies centromeric and centromere-proximal regions for genomes with identified centromeric repeats
- /centromere_identification/12_find_gaps.R Identifies gaps in the centromeric arrays
- /centromere_identification/10.1_global_plot_plus_cen_coordinates.R Replots the genomic landscape plot with additional information from the /11_array_cen_pericen_coordinates.R script and highlights the identified centromeric repeat
- /hor_scripts/make_hor_scripts.R Determines which chromosomes contain centromeric satellites and the names of those satellites, then creates SLURM submission commands that analyse each of them individually
- /hor_scripts/9_HOR_periods.R Creates visualisations of the identified HORs, used to assess whether runs completed and to inform on the HOR content of individual genomes
- /hor_scripts/0.01_HOR_score.R Calculates a HOR score for each repeat in which HORs were identified
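The idea behind generating one SLURM submission per chromosome, as make_hor_scripts.R does, can be sketched as follows. This is not the repository's implementation: the script name `hor_analysis.R`, the argument order and the job-name prefix are placeholders invented for the example. Only the `sbatch` options `--job-name` and `--wrap` are real SLURM options.

```python
import shlex


def build_sbatch_commands(centromeric_repeats, script="hor_analysis.R"):
    """Build one sbatch command per chromosome.

    centromeric_repeats maps chromosome name -> centromeric repeat name.
    The script name and its arguments are hypothetical placeholders.
    """
    commands = []
    for chrom, repeat in sorted(centromeric_repeats.items()):
        # Command that the job itself will execute on the compute node.
        job = " ".join(
            ["Rscript", shlex.quote(script), shlex.quote(chrom), shlex.quote(repeat)]
        )
        commands.append(
            f"sbatch --job-name={shlex.quote('hor_' + chrom)} --wrap={shlex.quote(job)}"
        )
    return commands


for command in build_sbatch_commands({"chr2": "rep_178", "chr1": "rep_178"}):
    print(command)
```

Quoting each argument with `shlex.quote` keeps chromosome or repeat names that contain spaces or shell metacharacters from breaking the generated command line.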
/main_downstream_analysis contains the scripts used for the remaining analyses. Script names describe their purpose, and a comment section at the beginning of many scripts adds detail on functionality and algorithms where needed. Although the scripts are numbered, none of them requires earlier scripts in this directory to have finished before it can run.
All scripts mentioned in the Putative centromere repeat identification, Centromere coordinates... and HOR analysis sections above must be run to completion. Three additional scripts must also finish before the full table can be constructed; they perform the most computationally expensive steps of the summary:
- /main_summary/13.1_table_S1_HOR_stats.R Reads in and summarises results of the HOR analyses
- /main_summary/13.2_table_S1_GC.R Calculates GC values for genomes, chromosomes and genomic intervals (repeats, transposons, their subsets etc.)
- /main_downstream_analysis/37_similarity_values_within_between_chr.r Calculates similarity values within and between chromosomes for centromeric repeats
/main_summary/13_table_S1_whole_summary.R Generates the full table for all analysed genomes
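For orientation, the kind of GC calculation performed for genomes, chromosomes and genomic intervals can be sketched as below. This is a minimal illustration, not the code in 13.2_table_S1_GC.R. It assumes that GC content is computed over unambiguous bases only (N and other symbols are ignored) and that intervals are 1-based and inclusive; both are assumptions of this example.

```python
def gc_fraction(sequence):
    """GC fraction over A/C/G/T only; N and other symbols are ignored."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # A sequence with no unambiguous bases has no defined GC fraction.
    return gc / total if total else float("nan")


def interval_gc(sequence, start, end):
    """GC fraction of the 1-based, inclusive interval [start, end]."""
    return gc_fraction(sequence[start - 1:end])


chromosome = "ATGCGCNNATAT"
print(gc_fraction(chromosome))         # whole sequence
print(interval_gc(chromosome, 3, 6))   # interval covering "GCGC"
```

The same `interval_gc` call can be applied to the coordinates of each repeat or transposon to obtain per-feature values.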
/dNdS/analyze_dnds.ipynb describes and visualises the per-site dN/dS analysis of the CENH3 protein
The /te_and_gap_analysis/ directory contains the scripts for the transposon reannotation and rescue steps, as well as the scripts required to create individual figures, which are kept in subdirectories:
- /te_and_gap_analysis/Figure_3/
- /te_and_gap_analysis/Figure_4/
- /te_and_gap_analysis/supp_data/
Additional information is provided in the /te_and_gap_analysis/te_readme.txt file. bedtools v2.27.1 is used to intersect coordinates and extract sequences. Required libraries are specified in each R script.
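To make the coordinate-intersection step concrete, the sketch below shows the underlying idea for two interval lists on a single chromosome. It only illustrates the concept and does not replace bedtools or reproduce its output format. It assumes BED-style coordinates (0-based, half-open) and that each input list is sorted and free of internal overlaps.

```python
def intersect_intervals(a, b):
    """Return the overlapping parts of two interval lists.

    Each list holds (start, end) tuples on one chromosome, 0-based and
    half-open, sorted by start and without overlaps inside the list.
    """
    i = j = 0
    overlaps = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # an empty or negative span means no overlap
            overlaps.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


repeats = [(0, 10), (20, 30)]
transposons = [(5, 25)]
print(intersect_intervals(repeats, transposons))
```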
- /FISH_oligo-pools/ contains pools of oligonucleotides used in FISH experiments
- Figshare S1 figure: plots made with the /centromere_identification/10.1_global_plot_plus_cen_coordinates.R script, which together make up the Supplemental Data 1 figure
- Figshare additional data: all main data created by the scripts in this repository
- Figure 1a tree: interactive version of the tree from Figure 1a
- CENP-A tree: interactive version of the CENP-A tree from Figure 1b