Analysis of 193 genomes and their centromeres

Requirements

Genome assembly data (an example, Arabidopsis thaliana chromosome 1, coordinates 1-20,000,000, is provided in /test_data/ddAraThal4.1_chr1_20Mb.fa)

Third-party software, installed according to its documentation:

R version 4.1.3 or newer with the following libraries and their dependencies (each script lists the libraries it requires):
- seqinr
- msa
- Biostrings
- dplyr
- Matrix
- pheatmap
- GenomicRanges
TRASH2, which also requires:
- MAFFT 7.526
- HMMER 3.4
EDTA 2.0.0
TEsorter 1.4.6
Helixer 0.3.4
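
Before starting, it can help to confirm that the command-line tools are reachable. The following is a minimal sketch, not part of this repository: the executable names in `ASSUMED_EXECUTABLES` are assumptions (Rscript for R, mafft for MAFFT, hmmsearch for HMMER) and should be adjusted or extended to match how the tools were installed on your system.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
ASSUMED_EXECUTABLES = ["Rscript", "mafft", "hmmsearch"]


def find_missing_tools(executables):
    """Return the executables that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed executables were found.")
```

This only checks that an executable exists on PATH; it does not verify the versions listed above.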

Putative centromeric repeat identification

Run annotation software:
- TRASH2
- EDTA
- TEsorter
Filter the annotations with the scripts found in /repeat_te_gene_annotations_parsing, in the order they appear. Modify the paths at the beginning of each script to match the outputs from the step above.
Calculate initial predictor scores for individual repeat families with the /centromere_identification/6_find_centromeric_repeat.R script.
Create a genomic landscape plot with /centromere_identification/10_global_plot.R to visualise the locations and predictor scores of the top-scoring families, and use it to decide on the putative centromeric repeat.

Centromere coordinates, gaps and genomic landscape replotting (Supplemental Data 1)

Most downstream analyses require a metadata .csv table with information on the previously identified centromeric families and a classification of each genome as A. satellite-based monocentric, B. satellite-based holocentric, C. transposon-based monocentric or D. unknown. A second metadata file is also required, containing information on each identified centromeric family (source genome name, repeat TRASH2 name and repeat custom name).
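
As an illustration of the genome metadata table, the sketch below reads a .csv and checks that every genome carries one of the four class codes. The column names (`genome`, `centromere_class`, `centromeric_family`) are hypothetical and are not taken from the scripts; check the scripts themselves for the layout they expect.

```python
import csv
import io

# Class codes follow the four categories described above.
CENTROMERE_CLASSES = {
    "A": "satellite-based monocentric",
    "B": "satellite-based holocentric",
    "C": "transposon-based monocentric",
    "D": "unknown",
}

# Hypothetical column names, chosen for this example only.
REQUIRED_COLUMNS = ("genome", "centromere_class", "centromeric_family")


def load_genome_metadata(handle):
    """Read metadata rows from an open CSV handle and validate the class codes."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row["centromere_class"] not in CENTROMERE_CLASSES:
            raise ValueError(
                f"line {line_number}: unknown class {row['centromere_class']!r}"
            )
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = io.StringIO(
        "genome,centromere_class,centromeric_family\n"
        "genome_1,A,family_1\n"
        "genome_2,D,\n"
    )
    for row in load_genome_metadata(example):
        print(row["genome"], CENTROMERE_CLASSES[row["centromere_class"]])
```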

/centromere_identification/11_array_cen_pericen_coordinates.R Identifies centromeric and centromere-proximal regions for genomes with identified centromeric repeats
/centromere_identification/12_find_gaps.R Identifies gaps in the centromeric arrays
/centromere_identification/10.1_global_plot_plus_cen_coordinates.R Replots the genomic landscape plot with additional information from the /11_array_cen_pericen_coordinates.R script and highlights the identified centromeric repeat
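
To make the idea of a gap in a centromeric array concrete, here is a minimal sketch that reports uncovered stretches between annotated repeats. It is a generic interval calculation and not the algorithm in 12_find_gaps.R; the function name, the 1-based inclusive coordinates and the `min_gap` parameter are assumptions made for this example.

```python
def find_array_gaps(repeats, min_gap=1):
    """Return (gap_start, gap_end) pairs not covered by any repeat.

    repeats: iterable of (start, end) tuples, 1-based and inclusive.
    min_gap: smallest gap length, in bp, that is reported.
    """
    gaps = []
    covered_end = None
    for start, end in sorted(repeats):
        if covered_end is not None and start - covered_end - 1 >= min_gap:
            gaps.append((covered_end + 1, start - 1))
        # Track the furthest covered position so overlapping repeats are handled.
        covered_end = end if covered_end is None else max(covered_end, end)
    return gaps


if __name__ == "__main__":
    array = [(1, 100), (101, 200), (150, 220), (351, 500)]
    print(find_array_gaps(array))
```

Only gaps between the first and last repeat are reported; sequence outside the array is ignored.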

HOR Analysis

/hor_scripts/make_hor_scripts.R Determines which chromosomes contain centromeric satellites and the names of those satellites, then creates SLURM submission commands that analyse them individually
/hor_scripts/9_HOR_periods.R Creates visualisations of the identified HORs in order to assess run completeness and to inform on the HOR content of individual genomes
/hor_scripts/0.01_HOR_score.R Calculates a HOR score for each individual repeat in which HORs were identified
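
The per-chromosome submission step can be pictured as follows. This is a hedged sketch, not the output of make_hor_scripts.R: the script name `per_chromosome_hor.R`, its argument order and the job naming scheme are hypothetical. Only the `sbatch` options `--job-name` and `--wrap` are standard SLURM.

```python
import shlex


def build_sbatch_commands(targets, script="per_chromosome_hor.R"):
    """Build one sbatch command per (genome, chromosome, repeat_name) tuple.

    The script name and its arguments are placeholders for illustration.
    """
    commands = []
    for genome, chromosome, repeat_name in targets:
        rscript = " ".join(
            shlex.quote(part)
            for part in ["Rscript", script, genome, chromosome, repeat_name]
        )
        job_name = f"hor_{genome}_{chromosome}"
        commands.append(
            f"sbatch --job-name={shlex.quote(job_name)} --wrap={shlex.quote(rscript)}"
        )
    return commands


if __name__ == "__main__":
    for command in build_sbatch_commands([("ddAraThal4.1", "chr1", "family_1")]):
        print(command)
```

Quoting each part with `shlex.quote` keeps genome or repeat names containing spaces or shell metacharacters from breaking the submitted command.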

Downstream Analysis

/main_downstream_analyses Contains the scripts used for the remaining analyses. Script names describe their purpose, and a comments section at the beginning of many scripts adds details on functionality and algorithms where needed. Although the scripts are numbered, none of them requires earlier scripts in this directory to have finished before it can run.

Summary of genomes and centromeric satellites (Supplemental Table 1)

All scripts mentioned in the Putative centromeric repeat identification, Centromere coordinates, gaps and genomic landscape replotting, and HOR Analysis sections above must be run first. Three additional scripts, which perform the most computationally expensive steps of the summary, must also finish before the full table can be constructed:

/main_summary/13.1_table_S1_HOR_stats.R Reads in and summarises results of the HOR analyses
/main_summary/13.2_table_S1_GC.R Calculates GC values for genomes, chromosomes and genomic intervals (repeats, transposons, their subsets etc.)
/main_downstream_analyses/37_similarity_values_within_between_chr.r Calculates similarity values within and between chromosomes for centromeric repeats

/main_summary/13_table_S1_whole_summary.R Generates the full table for all analysed genomes
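
For orientation, a GC value for a sequence or for a set of genomic intervals can be computed as in the sketch below. This is not the code in 13.2_table_S1_GC.R. The function names are hypothetical, and the choices to ignore non-ACGT characters such as N and to use 1-based inclusive intervals are assumptions of this example.

```python
def gc_fraction(sequence):
    """Return the GC fraction over A/C/G/T bases, or None if there are none."""
    upper = sequence.upper()
    gc = upper.count("G") + upper.count("C")
    at = upper.count("A") + upper.count("T")
    total = gc + at
    return gc / total if total else None


def interval_gc(sequence, intervals):
    """Return the GC fraction for each (start, end) interval, 1-based inclusive."""
    return [gc_fraction(sequence[start - 1:end]) for start, end in intervals]


if __name__ == "__main__":
    chromosome = "AAAAGGGGCCNNTT"
    print(gc_fraction(chromosome))
    print(interval_gc(chromosome, [(1, 4), (5, 10)]))
```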

dN/dS analysis

/dNdS/analyze_dnds.ipynb Describes and visualises the CENH3 protein site dN/dS analysis

Transposable element and gap Analysis

The /te_and_gap_analysis/ directory contains scripts for the transposon reannotation and rescue steps. Scripts required to create individual figures are in subdirectories:

/te_and_gap_analysis/Figure_3/
/te_and_gap_analysis/Figure_4/
/te_and_gap_analysis/supp_data/

Additional information is provided in /te_and_gap_analysis/te_readme.txt file. bedtools v2.27.1 is used to intersect coordinates and extract sequences. Required libraries are specified in each R script.
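
The two bedtools operations mentioned above can be assembled as shown in this sketch. The helper names and file paths are hypothetical, and the exact options used in the analysis are documented in the scripts and te_readme.txt rather than here. The flags shown (`intersect -a/-b`, `getfasta -fi/-bed/-fo`) are standard bedtools options.

```python
import subprocess


def intersect_command(a_bed, b_bed):
    """Command that reports the parts of features in a_bed overlapping b_bed."""
    return ["bedtools", "intersect", "-a", a_bed, "-b", b_bed]


def getfasta_command(genome_fasta, bed, out_fasta):
    """Command that extracts the sequences of the BED intervals from a FASTA."""
    return ["bedtools", "getfasta", "-fi", genome_fasta, "-bed", bed, "-fo", out_fasta]


def run_to_file(command, out_path):
    """Run a command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


if __name__ == "__main__":
    # Placeholder file names; printing avoids requiring bedtools for this demo.
    print(" ".join(intersect_command("transposons.bed", "centromeres.bed")))
    print(" ".join(getfasta_command("genome.fa", "overlaps.bed", "overlaps.fa")))
```

`bedtools intersect` writes to standard output, so `run_to_file` redirects it, whereas `getfasta` writes directly to the file given with `-fo`.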

Other

/FISH_oligo-pools/ contains pools of oligonucleotides used in FISH experiments
Figshare S1 figure: plots made with the /centromere_identification/10.1_global_plot_plus_cen_coordinates.R script, which make up the Supplemental Data 1 figure
Figshare additional data: all main data created by the scripts in this repository
Figure 1a tree: interactive tree from Figure 1a
CENP-A tree: interactive CENP-A tree from Figure 1b

vlothec/193centromeres
