A curated list of awesome resources for evolutionary and population genomics [work in progress]
- InternationalGenome.org - data portal for the 1000 Genomes and other projects
- Genome Projects at Max Planck Institute for Evolutionary Anthropology - sequencing data for the Neanderthals and Denisovans
- http://cdna.eva.mpg.de/neandertal/ - direct downloads
- Simons Genome Diversity Project (SGDP) - WGS data from 142 populations around the world
- The Allen Ancient DNA Resource (AADR) - ancient and modern human samples sequenced using the 1240K SNP panel
- Ultraconserved elements (UCEs) - resources for UCEs, a useful set of genome-wide markers, especially for non-model taxa without reference genomes. Conserved core sequences flanked by variable regions provide markers for studying evolution at different levels, from populations to phylogenomics at higher taxonomic ranks.
- The Vertebrate Genomes Project - aims to sequence genomes for all known vertebrate species
- VEuPath - database of eukaryotic pathogen, vector, and host informatics
- TriTryp - database of trypanosomatid parasites
Helpful tutorials, blogs, and books on topics in evomics, bioinformatics, and data science.
- Speciation genomics - tutorials covering around 70% of my PhD; too bad I found the page after my defense
- their GitHub includes example data, code, presentations, and other material
- Evomics.org - portal with materials from years of summer schools on evolutionary genomics
- The G-cat - genetic theory in nice digestible articles
- Introduction to the Command Line for Genomics - a course by Data Carpentry
- Population genetics and genomics in R - especially great for non-model taxa
- Bioinformatics Data Skills - awesome book by Vince Buffalo
- Data Science at the Command Line - great free book by Jeroen Janssens
- Ad Hoc Data Analysis From The Unix Command Line - free book at Wikibooks
- Bioconda - channel of bioinformatic software, for the conda / mamba package managers
- Conda-forge - channel of scientific software, for the conda / mamba package managers
- Homebrew Bio - repository of bioinformatic software for the Homebrew / Linuxbrew package managers
- Bioconductor - bioinformatic packages and versioned data in R
- MethodsPopGen.com - overview of software tools for population and evolutionary genomics, described in a review paper
- PLINK2 - toolkit for population genomics and GWAS
- EIGENSOFT - tools for analysis of populations, including population stratification and SmartPCA
- ADMIXTOOLS2 - R package with reimplementation of the original ADMIXTOOLS, with higher performance and easy scripting interface, plus a GUI webapp
- ADMIXTOOLS - the original ADMIXTOOLS package
- msprime - coalescent simulator
- SLiM - forward-time simulator for models of evolution, including spatial ones
- slendr - R interface to msprime and SLiM simulators, with support for spatial and non-spatial models
- stdpopsim - library of standard population genetic simulation models
While genotype matrices are the dominant data type in evomics, other data types and formats appear as well - from FASTA reference sequences or alignments, to genomic features and annotations.
- HTSlib - umbrella project for Samtools and related packages
- SeqKit - for efficient manipulation of FASTA/FASTQ formats
- SeqTK - for efficient manipulation of FASTA/FASTQ formats
- bioawk - extension of the AWK language with support for common bioinformatic formats and compressed data
- Seqmagick - a kickass little utility built in the spirit of imagemagick to expose the file format conversion in Biopython in a convenient way. Instead of having a big mess of scripts, there is one that takes arguments.
There is plenty of tabular data in bioinformatics, from the well-known formats to all kinds of metadata. Many tools were developed to process generic tabular data.
- structured text tools - overview of tools for processing structured text
- Miller - like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally indexed data
- csvtk - fast CSV/TSV toolkit in Go, with many features and simple plot functions
- xsv - fast CSV/TSV toolkit in Rust
- visidata - terminal spreadsheet app
- grabix - like tabix but for non-bio data (indexing by line numbers instead of genomic positions); fast slicing / random sampling of large compressed tabular data
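The idea behind indexing by line numbers can be illustrated with a minimal Python sketch. This is not how grabix is implemented (grabix works on BGZF-compressed files); it assumes a plain uncompressed TSV, and the function names and the demo table are hypothetical, made up for illustration. A single pass records the byte offset where each line starts, after which any slice or random sample of lines can be read with a direct seek instead of a scan.

```python
import os
import random
import tempfile


def build_line_index(path):
    """Return a list where entry i is the byte offset at which line i starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets


def fetch_lines(path, offsets, start, end):
    """Return lines start..end (1-based, inclusive) by seeking straight to them."""
    with open(path, "rb") as handle:
        handle.seek(offsets[start - 1])
        return [handle.readline().decode().rstrip("\n") for _ in range(start, end + 1)]


def sample_lines(path, offsets, k, seed=None):
    """Return k randomly chosen lines, in file order, without scanning the file."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(offsets)), k))
    lines = []
    with open(path, "rb") as handle:
        for i in chosen:
            handle.seek(offsets[i])
            lines.append(handle.readline().decode().rstrip("\n"))
    return lines


def make_demo_table(n_rows=1000):
    """Write a small hypothetical TSV of samples and return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, newline="")
    with handle:
        handle.write("sample\tpopulation\n")
        for i in range(1, n_rows + 1):
            handle.write(f"ind{i}\tpop{i % 5}\n")
    return handle.name


if __name__ == "__main__":
    table = make_demo_table()
    index = build_line_index(table)
    print(fetch_lines(table, index, 2, 4))        # rows right after the header
    print(sample_lines(table, index, 3, seed=1))  # three random lines
    os.remove(table)
```

The index costs one full read of the file, so the approach pays off when the same large table is sliced or sampled many times.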