caixu0518/PanK-Pipeline


PanK-Pipeline: A Pan-genome K-mer Pipeline for Population Analysis

Introduction

Installation

PanK-Pipeline requires no installation, but it depends on the following tools:

Required:

  1. Perl. PanK-Pipeline was run with Perl version 5.28.3.
  2. jellyfish. In this pipeline, jellyfish is mainly used to quickly generate k-mers from resequencing reads and to perform k-mer queries.
  3. bedtools (version v2.27.1 was used).

Optional:

  1. tabix/bgzip (tabix version 0.2.5 (r1005) was used).
  2. plink. plink (v1.90b6.21 was used) is mainly used for population structure and PCA analysis with k-mers.
  3. VCF2Dis. VCF2Dis (VCF2Dis-1.54 was used) is used to build the phylogenetic tree from the k-mer presence/absence matrix.
  4. faststructure. faststructure (v1.0 was used) is used for population structure analysis. We recommend the Docker repository dockerbiotools/faststructure.

Inputs

  1. Pan-genome assemblies in fasta format.
  2. Resequencing population reads.

Outputs

  1. Pan-genome polymorphic k-mers.
  2. Pan-genome representative k-mers.
  3. A VCF file containing the genotypes of all representative k-mers across all individuals in the resequencing population.

PanK-Pipeline

Here, we use the Brassica rapa pangenome, comprising 30 assemblies, together with resequencing data from 1,543 accessions, to demonstrate PanK-Pipeline. The pipeline is broadly applicable to pangenome assemblies and population resequencing data from any species.

Step 1: Constructing pan-genome polymorphic k-mers from 30 B. rapa genome assemblies.

The script Generate_PolymorphyicKmers.pl constructs polymorphic k-mers from the pan-genome assemblies:

perl  Generate_PolymorphyicKmers.pl  -species Brapa  -ksize 17  -pangenome  Pangenome.txt  -PipelinePath  /mydata/caix/PanK-Pipeline

-species        [required]    Prefix for output files, e.g. Brapa
-ksize          [required]    k-mer size, e.g. 17
-pangenome      [required]    A file containing the abbreviation of each pan-genome member together with the absolute path to its genome sequence.
-PipelinePath   [required]    The absolute path to the PanK-Pipeline directory, e.g. /mydata/caix/PanK-Pipeline

##- A detailed example of the '-pangenome' parameter input file: the first column specifies the abbreviation, and the second column specifies the absolute path to the genome FASTA file, with columns separated by a tab.
Bra_A03.A04	/mydata/caix/PanK-Pipeline/data/aa/new/Bra_A03.A04.fasta
Bra_BRO.A04	/mydata/caix/PanK-Pipeline/data/aa/new/Bra_BRO.A04.fasta
Bra_CCA.A04	/mydata/caix/PanK-Pipeline/data/aa/new/Bra_CCA.A04.fasta
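
Because a malformed '-pangenome' file is an easy mistake to make, it can help to check the file before launching Step 1. The following is a minimal Python sketch, not part of PanK-Pipeline; the function name read_pangenome_list is hypothetical, and the duplicate-abbreviation check is an assumption that goes beyond what the README states.

```python
from pathlib import Path


def read_pangenome_list(path):
    """Parse a two-column, tab-separated pan-genome list into {abbreviation: fasta_path}."""
    members = {}
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # tolerate blank lines
            fields = line.split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"line {line_no}: expected 2 tab-separated columns, got {len(fields)}"
                )
            abbrev, fasta = fields[0].strip(), fields[1].strip()
            if abbrev in members:
                # Assumption: abbreviations should be unique across the pan-genome.
                raise ValueError(f"line {line_no}: duplicate abbreviation {abbrev!r}")
            if not Path(fasta).is_absolute():
                raise ValueError(f"line {line_no}: path is not absolute: {fasta!r}")
            members[abbrev] = fasta
    return members
```

Calling read_pangenome_list("Pangenome.txt") returns a dictionary keyed by abbreviation, or raises a ValueError naming the offending line.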

##- The output files of Step 1 have the suffix “.list.Polymorphic_kmers.List” and serve as the input files for Step 2. Each file contains three columns: (1) the k-mer sequence; (2) the total number of occurrences of the k-mer across the pangenome; and (3) the frequency of the k-mer among the pangenome members, calculated as the number of genomes containing the k-mer divided by the total number of genomes.
For example:
ATTGCATTCCTTAAAAC  100  0.33
CTGACCTCCTTTGTCTC  190  0.43
GGGCATCCACGACTTTA  80  0.20
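
To make the second and third columns concrete, the sketch below derives them from per-genome k-mer counts. It illustrates the column definitions only and is not the code used by Generate_PolymorphyicKmers.pl; the function name summarize_kmers and the in-memory dictionary input are assumptions made for the example.

```python
from collections import Counter


def summarize_kmers(per_genome_counts):
    """Summarize k-mers across a pan-genome.

    per_genome_counts: {genome_abbreviation: {kmer: count}}
    Returns {kmer: (total_occurrences, frequency)}, where frequency is the
    proportion of genomes that contain the k-mer at least once.
    """
    n_genomes = len(per_genome_counts)
    totals = Counter()    # column 2: occurrences summed over all genomes
    carriers = Counter()  # number of genomes containing the k-mer
    for counts in per_genome_counts.values():
        for kmer, count in counts.items():
            if count > 0:
                totals[kmer] += count
                carriers[kmer] += 1
    # column 3: genomes containing the k-mer / total genomes
    return {kmer: (totals[kmer], carriers[kmer] / n_genomes) for kmer in totals}
```

With four genomes, a k-mer seen twice in one genome and once in another has a total of 3 occurrences and a frequency of 0.5.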

Step 2: Identifying pan-genome representative k-mers across the B. rapa species.

perl  Generate_RepresentativeKmers.pl  -species  Brapa  -ksize  17  -pangenome  Pangenome.txt    -PipelinePath   /mydata/caix/PanK-Pipeline   -PolymorphicKmer  rapa.merged.kmer.k17.list.Polymorphic_kmers.List
-species            [required]    Prefix for output files, e.g. Brapa
-ksize              [required]    k-mer size, e.g. 17
-pangenome          [required]    A file containing the abbreviation of each pan-genome member together with the absolute path to its genome sequence.
-PipelinePath       [required]    The absolute path to the PanK-Pipeline directory, e.g. /mydata/caix/PanK-Pipeline
-PolymorphicKmer    [required]    The polymorphic k-mer list generated in Step 1, e.g. rapa.merged.kmer.k17.list.Polymorphic_kmers.List

##- 'rapa.merged.kmer.k17.list.Polymorphic_kmers.List' is generated by Step 1.
##- The output files of Step 2 have the suffix “.representative.list” and share the same format as the “.list.Polymorphic_kmers.List” files generated in Step 1.

Step 3: Applying pan-genome representative k-mers to population structure analysis in B. rapa.

The output VCF file (kmer.gt.vcf.gz) and its corresponding index file (kmer.gt.vcf.gz.tbi) can be used as input for downstream analyses, including phylogenetic tree construction with VCF2Dis, principal component analysis (PCA) with plink, and population structure inference with faststructure.

##- Genotyping representative k-mers in each resequencing accession. The shell script Batch_KmerGenotyping.sh calls CountKmersInReads.pl to genotype k-mers in batch across all resequenced samples.
sh  Batch_KmerGenotyping.sh   *.representative.list   samids.txt  17  /mydata/caix/CC_k17_analysis/jffiles  /mydata/caix/PanK-Pipeline/scripts
Input parameters for Batch_KmerGenotyping.sh:
1. *.representative.list: the representative k-mers generated in Step 2.
2. samids.txt: a file listing all resequenced sample IDs. Each ID must start with a letter; a length of fewer than 8 characters is recommended.
3. 17: the k-mer size.
4. /mydata/caix/CC_k17_analysis/jffiles: the directory containing the jf files of the specified k-mer length generated from the resequencing reads. We recommend converting each resequencing dataset into the binary jf format with Jellyfish before running the pipeline, to reduce disk storage requirements.
5. /mydata/caix/PanK-Pipeline/scripts: the scripts directory of PanK-Pipeline.
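
The conversion of reads into jf files recommended above can be scripted. The sketch below only builds a `jellyfish count` command line, so it can be inspected before anything is run. The file names S001.fastq and S001.k17.jf, the hash size, and the thread count are hypothetical. Whether the pipeline expects canonical k-mers (`-C`) is not stated in this README, so treat that flag as an assumption and drop it if it does not match your setup.

```python
import shlex


def jellyfish_count_command(reads, out_jf, ksize=17, hash_size="1G", threads=8):
    """Build a `jellyfish count` command that writes a binary .jf database.

    -m: k-mer length, -s: initial hash size, -t: threads,
    -C: count canonical k-mers (assumption, see lead-in), -o: output file.
    """
    return [
        "jellyfish", "count",
        "-m", str(ksize),
        "-s", hash_size,
        "-t", str(threads),
        "-C",
        "-o", out_jf,
        reads,  # assumed to be an uncompressed FASTQ/FASTA file
    ]


# Hypothetical sample; print the command instead of running it.
cmd = jellyfish_count_command("S001.fastq", "S001.k17.jf")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the arguments look right. The k-mer size must match the one used in Steps 1 and 2.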

Note: The Batch_KmerGenotyping.sh program outputs the genotyping results of representative k-mers for each resequenced accession, which are stored in the 'countresults' directory.
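
The sample ID rules for samids.txt (start with a letter, preferably fewer than 8 characters) can be checked before genotyping. This is a minimal sketch, not part of PanK-Pipeline; the function name check_sample_ids is hypothetical, and the whitespace and duplicate checks are extra assumptions not stated in the README.

```python
import re

# Starts with a letter; the "no whitespace" part is an added assumption.
ID_PATTERN = re.compile(r"^[A-Za-z]\S*$")


def check_sample_ids(ids, max_len=7):
    """Return (errors, warnings) for a list of sample IDs."""
    errors, warnings = [], []
    seen = set()
    for sample_id in ids:
        if not ID_PATTERN.match(sample_id):
            errors.append(f"{sample_id!r}: must start with a letter and contain no whitespace")
        if sample_id in seen:
            errors.append(f"{sample_id!r}: duplicate ID")  # assumption: IDs are unique
        seen.add(sample_id)
        if len(sample_id) > max_len:
            # Length is only a recommendation, so report it as a warning.
            warnings.append(f"{sample_id!r}: longer than the recommended {max_len} characters")
    return errors, warnings
```

Reading the IDs with `[line.strip() for line in open("samids.txt") if line.strip()]` and passing them to check_sample_ids gives a quick report before the batch run starts.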

##- The current script PopKmerGenotypesToVCF.pl is used to aggregate the genotyping results from each resequenced accession into a single VCF file.
perl PopKmerGenotypesToVCF.pl  -Sam   samid.txt    -KmerGTDir  /mydata/caix/CC_k17_analysis/genotyping/countresults  -KmerList  merged.kmers.list.Polymorphic_kmers.List.representative.list   -threads    20
-Sam         [Required]  A file listing all resequenced sample IDs. Each ID must start with a letter; a length of fewer than 8 characters is recommended.
-KmerGTDir   [Required]  The directory storing the genotyping results of each resequenced accession.
-KmerList    [Required]  The k-mer list that was used as input when genotyping k-mers in each resequencing accession. Be sure to use the same file: all genotyping result files have the same number of lines, and the line numbers (starting from 1) serve as positions in the VCF.
-threads     [Optional]  The number of threads used by the script. Default: 60

Note: The PopKmerGenotypesToVCF.pl program generates the VCF file kmer.gt.vcf.gz and its corresponding index file kmer.gt.vcf.gz.tbi. Both files serve as input for subsequent population analyses.
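
The reason the same k-mer list must be used throughout is that a k-mer's VCF position is simply its 1-based line number in that list. The sketch below illustrates this bookkeeping on in-memory data. It is not the implementation of PopKmerGenotypesToVCF.pl: the layout of the files in 'countresults' is not documented here, so the per-sample input (one call per k-mer, in list order) and the function name merge_genotypes are assumptions.

```python
def merge_genotypes(kmer_list, calls_by_sample):
    """Combine per-sample calls into rows of (position, kmer, [call per sample]).

    kmer_list: k-mers in the order of the representative k-mer list.
    calls_by_sample: {sample_id: [call, ...]}, one call per k-mer, same order.
    """
    n_kmers = len(kmer_list)
    for sample_id, calls in calls_by_sample.items():
        if len(calls) != n_kmers:
            # A length mismatch suggests a different k-mer list was used for this sample.
            raise ValueError(f"{sample_id}: {len(calls)} calls but {n_kmers} k-mers")
    samples = list(calls_by_sample)
    rows = []
    for position, kmer in enumerate(kmer_list, start=1):  # position = line number
        rows.append((position, kmer, [calls_by_sample[s][position - 1] for s in samples]))
    return samples, rows
```

If any sample was genotyped against a different list, the length check fails early instead of silently shifting positions.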

Scripts for conducting population domestication analysis

##- The script Generate_enrichmentKmersInTargetGroup.pl is used to detect candidate domestication-related k-mers by comparing the derived group with the control group.
perl Generate_enrichmentKmersInTargetGroup.pl -deriviedGroup Cabbageid.txt -vcfIn CC.part.vcf -output CC.candidate.kmer.list
-deriviedGroup   [required]  The file contains the IDs of all members of the derived group.
-vcfIn           [required]  The VCF file containing the genotypes of all representative k-mers across all individuals in the resequencing population
-output          [required]  The output file name
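
This README does not describe the statistic that Generate_enrichmentKmersInTargetGroup.pl applies. Purely as an illustration of what "enriched in the derived group" could mean, the sketch below compares the presence frequency of each k-mer between the derived group and all remaining samples. The function name enriched_kmers, the 0/1 presence encoding, and the min_diff threshold are all assumptions, not the script's actual method.

```python
def enriched_kmers(presence, derived_ids, min_diff=0.5):
    """Find k-mers that are more frequent in the derived group than in the controls.

    presence: {kmer_index: {sample_id: 0 or 1}}
    Returns [(kmer_index, derived_freq, control_freq)] for k-mers whose presence
    frequency in the derived group exceeds the control frequency by >= min_diff.
    """
    derived_ids = set(derived_ids)
    hits = []
    for index, calls in presence.items():
        derived = [v for s, v in calls.items() if s in derived_ids]
        control = [v for s, v in calls.items() if s not in derived_ids]
        if not derived or not control:
            continue  # both groups are needed for a comparison
        derived_freq = sum(derived) / len(derived)
        control_freq = sum(control) / len(control)
        if derived_freq - control_freq >= min_diff:
            hits.append((index, derived_freq, control_freq))
    return hits
```

A real analysis would likely add a significance test and handle missing genotypes; this version only shows the group comparison itself.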

##- The script AssignCandidatekmersToAssembly.pl is used to assign the candidate k-mers to the different assemblies.
perl AssignCandidatekmersToAssembly.pl -candidateKmers CC.candidate.kmer.list -fastaKmers fastaKmerfile.txt -representativeKmers merged.kmers.list.Polymorphic_kmers.List.representative.list
-candidateKmers        [required] Candidate k-mer index, output from  Generate_enrichmentKmersInTargetGroup.pl
-fastaKmers            [required] The file includes the filenames of the k-mer files corresponding to each assembly. These k-mer files are generated in Step 1 and are located in the 'FastaKmerList' directory.
-representativeKmers   [required] Representative k-mer list file

Inputs:
1. "fastaKmerfile.txt": lists one k-mer file per line; each line is the k-mer file of one genome sequence. For example:
T01.k17.clean.list.gz
T02.k17.clean.list.gz
Outputs:
1. "Kmers.sample.log": the proportion of domestication-related k-mers of the derived group in each genome.
2. "results": a directory containing the genomic locations of the candidate k-mers in each genome.

Additional scripts to perform population structure analysis

##- Build the phylogenetic tree from the VCF file containing the genotypes of all representative k-mers across all individuals in the resequencing population.
VCF2Dis_multi -InPut  kmer.gt.vcf   -Threads 80   -OutPut  kmer.gt.vcf.mat

##- PCA analysis
plink  --vcf  ${vcfIn}  --recode    --allow-extra-chr  --out  ${prefix}   --vcf-half-call  haploid
plink --file ${prefix}   --make-bed  --out ${prefix}  --allow-extra-chr
plink  --noweb --bfile  ${prefix}  --pca 20 --allow-extra-chr  --out plink.pca
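
With `--out plink.pca`, PLINK 1.9 writes the sample coordinates to plink.pca.eigenvec, one line per sample with the family ID, the individual ID, and then one value per principal component. The following minimal sketch loads that file for plotting or further analysis; the function name read_eigenvec is hypothetical, and the header handling is a precaution in case a header row was requested.

```python
def read_eigenvec(path):
    """Parse a PLINK .eigenvec file into {(FID, IID): [PC1, PC2, ...]}."""
    coords = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields or fields[0] in ("FID", "#FID"):
                continue  # skip blank lines and an optional header row
            coords[(fields[0], fields[1])] = [float(x) for x in fields[2:]]
    return coords
```

For example, `read_eigenvec("plink.pca.eigenvec")` returns up to 20 components per sample for the `--pca 20` run above, and the first two values of each entry are the usual axes for a PCA scatter plot.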

##- Population structure analysis
/usr/bin/python    /fastStructure/structure.py   --input=$input  --output=final  -K $k

Citations
