Skip to content

zhou-lab/InfiniumManifestAnnotator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 

Repository files navigation

Dependencies

  • bedtools: bedtools
  • pysam: pip install pysam
  • Run on linux

Generating strain-specific manifest with given vcf, manifest, and reference genome

python strain_specific_manifest.py -h
usage: python strain_specific_manifest.py vcf manifest outdir  

Manifest file should contain columns of chr, start, end, M,U,target, col,Probe_ID, mapFlag,mask     
it can be generated by:
zcat MM285.mm10.manifest.tsv.gz | sed '1d' | cut -f 1-6,8-9,11,31 > test/manifest.bed

For vcf file, there must be one line starts with #CHROM

generate strain-specific manifests for SeSAMe

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Display version
  -vcf <file>, --vcf <file>
                        input vcf
  -mft <file>, --mft <file>
                        Input manifest, a bed file contains columns of chr,start,end, M,U,target,col,Probe_ID, mapFlag,mask,
                        but should not contain header.
  -ref <file>, --Reference <file>
                        reference genome, .fa
  -o <outdir>, --outdir <outdir>
                        output directory

example usage: strain_specific_manifest.py -vcf mgp.v5.merged.snps_all.dbSNP142.vcf.gz -mft MM285.mm10.manifest.bed -ref /mnt/isilon/zhou_lab/projects/20191221_references/mm10/mm10.fa -o test_dir

Please Note:

If the chromosome names in the input vcf, manifest and reference genome were not consistent, the script would handle it automatically by adding or removing 'chr' or replace chrM to chrMT or chrMT to chrM.

If the output of bedtools had already existed under outdir, the process of running bedtools would be skipped.

Demo

mkdir Demo
zcat /mnt/isilon/zhoulab/labprojects/20200228_Mouse_Array_Project/20210324_SeSAme_Wubin/data/strainSpecificManifest/MM285.mm10.manifest.tsv.gz | sed '1d' | cut -f 1-6,8-9,11,31 > Demo/MM285.mm10.manifest.bed
python strain_specific_manifest.py -vcf /mnt/isilon/zhoulab/labprojects/20200228_Mouse_Array_Project/20210324_SeSAme_Wubin/data/strainSpecificManifest/mgp.v5.merged.snps_all.dbSNP142.vcf.gz -mft Demo/MM285.mm10.manifest.bed -ref /mnt/isilon/zhou_lab/projects/20191221_references/mm10/mm10.fa -o Demo

All files would be generated under outdir [Demo]

Demo
├── chr1.bed
├── manifest.6bp.bed
├── manifest.6bp.sorted.bed
├── manifest.checked.bed
├── manifests
│   ├── 129P2_OlaHsd.txt
│   ├── 129S1_SvImJ.txt
│   ├── 129S5SvEvBrd.txt
│   ├── A_J.txt
│   ├── AKR_J.txt
│   ├── BALB_cJ.txt
│   ├── BTBR_T+_Itpr3tf_J.txt
│   ├── BUB_BnJ.txt
│   ├── C3H_HeH.txt
│   ├── C3H_HeJ.txt
│   ├── C57BL_10J.txt
│   ├── C57BL_6NJ.txt
│   ├── C57BR_cdJ.txt
│   ├── C57L_J.txt
│   ├── C58_J.txt
│   ├── CAST_EiJ.txt
│   ├── CBA_J.txt
│   ├── DBA_1J.txt
│   ├── DBA_2J.txt
│   ├── FVB_NJ.txt
│   ├── I_LnJ.txt
│   ├── KK_HiJ.txt
│   ├── LEWES_EiJ.txt
│   ├── LP_J.txt
│   ├── MOLF_EiJ.txt
│   ├── NOD_ShiLtJ.txt
│   ├── NZB_B1NJ.txt
│   ├── NZO_HlLtJ.txt
│   ├── NZW_LacJ.txt
│   ├── PWK_PhJ.txt
│   ├── RF_J.txt
│   ├── SEA_GnJ.txt
│   ├── SPRET_EiJ.txt
│   ├── ST_bJ.txt
│   ├── WSB_EiJ.txt
│   └── ZALENDE_EiJ.txt
├── manifest.var.overlapped.annotated.txt
├── manifest.var.overlapped.bed
├── manifest.var.overlapped.filtered.txt
├── MM285.mm10.manifest.bed
└── strain_specific_snp.txt
head Demo/manifests/A_J.txt
Probe_ID        M       U       col     mask
cg36602742_TC11 28608858.0      20700992        R       FALSE
cg36602743_TC21         55772828                FALSE
cg36602902_BC11 90750550.0      93783238        R       FALSE
cg36603287_TC21         13783923                FALSE
cg36603393_BC21         74727392                FALSE
cg36603791_TC21         90692186                FALSE
cg36603848_TC21         26796263                FALSE
cg36604001_TC21         86700295                FALSE
cg36604489_TC21         58682505                FALSE

generate MM285.addressStrain in R if necessary

strain_manifests_dir="Demo/manifests"
library(sesame)
library(stringr)
manifest=sesameDataGet("MM285.address")
addr <- list()
strains <- list()
for(name in list.files(path=strain_manifests_dir)) {
  print(name)
  strain <- str_replace(name,".txt",'')
  print(strain)
  mft <- read.csv(paste(strain_manifests_dir,name,sep='/'),sep='\t',header=T,check.names = F)
  mft <- mft[match(mft$Probe_ID,manifest$ordering$Probe_ID),]
  mft[mft$col=='','col'] <- NA
  L <- list(col=as.factor(mft$col),mask=mft$mask,Strain=strain)
  strains[[strain]] <- L
}
addr$ordering <- manifest$ordering
addr$strains <- strains
length(names(addr$strains))
class(addr$strains$PWK_PhJ$mask)
saveRDS(addr, file = "/Demo/MM285.addressStrain.rds")

About

Scripts for annotating Infinium manifest for genetic variation influence

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages