- bedtools: bedtools
- pysam: pip install pysam
- Run on linux
python strain_specific_manifest.py -h
usage: python strain_specific_manifest.py vcf manifest outdir
Manifest file should contain columns of chr, start, end, M,U,target, col,Probe_ID, mapFlag,mask
it can be generated by:
zcat MM285.mm10.manifest.tsv.gz | sed '1d' | cut -f 1-6,8-9,11,31 > test/manifest.bed
For vcf file, there must be one line starts with #CHROM
generate strain-specific manifests for SeSAMe
optional arguments:
-h, --help show this help message and exit
-v, --version Display version
-vcf <file>, --vcf <file>
input vcf
-mft <file>, --mft <file>
Input manifest, a bed file contains columns of chr,start,end, M,U,target,col,Probe_ID, mapFlag,mask,
but should not contain header.
-ref <file>, --Reference <file>
reference genome, .fa
-o <outdir>, --outdir <outdir>
output directory
example usage: strain_specific_manifest.py -vcf mgp.v5.merged.snps_all.dbSNP142.vcf.gz -mft MM285.mm10.manifest.bed -ref /mnt/isilon/zhou_lab/projects/20191221_references/mm10/mm10.fa -o test_dir
Please Note:
If the chromosome names in the input vcf, manifest and reference genome were not consistent, the script would handle it automatically by adding or removing 'chr' or replace chrM to chrMT or chrMT to chrM.
If the output of bedtools had already existed under outdir, the process of running bedtools would be skipped.
mkdir Demo
zcat /mnt/isilon/zhoulab/labprojects/20200228_Mouse_Array_Project/20210324_SeSAme_Wubin/data/strainSpecificManifest/MM285.mm10.manifest.tsv.gz | sed '1d' | cut -f 1-6,8-9,11,31 > Demo/MM285.mm10.manifest.bed
python strain_specific_manifest.py -vcf /mnt/isilon/zhoulab/labprojects/20200228_Mouse_Array_Project/20210324_SeSAme_Wubin/data/strainSpecificManifest/mgp.v5.merged.snps_all.dbSNP142.vcf.gz -mft Demo/MM285.mm10.manifest.bed -ref /mnt/isilon/zhou_lab/projects/20191221_references/mm10/mm10.fa -o Demo
All files would be generated under outdir [Demo]
Demo
├── chr1.bed
├── manifest.6bp.bed
├── manifest.6bp.sorted.bed
├── manifest.checked.bed
├── manifests
│ ├── 129P2_OlaHsd.txt
│ ├── 129S1_SvImJ.txt
│ ├── 129S5SvEvBrd.txt
│ ├── A_J.txt
│ ├── AKR_J.txt
│ ├── BALB_cJ.txt
│ ├── BTBR_T+_Itpr3tf_J.txt
│ ├── BUB_BnJ.txt
│ ├── C3H_HeH.txt
│ ├── C3H_HeJ.txt
│ ├── C57BL_10J.txt
│ ├── C57BL_6NJ.txt
│ ├── C57BR_cdJ.txt
│ ├── C57L_J.txt
│ ├── C58_J.txt
│ ├── CAST_EiJ.txt
│ ├── CBA_J.txt
│ ├── DBA_1J.txt
│ ├── DBA_2J.txt
│ ├── FVB_NJ.txt
│ ├── I_LnJ.txt
│ ├── KK_HiJ.txt
│ ├── LEWES_EiJ.txt
│ ├── LP_J.txt
│ ├── MOLF_EiJ.txt
│ ├── NOD_ShiLtJ.txt
│ ├── NZB_B1NJ.txt
│ ├── NZO_HlLtJ.txt
│ ├── NZW_LacJ.txt
│ ├── PWK_PhJ.txt
│ ├── RF_J.txt
│ ├── SEA_GnJ.txt
│ ├── SPRET_EiJ.txt
│ ├── ST_bJ.txt
│ ├── WSB_EiJ.txt
│ └── ZALENDE_EiJ.txt
├── manifest.var.overlapped.annotated.txt
├── manifest.var.overlapped.bed
├── manifest.var.overlapped.filtered.txt
├── MM285.mm10.manifest.bed
└── strain_specific_snp.txt
head Demo/manifests/A_J.txt
Probe_ID M U col mask
cg36602742_TC11 28608858.0 20700992 R FALSE
cg36602743_TC21 55772828 FALSE
cg36602902_BC11 90750550.0 93783238 R FALSE
cg36603287_TC21 13783923 FALSE
cg36603393_BC21 74727392 FALSE
cg36603791_TC21 90692186 FALSE
cg36603848_TC21 26796263 FALSE
cg36604001_TC21 86700295 FALSE
cg36604489_TC21 58682505 FALSE
strain_manifests_dir="Demo/manifests"
library(sesame)
library(stringr)
manifest=sesameDataGet("MM285.address")
addr <- list()
strains <- list()
for(name in list.files(path=strain_manifests_dir)) {
print(name)
strain <- str_replace(name,".txt",'')
print(strain)
mft <- read.csv(paste(strain_manifests_dir,name,sep='/'),sep='\t',header=T,check.names = F)
mft <- mft[match(mft$Probe_ID,manifest$ordering$Probe_ID),]
mft[mft$col=='','col'] <- NA
L <- list(col=as.factor(mft$col),mask=mft$mask,Strain=strain)
strains[[strain]] <- L
}
addr$ordering <- manifest$ordering
addr$strains <- strains
length(names(addr$strains))
class(addr$strains$PWK_PhJ$mask)
saveRDS(addr, file = "/Demo/MM285.addressStrain.rds")