b. Module assign
The assign module works on any input file, stream, or path (--scan_input is required for a path; the path is then converted to a list of the content found at that path). It connects to the database (or metadata file) and by default matches accession numbers in the input to data in user-specified columns. Columns can be selected as a single value, a comma-separated list, or a formatter string in which hashtags denote keys/columns in the database. There are options for handling accessions that have no data in the database (see --help).
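The formatter-string idea can be illustrated with a minimal Python sketch. This is not FlexMetR's own code: the function name expand_formatter and the rule that a key is "#" followed by letters or digits are assumptions made for illustration, and the sample row is taken from the example output in this section.

```python
import re

def expand_formatter(formatter, row):
    """Replace each '#key' in the formatter with row[key].

    Assumption for this sketch: a key is '#' followed by letters/digits,
    so '_' and '-' act as separators. Unknown keys are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        return str(row.get(key, match.group(0)))
    return re.sub(r"#([A-Za-z0-9]+)", substitute, formatter)

# One metadata row, keyed by column name
row = {
    "accession": "GCF_002099195.1",
    "family": "Burkholderiaceae",
    "genus": "Burkholderia",
    "species": "puraquae",
}
label = expand_formatter("#family_#genus_#species_#accession", row)
print(label)
```

Column names containing spaces (as in the BV-BRC example at the end of this section) would need a different key rule than the one assumed here.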
For example, this tree with accession numbers can be transformed to a tree with descriptive content ...
(GCF_002099195.1:0.02759,((((((((GCF_902833225.1:0.02552,(GCF_902832885.1:0.02849,GCF_003853705.1:0.02720):0.00068):0.00125,GCF_902499125.1:0.02845):0.00479,((((GCF_902833145.1:0.02595,GCF_001522585.2:0.02633):0.00034,(GCF_900446215.1:0.02284,GCF_014211915.1:0.02371):0.00176):0.00020,(GCF_016899425.1:0.02452,GCF_001547525.1:0.02527):0.00213):0.01022,(((((((GCF_902833085.1:0.02340,GCF_002097715.1:0.02327):0.01642,(GCF_902830275.1:0.02767,GCF_004343775.1:0.02945):0.01041):0.00417,GCF_003854275.1:0.04945):0.00450,(((((((GCF_900176645.1:0.01428,(((((((((((((GCF_014857925.1:0.02724,GCA_020629575.1:0.03144):0.02106,GCF_003428165.1:0.04207):0.01582,GCF_008711525.1:0.01004):0.02959,GCF_003574425.1:0.07988):0.04265,(GCF_003428155.1:0.11606,GCF_002803295.2:0.11648):0.01451):0.01625,GCF_000379445.1:0.15428):0.01906,GCF_003574485.1:0.10935):0.02542,(((((((GCF_012224145.1:0.04143,GCF_008711465.1:0.05362):0.00972,GCF_001895265.1:0.05662):0.01825,GCF_002211785.1:0.06785):0.02218,GCF_000764555.1:0.09965):0.01443,(GCF_009823375.1:0.01017,(GCF_007923265.1:0.04074,(GCF_001042525.2:0.03129,GCF_000156715.1:0.02948):0.00928):0.01491):0.05400):0.00937,((((GCF_012222965.1:0.02251,GCF_012222825.1:0.02799):0.02919,GCF_000815225.1:0.05958):0.06020,GCF_003290445.1:0.12008):0.00179,(GCF_003574475.2:0.02835,GCF_001880225.1:0.02762):0.09411):0.00152):0.01016,((GCF_003347135.1:0.03022,(GCF_001885235.1:0.05571,(GCF_001653955.1:0.02722,GCA_002095075.2:0.02693):0.03730):0.01687):0.01903,(((((((((GCF_000008985.1:0.00006,GCA_000746455.1:0.00010):0.00194,(GCA_016604655.1:0.00001,GCA_000018925.1:0.00001):0.00222):0.00452,GCA_000153845.1:0.00000):0.00090,GCA_000807905.1:0.00000):0.00019,GCA_002886065.1:0.00000):0.00007,GCA_000313385.1:0.00000):0.00041,GCA_015666825.2:0.00000):0.00033,GCA_014931515.1:0.00000):0.00070,GCA_000170295.1:0.00008):0.03469):0.02895):0.06610):0.11320,GCA_913060525.1:0.32717):0.05916,GCA_024397935.1:0.00000):0.08521,GCA_021713295.1:0.01198):0.12581,GCA_021713395.1:0.09073):0.00940,
GCF_003697165.2:0.29258):0.16971):0.00789,GCF_001522865.1:0.02337):0.05743,GCF_010645215.1:0.05901):0.01135,((GCF_001524625.2:0.03690,GCF_000959365.1:0.03502):0.00308,((GCF_001462435.1:0.02821,GCF_000012365.1:0.02923):0.00355,GCF_000011705.1:0.03223):0.00652):0.02459):0.00414,(((GCF_017104645.1:0.03576,GCA_009914575.1:0.06284):0.02007,(GCF_000959725.1:0.05207,GCA_009911875.1:0.11954):0.01275):0.00703,(GCF_001411805.1:0.03302,GCF_000960995.1:0.03789):0.02607):0.02386):0.00541,GCF_013403875.1:0.06317):0.00288,(GCF_000687455.2:0.06028,GCA_018375725.1:0.06568):0.00441):0.00746):0.00237,(GCF_001718695.1:0.02261,GCF_000959445.1:0.02359):0.02312):0.00545,(GCF_902499135.1:0.04734,(GCF_902499075.1:0.03394,GCF_000959525.1:0.04583):0.00682):0.00228):0.00541,((((((((GCF_902833055.1:0.02263,GCF_001718315.1:0.02349):0.00554,GCF_902499115.1:0.02562):0.00556,GCF_000949135.2:0.02784):0.00089,((GCF_003853645.1:0.02757,GCF_003853415.1:0.02584):0.00086,GCF_001443045.1:0.02962):0.00442):0.00124,GCF_000203915.1:0.03251):0.00166,((GCF_902499045.1:0.02972,GCF_001718795.1:0.03605):0.00799,GCF_003635165.1:0.03583):0.00158):0.00134,GCA_000292915.1:0.03388):0.00118,GCF_902498995.1:0.03897):0.00426):0.00534):0.00233):0.00095,(((GCF_003330765.1:0.03172,(GCF_001039005.1:0.03046,GCF_001028665.1:0.02839):0.00148):0.00053,GCF_001883705.2:0.03054):0.00147,GCF_001742165.1:0.03581):0.00126):0.00160,GCF_000732615.1:0.04045):0.00343,(GCF_011605345.1:0.02718,(GCF_001718835.1:0.03009,GCF_001411495.1:0.02175):0.00475):0.00333):0.00149,GCF_902832845.1:0.03328):0.00126,(GCF_902833045.1:0.02871,GCF_000987075.1:0.02905):0.00111):0.00061,((GCF_905232215.1:0.02735,GCF_902499175.1:0.02200):0.00817,(GCF_902499335.1:0.02312,GCF_000012945.1:0.02447):0.00079):0.00135);
... by using assign:
assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input test_genomes.dnd
... resulting in:
(Burkholderiaceae_Burkholderia_puraquae_GCF_002099195.1:0.02759,((((((((Burkholderiaceae_Burkholderia_sp902833225_GCF_902833225.1:0.02552,(Burkholderiaceae_Burkholderia_seminalis_GCF_902832885.1:0.02849,Burkholderiaceae_Burkholderia_sp003853705_GCF_003853705.1:0.02720):0.00068):0.00125,Burkholderiaceae_Burkholderia_arboris_GCF_902499125.1:0.02845):0.00479,((((Burkholderiaceae_Burkholderia_sp902833145_GCF_902833145.1:0.02595,Burkholderiaceae_Burkholderia_sp001522585_GCF_001522585.2:0.02633):0.00034,(Burkholderiaceae_Burkholderia_cenocepacia_GCF_900446215.1:0.02284,Burkholderiaceae_Burkholderia_orbicola_GCF_014211915.1:0.02371):0.00176):0.00020,(Burkholderiaceae_Burkholderia_sp002223185_GCF_016899425.1:0.02452,Burkholderiaceae_Burkholderia_anthina-A_GCF_001547525.1:0.02527):0.00213):0.01022,(((((((Burkholderiaceae_Burkholderia_ubonensis_GCF_902833085.1:0.02340,Burkholderiaceae_Burkholderia_mesacidophila_GCF_002097715.1:0.02327):0.01642,(Burkholderiaceae_Burkholderia_stagnalis_GCF_902830275.1:0.02767,Burkholderiaceae_Burkholderia_sp004343775_GCF_004343775.1:0.02945):0.01041):0.00417,Burkholderiaceae_Burkholderia_sp003854275_GCF_003854275.1:0.04945):0.00450,(((((((Burkholderiaceae_Burkholderia_singularis_GCF_900176645.1:0.01428,(((((((((((((Francisellaceae_Cysteiniphilum_marinum_GCF_014857925.1:0.02724,Francisellaceae_Cysteiniphilum_sp020629575_GCA_020629575.1:0.03144):0.02106,Francisellaceae_Cysteiniphilum_litorale_GCF_003428165.1:0.04207):0.01582,Francisellaceae_Cysteiniphilum_sp008711525_GCF_008711525.1:0.01004):0.02959,Francisellaceae_Cysteiniphilum_halobium_GCF_003574425.1:0.07988):0.04265,(Francisellaceae_Fastidiosibacter_lacustris_GCF_003428155.1:0.11606,Francisellaceae_Caedibacter_taeniospiralis_GCF_002803295.2:0.11648):0.01451):0.01625,Francisellaceae_Fangia_hongkongensis_GCF_000379445.1:0.15428):0.01906,Francisellaceae_Facilibium_subflavum_GCF_003574485.1:0.10935):0.02542,(((((((Francisellaceae_Francisella_sp012224145_GCF_012224145.1:0.04143,Francisellaceae_Fr
ancisella_sp008711465_GCF_008711465.1:0.05362):0.00972,Francisellaceae_Francisella_uliginis_GCF_001895265.1:0.05662):0.01825,Francisellaceae_Francisella_halioticida_GCF_002211785.1:0.06785):0.02218,Francisellaceae_Francisella_sp000764555_GCF_000764555.1:0.09965):0.01443,(Francisellaceae_Francisella_noatunensis_GCF_009823375.1:0.01017,(Francisellaceae_Francisella_salimarina_GCF_007923265.1:0.04074,(Francisellaceae_Francisella_orientalis_GCF_001042525.2:0.03129,Francisellaceae_Francisella_philomiragia_GCF_000156715.1:0.02948):0.00928):0.01491):0.05400):0.00937,((((Francisellaceae_Allofrancisella_inopinata_GCF_012222965.1:0.02251,Francisellaceae_Allofrancisella_frigidaquae_GCF_012222825.1:0.02799):0.02919,Francisellaceae_Allofrancisella_guangzhouensis_GCF_000815225.1:0.05958):0.06020,Francisellaceae_Francisella-A_adeliensis_GCF_003290445.1:0.12008):0.00179,(Francisellaceae_Pseudofrancisella_aestuarii_GCF_003574475.2:0.02835,Francisellaceae_Pseudofrancisella_frigiditurris_GCF_001880225.1:0.02762):0.09411):0.00152):0.01016,((Francisellaceae_Francisella_opportunistica_GCF_003347135.1:0.03022,(Francisellaceae_Francisella_hispaniensis_GCF_001885235.1:0.05571,(Francisellaceae_Francisella_persica_GCF_001653955.1:0.02722,Francisellaceae_Francisella_sp002095075_GCA_002095075.2:0.02693):0.03730):0.01687):0.01903,(((((((((Francisellaceae_Francisella_tularensis_GCF_000008985.1:0.00006,Francisellaceae_Francisella_tularensis_GCA_000746455.1:0.00010):0.00194,(Francisellaceae_Francisella_tularensis_GCA_016604655.1:0.00001,Francisellaceae_Francisella_tularensis_GCA_000018925.1:0.00001):0.00222):0.00452,Francisellaceae_Francisella_tularensis_GCA_000153845.1:0.00000):0.00090,Francisellaceae_Francisella_tularensis_GCA_000807905.1:0.00000):0.00019,Francisellaceae_Francisella_tularensis_GCA_002886065.1:0.00000):0.00007,Francisellaceae_Francisella_tularensis_GCA_000313385.1:0.00000):0.00041,Francisellaceae_Francisella_tularensis_GCA_015666825.2:0.00000):0.00033,Francisellaceae_Francisella_tu
larensis_GCA_014931515.1:0.00000):0.00070,Francisellaceae_Francisella_tularensis_GCA_000170295.1:0.00008):0.03469):0.02895):0.06610):0.11320,Francisellaceae_CAJXRW01_sp913060525_GCA_913060525.1:0.32717):0.05916,Francisellaceae_M0027_sp006227905_GCA_024397935.1:0.00000):0.08521,Francisellaceae_M0027_sp021713295_GCA_021713295.1:0.01198):0.12581,Francisellaceae_M0027_sp021713395_GCA_021713395.1:0.09073):0.00940,Enterobacteriaceae_Escherichia_coli_GCF_003697165.2:0.29258):0.16971):0.00789,Burkholderiaceae_Burkholderia_sp001522865_GCF_001522865.1:0.02337):0.05743,Burkholderiaceae_Burkholderia_multivorans-A_GCF_010645215.1:0.05901):0.01135,((Burkholderiaceae_Burkholderia_savannae_GCF_001524625.2:0.03690,Burkholderiaceae_Burkholderia_oklahomensis_GCF_000959365.1:0.03502):0.00308,((Burkholderiaceae_Burkholderia_humptydooensis_GCF_001462435.1:0.02821,Burkholderiaceae_Burkholderia_thailandensis_GCF_000012365.1:0.02923):0.00355,Burkholderiaceae_Burkholderia_mallei_GCF_000011705.1:0.03223):0.00652):0.02459):0.00414,(((Burkholderiaceae_Burkholderia_sp017104645_GCF_017104645.1:0.03576,Burkholderiaceae_Burkholderia_gladioli-B_GCA_009914575.1:0.06284):0.02007,(Burkholderiaceae_Burkholderia_gladioli_GCF_000959725.1:0.05207,Burkholderiaceae_Burkholderia_gladioli-A_GCA_009911875.1:0.11954):0.01275):0.00703,(Burkholderiaceae_Burkholderia_plantarii_GCF_001411805.1:0.03302,Burkholderiaceae_Burkholderia_glumae_GCF_000960995.1:0.03789):0.02607):0.02386):0.00541,Burkholderiaceae_Burkholderia_guangdongensis_GCF_013403875.1:0.06317):0.00288,(Burkholderiaceae_Burkholderia_sp000687455_GCF_000687455.2:0.06028,Burkholderiaceae_Burkholderia_sp018375725_GCA_018375725.1:0.06568):0.00441):0.00746):0.00237,(Burkholderiaceae_Burkholderia_ubonensis-B_GCF_001718695.1:0.02261,Burkholderiaceae_Burkholderia_vietnamiensis_GCF_000959445.1:0.02359):0.02312):0.00545,(Burkholderiaceae_Burkholderia_dolosa_GCF_902499135.1:0.04734,(Burkholderiaceae_Burkholderia_pseudomultivorans_GCF_902499075.1:0.03394,Burkholderia
ceae_Burkholderia_multivorans_GCF_000959525.1:0.04583):0.00682):0.00228):0.00541,((((((((Burkholderiaceae_Burkholderia_territorii_GCF_902833055.1:0.02263,Burkholderiaceae_Burkholderia_diffusa-B_GCF_001718315.1:0.02349):0.00554,Burkholderiaceae_Burkholderia_diffusa_GCF_902499115.1:0.02562):0.00556,Burkholderiaceae_Burkholderia_sp000949135_GCF_000949135.2:0.02784):0.00089,((Burkholderiaceae_Burkholderia_sp003853645_GCF_003853645.1:0.02757,Burkholderiaceae_Burkholderia_sp003853415_GCF_003853415.1:0.02584):0.00086,Burkholderiaceae_Burkholderia_ambifaria-A_GCF_001443045.1:0.02962):0.00442):0.00124,Burkholderiaceae_Burkholderia_ambifaria_GCF_000203915.1:0.03251):0.00166,((Burkholderiaceae_Burkholderia_latens_GCF_902499045.1:0.02972,Burkholderiaceae_Burkholderia_latens-A_GCF_001718795.1:0.03605):0.00799,Burkholderiaceae_Burkholderia_sp003635165_GCF_003635165.1:0.03583):0.00158):0.00134,Burkholderiaceae_Burkholderia_cepacia-D_GCA_000292915.1:0.03388):0.00118,Burkholderiaceae_Burkholderia_anthina_GCF_902498995.1:0.03897):0.00426):0.00534):0.00233):0.00095,(((Burkholderiaceae_Burkholderia_pyrrocinia-B_GCF_003330765.1:0.03172,(Burkholderiaceae_Burkholderia_cepacia-C_GCF_001039005.1:0.03046,Burkholderiaceae_Burkholderia_pyrrocinia_GCF_001028665.1:0.02839):0.00148):0.00053,Burkholderiaceae_Burkholderia_catarinensis_GCF_001883705.2:0.03054):0.00147,Burkholderiaceae_Burkholderia_stabilis_GCF_001742165.1:0.03581):0.00126):0.00160,Burkholderiaceae_Burkholderia_paludis_GCF_000732615.1:0.04045):0.00343,(Burkholderiaceae_Burkholderia_sp011605345_GCF_011605345.1:0.02718,(Burkholderiaceae_Burkholderia_cepacia-F_GCF_001718835.1:0.03009,Burkholderiaceae_Burkholderia_cepacia_GCF_001411495.1:0.02175):0.00475):0.00333):0.00149,Burkholderiaceae_Burkholderia_metallica_GCF_902832845.1:0.03328):0.00126,(Burkholderiaceae_Burkholderia_sp902833045_GCF_902833045.1:0.02871,Burkholderiaceae_Burkholderia_contaminans_GCF_000987075.1:0.02905):0.00111):0.00061,((Burkholderiaceae_Burkholderia_sp905232215_GC
F_905232215.1:0.02735,Burkholderiaceae_Burkholderia_aenigmatica_GCF_902499175.1:0.02200):0.00817,(Burkholderiaceae_Burkholderia_lata-B_GCF_902499335.1:0.02312,Burkholderiaceae_Burkholderia_lata_GCF_000012945.1:0.02447):0.00079):0.00135);
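The transformation above amounts to finding accession numbers inside arbitrary text and swapping in a label. A minimal sketch, assuming assembly accessions of the form GCA_/GCF_ plus nine digits and a version (the regular expression and the function name relabel are illustrative, not taken from FlexMetR; GCF_000000000.1 is a made-up accession used to show an unmatched entry):

```python
import re

# Assumed pattern for assembly accessions such as GCF_002099195.1
ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")

def relabel(text, labels):
    """Replace every accession found in text by its label, if one exists."""
    return ACCESSION.sub(lambda m: labels.get(m.group(0), m.group(0)), text)

labels = {
    "GCF_002099195.1": "Burkholderiaceae_Burkholderia_puraquae_GCF_002099195.1",
    "GCF_902833225.1": "Burkholderiaceae_Burkholderia_sp902833225_GCF_902833225.1",
}
# Small Newick fragment; branch lengths are left untouched
tree = "(GCF_002099195.1:0.02759,(GCF_902833225.1:0.02552,GCF_000000000.1:0.1):0.00125);"
renamed = relabel(tree, labels)
print(renamed)
```

Because the match is made on the accession pattern rather than on the tree syntax, the same function works for file listings or any other text stream.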
The following command, using streamed input from cat, produces the same output, here written to a file:
cat test_genomes.dnd | assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input - --output readable_tree.dnd
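The convention of passing "-" to read from standard input is common among command-line tools. A minimal sketch of how such an argument could be handled in Python (the helper open_input and its stdin parameter are hypothetical and only for illustration; how assign.py handles this internally is not shown here):

```python
import io
import sys
from contextlib import nullcontext

def open_input(path, stdin=None):
    """Return a context manager that yields a readable text stream.

    "-" means standard input; anything else is opened as a file.
    The stdin parameter exists only so the sketch can be tried without a pipe.
    """
    if path == "-":
        return nullcontext(stdin if stdin is not None else sys.stdin)
    return open(path, "r", encoding="utf-8")

# Simulate `ls test_genomes | ... --input -` with an in-memory stream
fake_stdin = io.StringIO("GCA_000018925.1.fasta\nGCA_000153845.1.fasta\n")
with open_input("-", stdin=fake_stdin) as handle:
    names = [line.strip() for line in handle]
print(names)
```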
assign can be used in pipes with other software (pass "-" as the input argument to read from standard input). The example below lists the contents of a folder of FASTA files:
# Input directory files
ls test_genomes
GCA_000018925.1.fasta GCA_000153845.1.fasta GCA_000170295.1.fasta GCA_000292915.1.fasta GCA_000313385.1.fasta GCA_000746455.1.fasta GCA_000807905.1.fasta GCA_002095075.2.fasta GCA_002886065.1.fasta GCA_009911875.1.fasta GCA_009914575.1.fasta GCA_014931515.1.fasta GCA_015666825.2.fasta GCA_016604655.1.fasta GCA_018375725.1.fasta GCA_020629575.1.fasta GCA_021713295.1.fasta GCA_021713395.1.fasta GCA_024397935.1.fasta GCA_913060525.1.fasta GCF_000008985.1.fasta GCF_000011705.1.fasta GCF_000012365.1.fasta GCF_000012945.1.fasta GCF_000156715.1.fasta GCF_000203915.1.fasta GCF_000379445.1.fasta GCF_000687455.2.fasta GCF_000732615.1.fasta GCF_000764555.1.fasta GCF_000815225.1.fasta GCF_000949135.2.fasta GCF_000959365.1.fasta GCF_000959445.1.fasta GCF_000959525.1.fasta GCF_000959725.1.fasta GCF_000960995.1.fasta GCF_000987075.1.fasta GCF_001028665.1.fasta GCF_001039005.1.fasta GCF_001042525.2.fasta GCF_001411495.1.fasta GCF_001411805.1.fasta GCF_001443045.1.fasta GCF_001462435.1.fasta GCF_001522585.2.fasta GCF_001522865.1.fasta GCF_001524625.2.fasta GCF_001547525.1.fasta GCF_001653955.1.fasta GCF_001718315.1.fasta GCF_001718695.1.fasta GCF_001718795.1.fasta GCF_001718835.1.fasta GCF_001742165.1.fasta GCF_001880225.1.fasta GCF_001883705.2.fasta GCF_001885235.1.fasta GCF_001895265.1.fasta GCF_002097715.1.fasta GCF_002099195.1.fasta GCF_002211785.1.fasta GCF_002803295.2.fasta GCF_003290445.1.fasta GCF_003330765.1.fasta GCF_003347135.1.fasta GCF_003428155.1.fasta GCF_003428165.1.fasta GCF_003574425.1.fasta GCF_003574475.2.fasta GCF_003574485.1.fasta GCF_003635165.1.fasta GCF_003697165.2.fasta GCF_003853415.1.fasta GCF_003853645.1.fasta GCF_003853705.1.fasta GCF_003854275.1.fasta GCF_004343775.1.fasta GCF_007923265.1.fasta GCF_008711465.1.fasta GCF_008711525.1.fasta GCF_009823375.1.fasta GCF_010645215.1.fasta GCF_011605345.1.fasta GCF_012222825.1.fasta GCF_012222965.1.fasta GCF_012224145.1.fasta GCF_013403875.1.fasta GCF_014211915.1.fasta GCF_014857925.1.fasta 
GCF_016899425.1.fasta GCF_017104645.1.fasta GCF_900176645.1.fasta GCF_900446215.1.fasta GCF_902498995.1.fasta GCF_902499045.1.fasta GCF_902499075.1.fasta GCF_902499115.1.fasta GCF_902499125.1.fasta GCF_902499135.1.fasta GCF_902499175.1.fasta GCF_902499335.1.fasta GCF_902830275.1.fasta GCF_902832845.1.fasta GCF_902832885.1.fasta GCF_902833045.1.fasta GCF_902833055.1.fasta GCF_902833085.1.fasta GCF_902833145.1.fasta GCF_902833225.1.fasta GCF_905232215.1.fasta
# List files in the input directory and transform the output
ls test_genomes | ./assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input -
Francisellaceae_Francisella_tularensis_GCA_000018925.1.fasta Francisellaceae_Francisella_tularensis_GCA_000153845.1.fasta Francisellaceae_Francisella_tularensis_GCA_000170295.1.fasta Burkholderiaceae_Burkholderia_cepacia-D_GCA_000292915.1.fasta Francisellaceae_Francisella_tularensis_GCA_000313385.1.fasta Francisellaceae_Francisella_tularensis_GCA_000746455.1.fasta Francisellaceae_Francisella_tularensis_GCA_000807905.1.fasta Francisellaceae_Francisella_sp002095075_GCA_002095075.2.fasta Francisellaceae_Francisella_tularensis_GCA_002886065.1.fasta Burkholderiaceae_Burkholderia_gladioli-A_GCA_009911875.1.fasta Burkholderiaceae_Burkholderia_gladioli-B_GCA_009914575.1.fasta Francisellaceae_Francisella_tularensis_GCA_014931515.1.fasta Francisellaceae_Francisella_tularensis_GCA_015666825.2.fasta Francisellaceae_Francisella_tularensis_GCA_016604655.1.fasta Burkholderiaceae_Burkholderia_sp018375725_GCA_018375725.1.fasta Francisellaceae_Cysteiniphilum_sp020629575_GCA_020629575.1.fasta Francisellaceae_M0027_sp021713295_GCA_021713295.1.fasta Francisellaceae_M0027_sp021713395_GCA_021713395.1.fasta Francisellaceae_M0027_sp006227905_GCA_024397935.1.fasta Francisellaceae_CAJXRW01_sp913060525_GCA_913060525.1.fasta Francisellaceae_Francisella_tularensis_GCF_000008985.1.fasta Burkholderiaceae_Burkholderia_mallei_GCF_000011705.1.fasta Burkholderiaceae_Burkholderia_thailandensis_GCF_000012365.1.fasta Burkholderiaceae_Burkholderia_lata_GCF_000012945.1.fasta Francisellaceae_Francisella_philomiragia_GCF_000156715.1.fasta Burkholderiaceae_Burkholderia_ambifaria_GCF_000203915.1.fasta Francisellaceae_Fangia_hongkongensis_GCF_000379445.1.fasta Burkholderiaceae_Burkholderia_sp000687455_GCF_000687455.2.fasta Burkholderiaceae_Burkholderia_paludis_GCF_000732615.1.fasta Francisellaceae_Francisella_sp000764555_GCF_000764555.1.fasta Francisellaceae_Allofrancisella_guangzhouensis_GCF_000815225.1.fasta Burkholderiaceae_Burkholderia_sp000949135_GCF_000949135.2.fasta 
Burkholderiaceae_Burkholderia_oklahomensis_GCF_000959365.1.fasta Burkholderiaceae_Burkholderia_vietnamiensis_GCF_000959445.1.fasta Burkholderiaceae_Burkholderia_multivorans_GCF_000959525.1.fasta Burkholderiaceae_Burkholderia_gladioli_GCF_000959725.1.fasta Burkholderiaceae_Burkholderia_glumae_GCF_000960995.1.fasta Burkholderiaceae_Burkholderia_contaminans_GCF_000987075.1.fasta Burkholderiaceae_Burkholderia_pyrrocinia_GCF_001028665.1.fasta Burkholderiaceae_Burkholderia_cepacia-C_GCF_001039005.1.fasta Francisellaceae_Francisella_orientalis_GCF_001042525.2.fasta Burkholderiaceae_Burkholderia_cepacia_GCF_001411495.1.fasta Burkholderiaceae_Burkholderia_plantarii_GCF_001411805.1.fasta Burkholderiaceae_Burkholderia_ambifaria-A_GCF_001443045.1.fasta Burkholderiaceae_Burkholderia_humptydooensis_GCF_001462435.1.fasta Burkholderiaceae_Burkholderia_sp001522585_GCF_001522585.2.fasta Burkholderiaceae_Burkholderia_sp001522865_GCF_001522865.1.fasta Burkholderiaceae_Burkholderia_savannae_GCF_001524625.2.fasta Burkholderiaceae_Burkholderia_anthina-A_GCF_001547525.1.fasta Francisellaceae_Francisella_persica_GCF_001653955.1.fasta Burkholderiaceae_Burkholderia_diffusa-B_GCF_001718315.1.fasta Burkholderiaceae_Burkholderia_ubonensis-B_GCF_001718695.1.fasta Burkholderiaceae_Burkholderia_latens-A_GCF_001718795.1.fasta Burkholderiaceae_Burkholderia_cepacia-F_GCF_001718835.1.fasta Burkholderiaceae_Burkholderia_stabilis_GCF_001742165.1.fasta Francisellaceae_Pseudofrancisella_frigiditurris_GCF_001880225.1.fasta Burkholderiaceae_Burkholderia_catarinensis_GCF_001883705.2.fasta Francisellaceae_Francisella_hispaniensis_GCF_001885235.1.fasta Francisellaceae_Francisella_uliginis_GCF_001895265.1.fasta Burkholderiaceae_Burkholderia_mesacidophila_GCF_002097715.1.fasta Burkholderiaceae_Burkholderia_puraquae_GCF_002099195.1.fasta Francisellaceae_Francisella_halioticida_GCF_002211785.1.fasta Francisellaceae_Caedibacter_taeniospiralis_GCF_002803295.2.fasta 
Francisellaceae_Francisella-A_adeliensis_GCF_003290445.1.fasta Burkholderiaceae_Burkholderia_pyrrocinia-B_GCF_003330765.1.fasta Francisellaceae_Francisella_opportunistica_GCF_003347135.1.fasta Francisellaceae_Fastidiosibacter_lacustris_GCF_003428155.1.fasta Francisellaceae_Cysteiniphilum_litorale_GCF_003428165.1.fasta Francisellaceae_Cysteiniphilum_halobium_GCF_003574425.1.fasta Francisellaceae_Pseudofrancisella_aestuarii_GCF_003574475.2.fasta Francisellaceae_Facilibium_subflavum_GCF_003574485.1.fasta Burkholderiaceae_Burkholderia_sp003635165_GCF_003635165.1.fasta Enterobacteriaceae_Escherichia_coli_GCF_003697165.2.fasta Burkholderiaceae_Burkholderia_sp003853415_GCF_003853415.1.fasta Burkholderiaceae_Burkholderia_sp003853645_GCF_003853645.1.fasta Burkholderiaceae_Burkholderia_sp003853705_GCF_003853705.1.fasta Burkholderiaceae_Burkholderia_sp003854275_GCF_003854275.1.fasta Burkholderiaceae_Burkholderia_sp004343775_GCF_004343775.1.fasta Francisellaceae_Francisella_salimarina_GCF_007923265.1.fasta Francisellaceae_Francisella_sp008711465_GCF_008711465.1.fasta Francisellaceae_Cysteiniphilum_sp008711525_GCF_008711525.1.fasta Francisellaceae_Francisella_noatunensis_GCF_009823375.1.fasta Burkholderiaceae_Burkholderia_multivorans-A_GCF_010645215.1.fasta Burkholderiaceae_Burkholderia_sp011605345_GCF_011605345.1.fasta Francisellaceae_Allofrancisella_frigidaquae_GCF_012222825.1.fasta Francisellaceae_Allofrancisella_inopinata_GCF_012222965.1.fasta Francisellaceae_Francisella_sp012224145_GCF_012224145.1.fasta Burkholderiaceae_Burkholderia_guangdongensis_GCF_013403875.1.fasta Burkholderiaceae_Burkholderia_orbicola_GCF_014211915.1.fasta Francisellaceae_Cysteiniphilum_marinum_GCF_014857925.1.fasta Burkholderiaceae_Burkholderia_sp002223185_GCF_016899425.1.fasta Burkholderiaceae_Burkholderia_sp017104645_GCF_017104645.1.fasta Burkholderiaceae_Burkholderia_singularis_GCF_900176645.1.fasta Burkholderiaceae_Burkholderia_cenocepacia_GCF_900446215.1.fasta 
Burkholderiaceae_Burkholderia_anthina_GCF_902498995.1.fasta Burkholderiaceae_Burkholderia_latens_GCF_902499045.1.fasta Burkholderiaceae_Burkholderia_pseudomultivorans_GCF_902499075.1.fasta Burkholderiaceae_Burkholderia_diffusa_GCF_902499115.1.fasta Burkholderiaceae_Burkholderia_arboris_GCF_902499125.1.fasta Burkholderiaceae_Burkholderia_dolosa_GCF_902499135.1.fasta Burkholderiaceae_Burkholderia_aenigmatica_GCF_902499175.1.fasta Burkholderiaceae_Burkholderia_lata-B_GCF_902499335.1.fasta Burkholderiaceae_Burkholderia_stagnalis_GCF_902830275.1.fasta Burkholderiaceae_Burkholderia_metallica_GCF_902832845.1.fasta Burkholderiaceae_Burkholderia_seminalis_GCF_902832885.1.fasta Burkholderiaceae_Burkholderia_sp902833045_GCF_902833045.1.fasta Burkholderiaceae_Burkholderia_territorii_GCF_902833055.1.fasta Burkholderiaceae_Burkholderia_ubonensis_GCF_902833085.1.fasta Burkholderiaceae_Burkholderia_sp902833145_GCF_902833145.1.fasta Burkholderiaceae_Burkholderia_sp902833225_GCF_902833225.1.fasta Burkholderiaceae_Burkholderia_sp905232215_GCF_905232215.1.fasta
In this example, the files of a directory are listed. That stream is first renamed to "family_genus_species_accession". The stream is then renamed a second time to "accession-species", with a dash as separator instead of an underscore. For the second step, --clean_names is needed so that the software removes any text adjacent to the accession number:
# List files in the input directory
ls test_genomes | head -n5
GCA_000018925.1.fasta
GCA_000153845.1.fasta
GCA_000170295.1.fasta
GCA_000292915.1.fasta
GCA_000313385.1.fasta
# List files in the input directory and transform the output
ls test_genomes | head -n5 | ./assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input -
Francisellaceae_Francisella_tularensis_GCA_000018925.1.fasta
Francisellaceae_Francisella_tularensis_GCA_000153845.1.fasta
Francisellaceae_Francisella_tularensis_GCA_000170295.1.fasta
Burkholderiaceae_Burkholderia_cepacia-D_GCA_000292915.1.fasta
Francisellaceae_Francisella_tularensis_GCA_000313385.1.fasta
# List files in the input directory and transform the output again with another output format
ls test_genomes | head -n5 | \
./assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input - | \
./assign.py --metadata_file metadata.tsv --column "#accession-#species" --input - --clean_names
GCA_000018925.1-tularensis.fasta
GCA_000153845.1-tularensis.fasta
GCA_000170295.1-tularensis.fasta
GCA_000292915.1-cepacia-D.fasta
GCA_000313385.1-tularensis.fasta
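What --clean_names does in the example above can be approximated as follows. This sketch is an assumption based only on the output shown: text attached in front of the accession is dropped, while the trailing file extension is kept. The regular expression and the function name clean_and_relabel are illustrative, not FlexMetR's implementation.

```python
import re

# Optional run of name characters directly in front of an assembly accession
PREFIXED_ACCESSION = re.compile(r"[A-Za-z0-9_-]*(GC[AF]_\d{9}\.\d+)")

def clean_and_relabel(name, labels):
    """Drop text attached in front of the accession, then insert the new label.

    Names whose accession has no label are returned unchanged.
    """
    def substitute(match):
        accession = match.group(1)
        return labels.get(accession, match.group(0))
    return PREFIXED_ACCESSION.sub(substitute, name)

# Labels as produced by the formatter "#accession-#species"
labels = {
    "GCA_000018925.1": "GCA_000018925.1-tularensis",
    "GCA_000292915.1": "GCA_000292915.1-cepacia-D",
}
first = clean_and_relabel(
    "Francisellaceae_Francisella_tularensis_GCA_000018925.1.fasta", labels
)
second = clean_and_relabel(
    "Burkholderiaceae_Burkholderia_cepacia-D_GCA_000292915.1.fasta", labels
)
print(first)
print(second)
```

Without the cleaning step, the old "family_genus_species_" prefix would remain in front of the new label.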
Suppose there are files in the directory test_genomes. These files can be copied under new names into a new directory:
# Show contents of input directory
find test_genomes -name "*fasta"
test_genomes/GCF_000687455.2.fasta
test_genomes/GCF_001039005.1.fasta
..
..
test_genomes/GCA_009914575.1.fasta
test_genomes/GCF_001653955.1.fasta
# Find and copy files as renamed files in new folder
find test_genomes -name "*fasta" | ./assign.py --metadata_file metadata.tsv --columns "#family_#genus_#species_#accession" --input - --rename_files_dir test_genomes_renamed
# Show contents of output directory
ls test_genomes_renamed
Burkholderiaceae_Burkholderia_sp000687455_GCF_000687455.2.fasta
Burkholderiaceae_Burkholderia_cepacia-C_GCF_001039005.1.fasta
..
..
Burkholderiaceae_Burkholderia_gladioli-B_GCA_009914575.1.fasta
Francisellaceae_Francisella_persica_GCF_001653955.1.fasta
Alternatively, the find-to-assign pipe can be skipped by asking assign to interpret the input as a path and scan its contents:
./assign.py --metadata_file metadata.tsv --columns "#family_#genus_#species_#accession" --input test_genomes --scan_input --rename_files_dir test_genomes_renamed
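The copy-and-rename step can be pictured with a short Python sketch. It assumes the same accession pattern as the examples in this section; the function copy_renamed is hypothetical, and the demo builds a throwaway directory with a placeholder FASTA file rather than using real data.

```python
import re
import shutil
import tempfile
from pathlib import Path

ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")

def copy_renamed(src_dir, dst_dir, labels):
    """Copy each file in src_dir to dst_dir, replacing accessions in the
    file name by their label. Source files are left in place."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        new_name = ACCESSION.sub(
            lambda m: labels.get(m.group(0), m.group(0)), path.name
        )
        shutil.copy2(path, dst / new_name)
        copied.append(new_name)
    return copied

labels = {"GCF_001653955.1": "Francisellaceae_Francisella_persica_GCF_001653955.1"}
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test_genomes"
    src.mkdir()
    # Placeholder content, only so that there is a file to copy
    (src / "GCF_001653955.1.fasta").write_text(">placeholder\nACGT\n")
    result = copy_renamed(src, Path(tmp) / "test_genomes_renamed", labels)
    source_still_exists = (src / "GCF_001653955.1.fasta").exists()
print(result)
```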
Accessions that do not have a matching entry in the database are by default passed to the output without formatting. The --notify_missing argument reports the number of matched and missing accessions in the input. Missing accession numbers can be skipped with --skip_missing or printed as the output with --print_missing.
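The handling of missing accessions can be sketched as below. The function assign_labels and its skip_missing parameter only mirror the behaviour described in the paragraph above and are not FlexMetR's code; GCA_999999999.1 is a made-up accession standing in for an entry without metadata.

```python
def assign_labels(accessions, labels, skip_missing=False):
    """Return (output, missing).

    Matched accessions are replaced by their label. Unmatched accessions are
    passed through unchanged by default, or dropped when skip_missing is True.
    """
    output, missing = [], []
    for accession in accessions:
        if accession in labels:
            output.append(labels[accession])
        else:
            missing.append(accession)
            if not skip_missing:
                output.append(accession)
    return output, missing

labels = {"GCA_000018925.1": "Francisellaceae_Francisella_tularensis_GCA_000018925.1"}
accessions = ["GCA_000018925.1", "GCA_999999999.1"]

default_out, missing = assign_labels(accessions, labels)
skipped_out, _ = assign_labels(accessions, labels, skip_missing=True)

# Summary comparable in spirit to what a notification option could report
print(f"matched: {len(accessions) - len(missing)}, missing: {len(missing)}")
```

Returning the missing list separately makes it straightforward to either report only a count or print the missing accessions themselves.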
assign can also produce an iTOL naming file:
assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input test_genomes.dnd --itol_names
Or a coloring file (here coloring by family):
assign.py --metadata_file metadata.tsv --column "#family" --input test_genomes.dnd --itol_colors
The input need not be a tree file; it can also be a folder listing or the contents of a file:
ls my_interesting_genome_files/ | assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input - --itol_names
grep "francisella" metadata.tsv | cut -f1 | assign.py --metadata_file metadata.tsv --column "#family_#genus#_#species_#accession" --input - --itol_names
If the accessions parsed from an input should match node names in the tree that are not bare accession numbers (e.g., they are named <family>_<genus>_<species>_<accession>), then --keep_input_names writes the full input name to the iTOL mapping files.
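For orientation, a naming file of this kind can be sketched in Python, assuming iTOL's LABELS annotation layout (a LABELS header, a SEPARATOR line, and a DATA section with one "node ID, label" pair per row). The function itol_labels is hypothetical, and the exact file that assign writes may differ from this sketch.

```python
def itol_labels(node_labels):
    """Build the text of a tab-separated iTOL LABELS annotation file.

    node_labels maps the node ID as it appears in the tree to the new label.
    """
    lines = ["LABELS", "SEPARATOR TAB", "DATA"]
    lines += [f"{node}\t{label}" for node, label in node_labels.items()]
    return "\n".join(lines) + "\n"

# Node IDs are bare accessions here; if the tree nodes carry full names,
# those full names would be used as keys instead
node_labels = {
    "GCF_002099195.1": "Burkholderiaceae_Burkholderia_puraquae_GCF_002099195.1",
    "GCF_003697165.2": "Enterobacteriaceae_Escherichia_coli_GCF_003697165.2",
}
annotation = itol_labels(node_labels)
print(annotation)
```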
Figure: Shown above is the raw tree
Figure: Shown above is the tree after importing ITOL naming file and coloring file
Custom identifiers are supplied with the --id_list argument. One identifier per row is expected.
Download the genome FASTA and metadata CSV files from BV-BRC. The example phylogeny was computed using MAFFT and IQ-TREE:
Figure: Tree view of IQ-TREE .treefile
Using FlexMetR and the BV-BRC metadata CSV, add information to the treefile:
# Get IDs from downloaded fasta (FlexMetR needs to know which IDs to look for in the metadata file)
grep ">" BVBRC_orthopoxvirus.fasta | cut -d ' ' -f1 | cut -d '|' -f2 > ids.list
# Add metadata with FlexMetR using BV-BRC CSV file (genome accessions exist in column 43)
flexmetr_alpha assign --input BVBRC_orthopoxvirus.treefile --id_list ids.list --metadata_file BVBRC_orthopoxvirus.csv --metadata_file_sep , --metadata_file_accession 43 --cols "#Genome Name-#GenBank Accessions" --clean_bvbrc > BVBRC_orthopoxvirus.renamed.treefile
Figure: Tree view with added metadata
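The ID-extraction step above (grep, then two cut calls) can also be written in Python, which may be easier to adapt when headers deviate from the expected layout. This is a sketch: the function fasta_ids is hypothetical, and the header in the demo is a made-up placeholder rather than a real record.

```python
def fasta_ids(lines):
    """Mimic: grep ">" | cut -d ' ' -f1 | cut -d '|' -f2

    Takes the first space-delimited field of each header line and returns its
    second '|'-delimited part. Like cut, a field without '|' is kept whole.
    Unlike grep ">", only lines that start with ">" are considered.
    """
    ids = []
    for line in lines:
        if not line.startswith(">"):
            continue
        first_field = line.rstrip("\n").split(" ")[0]
        parts = first_field.split("|")
        ids.append(parts[1] if len(parts) > 1 else first_field)
    return ids

# Placeholder records, for illustration only
records = [
    ">accn|ABC123.1 example description\n",
    "ACGTACGT\n",
    ">no_pipe_header another description\n",
    "TTGA\n",
]
ids = fasta_ids(records)
print("\n".join(ids))
```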