Calculates a ranking of hmm profiles per sequence.
Given a set of hmmsearch "tblout" result files, this R script outputs a table where each sequence hits against a set of hmm profiles are ranked based on score.
After installing the package with Conda, you can run the script with--help to
get help...
$ conda install -c matricaria hmmrank # (Better in an environment, see Conda documentation)
$ hmmrank.r --help
Usage: ../R/hmmrank.r [options] file0.tblout .. filen.tblout
Ranks tabular output from hmmsearch (HMMER package). Optionally filters based on score and adds annotation information.
This is a brief overview of options, check out https://github.com/erikrikarddaniel/hmmrank/blob/master/README.md for more instructions.
Options:
--annottable=ANNOTTABLE
Name of table with annotation assignments, at a minium must contain 'protein' and 'profile' columns. The 'profile' column might contain multiple profiles, separated by '&'.
--maxscore=MAXSCORE
Maximum sequence score; overrides the sequence score in scorefile if higher than this value, default Inf
--minscore=MINSCORE
Minimum score, default 0
--only_best_scoring
Output only the best scoring, i.e. scores < 2
--outfile=OUTFILE
Name of output file
--qfromfname
Extract query name from file name -- basename until first period -- instead of taking it from the query column in each data file, default false.
--scorefile=SCOREFILE
Name of tsv file containing accept scores (GA in score_type): profile, score_type and seq_score
-v, --verbose
Print progress messages
-V, --version
Print script version number
-h, --help
Show this help message and exit
A file for the --scorefile parameter is a tab separated file looking like this:
profile score_type seq_score domain_score
PF00590 GA 27.80 27.80
PF06180 GA 26.20 26.20
TIGR01466 GA 100.00 100.00
TIGR01467 GA 100.00 100.00
TIGR01469 GA 100.00 100.00
It can be created from a set of Pfam and TIGRFAM hmm files like this:
echo -e "profile\tscore_type\tseq_score\tdomain_score" > scores.tsv; grep '^GA' *.hmm | sed 's/.hmm:/\t/' | sed 's/ \+/\t/g' | sed 's/;//' >> scores.tsv
(If there are other hmm files in the directory, that do not contain GA scores, it doesn't hurt, but they will not be included in the output.)
The above example and command implies that you use the --qfromfname option to the script and that
your output files are named after the profile, minus the .hmm and plus .tblout, e.g.
PF00590.tblout
The score file doesn't have to include all profiles, any that are not there will be set to 0 or
whatever you specify with the --minscore parameter to the script.
If you have a lot of fragmentary sequences, scores in hmm files can be a little too strict (e.g.
TIGRFAMs). You can use --maxscore to limit the minimum score globally. In the output, the correct
score will appear so you can filter afterwards.
You can provide a table with annotation data using the --annottable option. The table must contain
a protein and a profile column (names are case sensitive), but can contain any number of other
columns, which will be added to the output table.
The profile column may contain entries like TIGR01466 & TIGR01467 which will be taken as an
entry that requires both profiles to match a sequence. Scores for the two matches will be summed and
the combined entry will hence receive a better score than any of the two by themselves and the
combination will be ranked higher.
Note that combinations that are observed in the data but not mentioned in the annotation table will
be disregarded. E.g., if PF00590 and TIGR01469 both appear in the annotation table, proteins
that match both profiles will be assigned to whichever of the two they match best.
Also note that score filtering, if asked for, takes place before evaluating combinations. Long
proteins that match multiple profiles may be penalized by this if the target sequence is a fragment.
You may want to lower high minimum scores to deal with this problem; take a particular look at
TIGRFAMs since they typically are long proteins with high default GA scores. See also description of
--maxscore above.