This is an early-development version of TRASH 2, an update of the TRASH software (https://github.com/vlothec/TRASH) for repeat identification.
- Classification of repeats to repeat families/classes across the fasta file
- Better mapping of very short and very long repeats
- Additional polishing steps for the repeats found at array edges
- Better parallelisation
- Better error diagnostics and runtime progress reporting
- Full re-write with updates to all algorithms
- R needs to be installed.
- mafft and nhmmer need to be installed.
- Running TRASH.R from the /src/ directory for the first time will install the required R packages (if they're missing). See below for the required run settings.
TRASH.R needs to be called directly from its directory, or added to the PATH variable for easy access.
If TRASH.R does not execute, add execute permissions with chmod +x ./TRASH.R. Using Rscript ./TRASH.R might be necessary if the R code is not being recognised.
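The two ways of launching the script described above can be combined in a small wrapper. This is only a sketch: the function name run_trash and the variable script are hypothetical helpers, not part of TRASH, and the wrapper assumes it is called from the directory that contains TRASH.R.

```shell
# Path to the script; assumes the current directory is src/.
script=./TRASH.R

run_trash() {
    # Prefer direct execution. If the file exists but is not executable
    # (chmod +x has not been applied), fall back to Rscript.
    if [ -x "$script" ]; then
        "$script" "$@"
    elif [ -f "$script" ]; then
        Rscript "$script" "$@"
    else
        echo "TRASH.R not found at $script" >&2
        return 1
    fi
}
```

All arguments given to run_trash are passed through unchanged, so it can be called with the same options as TRASH.R itself.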
mafft and nhmmer need to be installed and added to the PATH variable. Alternatively, both can be installed locally and their paths added to the src/main.R script, replacing lines 12 and 13 on Windows or 15 and 16 on Linux.
A Windows installation of nhmmer requires a Unix-like environment interface such as Cygwin.
A Windows version of mafft is available and can be used by uncommenting line 10 of the src/main.R script.
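Before the first run it can help to confirm that the external tools are visible on the PATH. The check below is a minimal sketch using the standard command -v lookup; the function name missing_tools is a hypothetical helper and not part of TRASH.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Check the tools named in the requirements above.
missing=$(missing_tools Rscript mafft nhmmer)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
else
    echo "All required tools found on PATH"
fi
```

If mafft or nhmmer are reported as missing but are installed locally, their paths can instead be set in src/main.R as described above.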
TRASH is run through the TRASH.R script found in the /src/ directory, with the fasta file and output directory as arguments:
-o --output output directory
-f --fasta file to process
-p --cores_no number of cores for parallel run, default: 1
-m --max_rep_size maximum repeat size, default: 1000
-i --min_rep_size minimum repeat size, default: 7
-t --templates fasta file with repeat templates and their names
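As an illustration, a call using the flags listed above might be assembled as follows. The file name genome.fasta, the directory trash_output and the core count are hypothetical placeholders; the -m and -i values simply repeat the documented defaults. The script is only launched if TRASH.R and the input file are present in the current directory.

```shell
# Hypothetical input and output locations; replace with your own.
fasta="genome.fasta"
outdir="trash_output"
cores=4

# Assemble the argument list from the documented flags.
set -- -f "$fasta" -o "$outdir" -p "$cores" -m 1000 -i 7
echo "TRASH.R $*"

# Launch only when the script and the input file actually exist.
if [ -x ./TRASH.R ] && [ -f "$fasta" ]; then
    # Creating the directory up front is an assumption made here for safety;
    # it is harmless if TRASH creates the directory itself.
    mkdir -p "$outdir"
    ./TRASH.R "$@"
fi
```

A templates file can be supplied by appending -t followed by the path to a fasta file of named repeat templates.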
├── [fasta_file]
│ ├── [fasta_file]_repeats_with_seq.csv main output file with identified repeats
│ ├── [fasta_file]_repeats.gff main output repeat file in gff format
│ ├── [fasta_file]_repeats.csv main output file with identified repeats without sequence column
│ ├── [fasta_file]_arrays.csv repeat arrays, start and end are not perfectly aligned with repeats, but can be used to get locations of repeats without loading in potentially big repeat files
│ ├── [fasta_file]_arrays.gff repeat arrays as above, in gff format
│ ├── [fasta_file]_run_time.csv report of the script run time
│ ├── [fasta_file]_regarrays.csv temp file, can be removed
│ ├── [fasta_file]_aregarrays.csv temp file, can be removed
│ ├── [fasta_file]_classarrays.csv temp file, can be removed
│ └── [fasta_file]_no_repeats_arrays.csv temp file, can be removed
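For a quick look at the results without loading the larger CSV files, the GFF output can be summarised on the command line. The sketch below counts features per sequence and assumes the file follows the standard tab-separated GFF layout, with the sequence name in the first column; the function name count_per_seq is a hypothetical helper.

```shell
# Count GFF features per sequence name (column 1),
# skipping comment lines and lines that are not tab-separated records.
count_per_seq() {
    awk -F '\t' '!/^#/ && NF > 1 { n[$1]++ }
                 END { for (s in n) print s "\t" n[s] }' "$1" | sort
}
```

It takes the path to a GFF file as its only argument, for example the [fasta_file]_repeats.gff or [fasta_file]_arrays.gff file from the output directory.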
The HORT.R command should be used instead of TRASH.R, with the following arguments:
-o --output_folder character, required
-t --hor_threshold integer, optional
-l --hor_min_len integer, optional
-c --class character, required
-r --repeats character, required
-m --method integer, required
-A --chrA character, required
-B --chrB character, optional
-b --repeatsB character, optional
-C --classB character, optional
-g --genomeA character, required
-G --genomeB character, optional
-s --saveR character, optional
-p --plot_simple character, optional