Dereplicate looooooong sequences!
If you want to get rid of duplicate long sequences (i.e. contigs that are exact substrings of some other contigs), derep_seqs is the tool for you!
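To make "exact substring" concrete, here is a naive sketch of the idea in shell and awk. It is not how derep_seqs works internally (it is quadratic and would be far too slow for real assemblies), and `naive_derep` is a hypothetical helper name. It also assumes that when several sequences are identical, only the first copy is kept, which may not match the tool's behavior.

```shell
# Naive illustration only: NOT the derep_seqs algorithm.
# naive_derep (hypothetical helper) prints every FASTA record whose sequence
# is not an exact substring of another record's sequence.
# Assumption of this sketch: among identical sequences, the first one is kept.
naive_derep() {
  awk '
    /^>/  { n++; header[n] = $0; seq[n] = ""; next }
    n > 0 { seq[n] = seq[n] $0 }            # join wrapped sequence lines
    END {
      for (i = 1; i <= n; i++) {
        keep = 1
        for (j = 1; j <= n && keep; j++) {
          if (i == j) continue
          if (index(seq[j], seq[i]) == 0) continue   # seq i is not inside seq j
          # seq i is contained in seq j: drop it, unless the two are identical
          # and seq i is the earlier copy.
          if (length(seq[i]) < length(seq[j]) || j < i) keep = 0
        }
        if (keep) print header[i] "\n" seq[i]
      }
    }
  ' "$1"
}
```

Running `naive_derep contigs.fasta` on a small test file is a handy way to sanity-check what you expect the output to look like.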
Download the source code (either with git clone or by downloading a release), cd into the source directory, and then use make to build it.
git clone https://github.com/mooreryan/derep_seqs.git
cd derep_seqs
make
This builds derep_seqs and places it in the bin directory inside the source directory. You can then move derep_seqs and sort_fasta to somewhere on your PATH if you'd like.
derep_seqs <num worker threads> <seqs.fasta> > seqs.derep.fa
The fasta file must be sorted by increasing sequence length. The program sort_fasta (included in the bin directory) will do this for you.
$ bin/derep_seqs 10 <(bin/sort_fasta contigs.fasta) > contigs.derep.fa
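Since the input has to be sorted by increasing sequence length, it can be worth checking the ordering before a long run. The following is a minimal sketch, assuming a plain FASTA file with `>` header lines. `check_length_sorted` is a hypothetical helper and is not shipped with derep_seqs.

```shell
# check_length_sorted (hypothetical helper, not part of derep_seqs):
# returns 0 if sequence lengths never decrease from one record to the next,
# and 1 otherwise. Wrapped (multi-line) sequences are handled.
check_length_sorted() {
  awk '
    # Compare the record that just ended against the one before it.
    function finish() {
      if (have && len < prev) bad = 1
      prev = len
    }
    /^>/ { finish(); have = 1; len = 0; next }   # a new record starts
         { len += length($0) }                   # accumulate sequence length
    END  { finish(); exit (bad ? 1 : 0) }
  ' "$1"
}

# Example use (file names are placeholders):
#   bin/sort_fasta contigs.fasta > contigs.sorted.fa
#   check_length_sorted contigs.sorted.fa && \
#     bin/derep_seqs 10 contigs.sorted.fa > contigs.derep.fa
```

Writing the sorted file to disk first, rather than using process substitution, lets you run the check and keep the sorted copy around.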
That's it!
Exit codes:
- 0: Success
- 1: Argument error
- 2: Couldn't open a file
- 3: Error creating thread
- 4: Error joining thread
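If you call derep_seqs from a script, the statuses listed above can be turned into readable messages. This is a small sketch: `describe_derep_status` is a hypothetical helper, and the fallback branch for other values is an addition of this example, not something the program documents.

```shell
# describe_derep_status (hypothetical helper): translate a derep_seqs exit
# status into the message from the list above.
describe_derep_status() {
  case "$1" in
    0) echo "Success" ;;
    1) echo "Argument error" ;;
    2) echo "Couldn't open a file" ;;
    3) echo "Error creating thread" ;;
    4) echo "Error joining thread" ;;
    *) echo "Unknown status $1" ;;   # undocumented value; fallback for this sketch
  esac
}

# Example use (file names are placeholders):
#   bin/derep_seqs 10 contigs.sorted.fa > contigs.derep.fa
#   status=$?
#   [ "$status" -eq 0 ] || echo "derep_seqs failed: $(describe_derep_status "$status")" >&2
```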
Version history:
- v0.1.0: First release
- v0.2.0: Sort on decreasing seq length. Use greedy algorithm. Prefilter. Use hash3 instead of SSEF.
- v0.3.0: Use hashing for prefiltering.
- v0.4.0: Don't store hash vals...uses way less memory :) but it's slow again :(
- v0.5.0: Use pthreads for multithreading!
- v0.6.0: Make prefilter length a tunable option
- v0.7.0: Use Rabin-Karp search for filtering