MILAB-3578 Refactor out easy-cluster #51
Open
Description
This PR refactors the `mmseqs easy-cluster` command into three separate, equivalent calls: `mmseqs createdb`, `mmseqs cluster`, and `mmseqs createtsv`. This change ensures compatibility with Windows environments, because the `easy-cluster` workflow relies on a temporary `.sh` script that breaks on Windows.

Because `mmseqs createdb` generates multiple `input.db.*` files and `mmseqs cluster` generates a variable number of `cluster.db.[int]` files (as many as there are threads), we need to use `saveFileSet()`, `getFileSet()` and `addFiles()`. Currently, this requires separating the `exec.builder()` calls of the commands into different templates, so that when a template is rendered, the file sets resolve into a proper map that can be passed to `addFiles()` in the next step.

The changes also include fixes to ensure that the output is identical to that of the original `easy-cluster` command and that logs for all three commands are displayed correctly.

Update: set the `mmseqs createdb` flags to `--createdb-mode 0`, to avoid soft links, and `--shuffle 0`, for result consistency

The stdout from `mmseqs easy-cluster` shows that it runs with `--shuffle 1` and `--createdb-mode 1`. These are actually incompatible, and `--shuffle` is automatically set to 0 in this case.

Additionally, `--createdb-mode 1` seems to cause problems on Windows because it uses soft links, so it is set to 0. With `--createdb-mode 0`, `--shuffle 1` influences the results, so `--shuffle` must be set to 0 to match the `mmseqs easy-cluster` results.

The new flags should resolve the file access error on Windows while preserving the previous clustering results.
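
As a rough illustration of the three-step sequence described above, the sketch below assembles the argument lists for `mmseqs createdb`, `mmseqs cluster`, and `mmseqs createtsv`. It is not the template code from this PR (which uses `exec.builder()`): the helper name `build_mmseqs_commands`, the file names, the `tmp` directory, and the thread count are all hypothetical, and any clustering parameters the real workflow passes are omitted.

```python
from typing import List


def build_mmseqs_commands(
    fasta: str,
    seq_db: str = "input.db",       # hypothetical name for the sequence DB
    clu_db: str = "cluster.db",     # hypothetical name for the cluster DB
    tsv: str = "clusters.tsv",      # hypothetical name for the TSV output
    tmp_dir: str = "tmp",           # hypothetical scratch directory
    threads: int = 4,               # hypothetical thread count
) -> List[List[str]]:
    """Return argv lists for the three calls that replace easy-cluster."""
    # Step 1: build the sequence database without soft links and
    # without shuffling, as described in the update above.
    createdb = [
        "mmseqs", "createdb", fasta, seq_db,
        "--createdb-mode", "0",
        "--shuffle", "0",
    ]
    # Step 2: cluster the sequence database.
    cluster = [
        "mmseqs", "cluster", seq_db, clu_db, tmp_dir,
        "--threads", str(threads),
    ]
    # Step 3: export the clustering as a TSV; the sequence database is
    # passed as both the query and the target database.
    createtsv = ["mmseqs", "createtsv", seq_db, seq_db, clu_db, tsv]
    return [createdb, cluster, createtsv]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # without an MMseqs2 installation.
    for argv in build_mmseqs_commands("input.fasta"):
        print(" ".join(argv))
```

Passing each list to a process runner in order (for example `subprocess.run(argv, check=True)`) avoids any shell wrapper script, which is the part of `easy-cluster` that fails on Windows.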
For context, from `mmseqs createdb --help`: