@lorcai lorcai commented Aug 8, 2025

Description

This PR refactors the mmseqs easy-cluster command into three separate, equivalent calls: mmseqs createdb, mmseqs cluster, and mmseqs createtsv. This change ensures compatibility with Windows environments, since the easy-cluster workflow relies on a temporary .sh script that breaks on Windows.
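As a rough illustration of the decomposition (not the code in this PR), the three calls could be assembled as argument lists like this. The function name, the tmp directory, and the cluster.tsv output name are assumptions made for the example; input.db and cluster.db follow the file names mentioned in this description. Flags are left out here.

```python
from pathlib import Path


def build_cluster_commands(fasta, workdir):
    """Return the three mmseqs calls as argument lists, in execution order.

    Hypothetical helper for illustration only; paths under `workdir`
    (tmp, cluster.tsv) are assumed names, not taken from the PR.
    """
    workdir = Path(workdir)
    seq_db = str(workdir / "input.db")    # sequence database from createdb
    clu_db = str(workdir / "cluster.db")  # clustering result from cluster
    tmp_dir = str(workdir / "tmp")        # scratch directory for cluster
    tsv = str(workdir / "cluster.tsv")    # final table from createtsv

    return [
        # 1. Convert the FASTA input into an MMseqs2 sequence database.
        ["mmseqs", "createdb", str(fasta), seq_db],
        # 2. Cluster the sequence database.
        ["mmseqs", "cluster", seq_db, clu_db, tmp_dir],
        # 3. Export the clustering as TSV (query DB and target DB are the same).
        ["mmseqs", "createtsv", seq_db, seq_db, clu_db, tsv],
    ]


if __name__ == "__main__":
    for cmd in build_cluster_commands("seqs.fasta", "work"):
        print(" ".join(cmd))
```

Each list can be passed to a process runner directly, without a shell, which sidesteps the temporary .sh script that easy-cluster depends on.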

Because mmseqs createdb generates multiple input.db.* files and mmseqs cluster generates a variable number of cluster.db.[int] files (as many as there are threads), we need to use saveFileSet(), getFileSet() and addFiles(). Currently, this requires separating the exec.builder() calls for the commands into different templates so that, when each template is rendered, the file sets resolve into a proper map that can be passed to addFiles() in the next step.
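To show why a file set is needed rather than a fixed list of outputs, here is a minimal sketch that gathers every file belonging to a database prefix into a name-to-path map. It is a stand-in written for this description, not the saveFileSet()/getFileSet()/addFiles() API itself, and collect_file_set is a hypothetical name.

```python
from pathlib import Path


def collect_file_set(directory, prefix):
    """Map file name -> path for `prefix` and every `prefix.*` file.

    Hypothetical helper: the number of matches is not known in advance,
    e.g. cluster.db.0 ... cluster.db.N depends on the thread count.
    """
    directory = Path(directory)
    return {
        path.name: path
        for path in sorted(directory.iterdir())
        if path.is_file()
        and (path.name == prefix or path.name.startswith(prefix + "."))
    }
```

Calling collect_file_set(workdir, "cluster.db") after the cluster step would pick up however many cluster.db.[int] files were written, so the next step does not need to know the thread count.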

The changes also include fixes to ensure that the output is identical to the original easy-cluster command and that logs for all three commands are correctly displayed.

Update:

  • Modify mmseqs createdb flags to --createdb-mode 0 to avoid soft links, and --shuffle 0 for result consistency

The stdout from mmseqs easy-cluster shows that it runs with --shuffle 1 and --createdb-mode 1. These are actually incompatible, and in this case --shuffle is automatically set to 0.

Additionally, --createdb-mode 1 seems to cause problems on Windows due to its use of soft links, so it is set to 0. With --createdb-mode 0, --shuffle 1 does influence the results, so it must be set to 0 to match the mmseqs easy-cluster results.

The new flags should solve the file access error on Windows while preserving the previous clustering results.

For context, from mmseqs createdb --help:

--shuffle BOOL        Shuffle input database [1]
--createdb-mode INT   Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]
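The flag interaction described above can be modeled in a few lines. This is only a sketch of the behavior as reported in this description (shuffle is forced off when --createdb-mode is 1), not MMseqs2 source code, and both function names are hypothetical.

```python
def effective_shuffle(createdb_mode, shuffle):
    """Return the shuffle value that actually takes effect.

    Models the behavior reported in this PR: with --createdb-mode 1
    (soft link), --shuffle is automatically set to 0.
    """
    if createdb_mode == 1:
        return 0
    return shuffle


def createdb_flags(createdb_mode=0, shuffle=0):
    """Build the createdb flag list; defaults are the values chosen in this PR."""
    return ["--createdb-mode", str(createdb_mode), "--shuffle", str(shuffle)]
```

Under this model, the easy-cluster settings (mode 1, shuffle 1) and the new explicit settings (mode 0, shuffle 0) both end up with shuffling disabled, which is why the clustering results are expected to match.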
