MILAB-3578 Refactor out easy-cluster #51

lorcai · 2025-08-08T11:15:30Z

Description

This PR refactors the mmseqs easy-cluster command into three separate, equivalent calls: mmseqs createdb, mmseqs cluster, and mmseqs createtsv. The change ensures compatibility with Windows environments, because the easy-cluster workflow relies on a temporary .sh script that fails on Windows.
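
As a rough illustration of the split, the three calls could be assembled as below. This is a minimal Python sketch rather than the PR's template code: the helper name `build_cluster_commands`, the `tmp` directory, and the `clusters.tsv` output name are assumptions, while `input.db` and `cluster.db` follow the file names mentioned in this description. The commands are only built, not executed.

```python
from typing import List


def build_cluster_commands(
    fasta: str,
    seq_db: str = "input.db",
    cluster_db: str = "cluster.db",
    tmp_dir: str = "tmp",            # hypothetical scratch directory
    tsv_out: str = "clusters.tsv",   # hypothetical output name
) -> List[List[str]]:
    """Return the three mmseqs calls that stand in for easy-cluster."""
    return [
        # 1. Convert the FASTA input into an MMseqs2 sequence database.
        ["mmseqs", "createdb", fasta, seq_db],
        # 2. Cluster the sequence database; mmseqs needs a tmp directory.
        ["mmseqs", "cluster", seq_db, cluster_db, tmp_dir],
        # 3. Export the cluster assignments as a TSV.
        ["mmseqs", "createtsv", seq_db, seq_db, cluster_db, tsv_out],
    ]


if __name__ == "__main__":
    for cmd in build_cluster_commands("input.fasta"):
        print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` one after another, so no intermediate shell script is involved.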

The mmseqs createdb command generates multiple input.db.* files, and mmseqs cluster generates a variable number of cluster.db.[int] files (one per thread), so we need to use saveFileSet(), getFileSet() and addFiles(). Currently, this requires separating the exec.builder() calls of the commands into different templates, so that when a template is rendered the file sets are resolved into a proper map that can be passed to addFiles() in the next step.
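
To make the file-set idea concrete, the sketch below gathers every file that belongs to one database prefix into a name-to-path map, which is the kind of structure the next step needs. It is a plain Python analogy under stated assumptions, not the implementation of saveFileSet() or getFileSet(); `collect_file_set` is a hypothetical helper, and it only follows the `prefix` and `prefix.*` naming patterns mentioned in this description.

```python
from pathlib import Path
from typing import Dict


def collect_file_set(directory: str, prefix: str) -> Dict[str, Path]:
    """Map file name -> path for `prefix` and every `prefix.*` file."""
    files: Dict[str, Path] = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # Keep the database file itself plus its dotted companions,
        # e.g. cluster.db.0, cluster.db.1, ... (one per thread).
        if path.name == prefix or path.name.startswith(prefix + "."):
            files[path.name] = path
    return files
```

Because the number of `cluster.db.[int]` files depends on the thread count, matching by prefix avoids hard-coding the list of outputs.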

The changes also include fixes to ensure that the output is identical to the original easy-cluster command and that logs for all three commands are correctly displayed.

Update:

Set the mmseqs createdb flags to --createdb-mode 0 to avoid soft links, and to --shuffle 0 for result consistency.

The stdout from mmseqs easy-cluster shows that it runs with --shuffle 1 and --createdb-mode 1. These two settings are incompatible, and in that case --shuffle is automatically set to 0.

Additionally, --createdb-mode 1 seems to cause problems on Windows because it uses soft links, so it is set to 0. With --createdb-mode 0, --shuffle 1 does influence the results, so it must be set to 0 to match the mmseqs easy-cluster results.

The new flags should resolve the file access error on Windows while preserving the previous clustering results.

For context, from mmseqs createdb --help:

--shuffle BOOL        Shuffle input database [1]
--createdb-mode INT   Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]
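
The flag reasoning above can be summarized in a few lines. This is a toy Python model of the behavior described in this PR, not MMseqs2 code; `effective_shuffle` is a hypothetical name introduced only for illustration.

```python
def effective_shuffle(createdb_mode: int, shuffle: int) -> int:
    """Shuffle value that actually applies, per the behavior described above."""
    # As observed in this PR: with --createdb-mode 1 (soft links),
    # --shuffle is automatically reset to 0.
    if createdb_mode == 1:
        return 0
    return shuffle


# easy-cluster ran with --createdb-mode 1 --shuffle 1 ...
old = effective_shuffle(createdb_mode=1, shuffle=1)
# ... and the refactored createdb call passes --createdb-mode 0 --shuffle 0.
new = effective_shuffle(createdb_mode=0, shuffle=0)
assert old == new == 0
```

Under this model, switching to --createdb-mode 0 while leaving --shuffle at its default of 1 would change the effective shuffle from 0 to 1, which is why both flags are set explicitly.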

lorcai added 5 commits August 7, 2025 19:08

Refactor easy-cluster for separate mmseqs commands

3543bc4

Fix to make results identical to easy-cluster

908151a

Fix to show logs for the 3 commands

71722d9

changeset

a0ac1bd

Avoid soft links during mmseqs createdb step

444e5f7
