Description
Expected behavior
The user should be able to invoke the metagenomics workflow from any arbitrary directory, assuming the correct absolute or relative path to the relevant Snakefile(s) is indicated. The user should not be required to copy data into a specific directory within the metagenomics workflow code distribution, nor should they be required to rename input files.
Actual behavior
Currently, some (or all) of the workflows require the user to copy data into a particular directory in the code distribution. It appears that input data files are expected to adhere to particular file naming conventions. Both requirements introduce an unnecessary logistical burden on the user.
I recently encountered a similar issue in a Snakemake workflow I was implementing. Two requirements seemed to be at odds:
- ease-of-use for end users (not making them jump through too many hoops to run the workflow)
- setting up sufficient constraints on input data to enable processing with a Snakemake workflow
My first thought was to restrict input filenames to specific patterns as well, but after a bit of work I was able to come up with a different approach that requires a lot less of the user.
- The user specifies the input files in the `config.json` configfile. This is very flexible: you can list an arbitrary number of input samples, and an arbitrary number of input files per sample. There is no need to require that the filenames have a particular extension.
- The first rule implemented by the Snakemake workflow creates symlinks to the input files in the workflow working directory (configurable with snakemake's `--directory` flag). The symlinks are named in a standardized fashion that I as the workflow developer decide. (While I generally prefer to implement Snakemake rules as `shell` commands, I implemented this rule in Python so I could more easily handle the input configuration dynamically.)
- All subsequent steps in the workflow point to these symlinks instead of the user-specified input files.
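To make the symlink idea concrete, here is a minimal sketch in plain Python. It is not the code from my Snakefile: the function names, the sample-to-paths mapping, and the `inputs/{sample}_{n}.in` naming scheme are all assumptions made up for illustration.

```python
import os


def link_plan(samples, workdir):
    """Map standardized symlink paths to the user-supplied input paths.

    `samples` is a hypothetical mapping of sample name -> list of input
    file paths, e.g. as it might be read from a config file.
    """
    plan = {}
    for sample, paths in sorted(samples.items()):
        for n, path in enumerate(paths, start=1):
            # The naming scheme is the developer's choice (assumed here);
            # the user's original filename and extension are ignored.
            link = os.path.join(workdir, "inputs", f"{sample}_{n}.in")
            plan[link] = path
    return plan


def create_links(plan):
    """Create (or refresh) each symlink so it points at its input file."""
    for link, target in plan.items():
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if os.path.islink(link):
            os.remove(link)  # replace a stale link from an earlier run
        os.symlink(os.path.abspath(target), link)
```

In a Snakefile, something along these lines could be called from the `run:` block of the first rule, with the mapping taken from the `config` dictionary; later rules would then use only the standardized link paths as their inputs.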
The Snakefile is here in case you're interested, with the corresponding config template here. In this example config, all the input BAM files are in the same directory and have the same extension, but as implemented, the workflow would still work if each file were in a different directory and had a non-standard name.