Do not require copying/linking/renaming input data

### Expected behavior

The user should be able to invoke the metagenomics workflow from any directory, provided the correct absolute or relative path to the relevant Snakefile(s) is given. The user should not be required to copy data into a specific directory within the metagenomics workflow code distribution, nor should they be required to rename input files.

### Actual behavior

Currently, some (or all) of the workflows require the user to copy data into a particular directory in the code distribution. Input data files also appear to be expected to follow particular file naming conventions. Both requirements place an unnecessary logistical burden on the user.

-----------

I recently encountered a similar issue in a Snakemake workflow I was implementing. Two requirements seemed to be at odds:

- ease-of-use for end users (not making them jump through too many hoops to run the workflow)
- setting up sufficient constraints on input data to enable processing with a Snakemake workflow

My first thought was to restrict input filenames to specific patterns as well, but after a bit of work I was able to come up with a different approach that requires a lot less of the user.

- The user specifies the input files in the `config.json` configfile. This is very flexible: the config can list an arbitrary number of input samples and an arbitrary number of input files per sample. There is no need to require that the filenames have a particular extension.
- The first rule implemented by the Snakemake workflow creates symlinks to the input files in the workflow working directory (configurable with Snakemake's `--directory` flag). The symlinks are named in a standardized fashion that I, as the workflow developer, decide. (While I generally prefer to implement Snakemake rules as `shell` commands, I implemented *this* rule in Python so I could more easily handle the input configuration dynamically.)
- All subsequent steps in the workflow point to these symlinks instead of the user-specified input files.
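
To make the symlinking step concrete, here is a minimal sketch in plain Python of what such a rule's body might do. It is not the code from the linked Snakefile: the config layout (`samples`, `label`, `files`), the function name `link_inputs`, and the `input/<label>.<n>.dat` naming scheme are all hypothetical choices made for illustration.

```python
import os


def link_inputs(config, workdir):
    """Symlink every configured input file into workdir under a standardized name.

    `config` is assumed (hypothetically) to look like:
        {"samples": [{"label": "s1", "files": ["/any/path/reads.whatever"]}]}
    """
    links = []
    for sample in config["samples"]:
        for i, path in enumerate(sample["files"]):
            # Standardized name chosen by the workflow developer, independent
            # of the original file's directory, name, or extension.
            link = os.path.join(workdir, "input", f"{sample['label']}.{i}.dat")
            os.makedirs(os.path.dirname(link), exist_ok=True)
            # Replace a stale link from a previous run, if any.
            if os.path.lexists(link):
                os.remove(link)
            # Link to the absolute path so the symlink stays valid no matter
            # which directory the workflow is invoked from.
            os.symlink(os.path.abspath(path), link)
            links.append(link)
    return links
```

In a real Snakefile, logic like this would sit inside a `run:` block of the first rule, and downstream rules would take the standardized link names as their inputs, so they never need to know where the user's files actually live.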

The Snakefile is [here](https://github.com/dib-lab/kevlar/blob/76a82fc029b2c271c61593ab4fd7fc58acaf2062/kevlar/workflows/bam-preproc/Snakefile) in case you're interested, with the corresponding config template [here](https://github.com/dib-lab/kevlar/blob/76a82fc029b2c271c61593ab4fd7fc58acaf2062/kevlar/workflows/bam-preproc/config.json). In this example config, all the input BAM files are in the same directory and have the same extension, but the way this workflow is implemented, it would still work if each file were in a different directory and had a non-standard name.
