An automated bioinformatics pipeline for processing Illumina sequencing data using Bactopia. This project automatically detects new samples in an input directory, tracks their processing status, and runs Bactopia analysis on unprocessed samples using Dagster orchestration.
The pipeline works by:
- Sample Discovery: Scanning the input directory for new FASTQ files
- Sample Tracking: Storing sample metadata in a DuckDB database
- Dynamic Processing: Creating dynamic partitions for each unprocessed sample
- Automated Analysis: Running Bactopia on each sample via a sensor-triggered job
- Progress Monitoring: Providing visual reports of processing status
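The discovery and tracking steps above can be sketched in plain Python. This is only an illustration, not the project's code: the real pipeline delegates discovery to Bactopia's prepare step and keeps known samples in DuckDB. The `_R1`/`_R2` filename pattern and the helper names `find_paired_samples` and `select_new_samples` are assumptions made for this example.

```python
import re
from pathlib import Path

# Assumed naming scheme: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz,
# optionally with an Illumina-style "_001" suffix before the extension.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_R(?P<read>[12])(?:_001)?\.(?:fastq|fq)(?:\.gz)?$"
)


def find_paired_samples(input_dir: Path) -> dict:
    """Map sample name -> {"r1": path, "r2": path} for complete pairs only."""
    reads = {}
    for path in sorted(input_dir.iterdir()):
        match = FASTQ_PATTERN.match(path.name)
        if match:
            reads.setdefault(match["sample"], {})["r" + match["read"]] = path
    # Drop samples that are missing one of the two read files.
    return {
        name: files
        for name, files in reads.items()
        if {"r1", "r2"} <= set(files)
    }


def select_new_samples(found, known):
    """Return sample names that were found on disk but are not yet tracked."""
    return sorted(set(found) - set(known))
```

Calling `select_new_samples(find_paired_samples(Path("data/illumina_workflow/input")), known)` with `known` taken from the tracking database would give the samples that still need a partition.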
- Bactopia: Bacterial genome analysis pipeline
- Dagster: Data orchestration platform
- DuckDB: Embedded analytical database
- Pixi: Package and environment management
ubio_autobox/
├── data/
│ ├── database/ # DuckDB databases
│ └── illumina_workflow/
│ ├── input/ # Place FASTQ files here
│ └── output/ # Bactopia results
├── ubio_autobox/
│ ├── assets/
│ │ └── illumina_workflow.py # Main pipeline assets
│ └── definitions.py # Dagster definitions
├── pixi.toml # Environment and dependencies
└── pyproject.toml # Python project configuration
- Pixi for environment management
- Docker (for Bactopia execution)
- Clone the repository:
    git clone https://github.com/ssi-dk/ubio_autobox/
    cd ubio_autobox
- Set up the environment with Pixi:
    pixi install
- Activate the Pixi environment:
    pixi shell
- Install the package in development mode:
    pip install -e ".[dev]"
- Start the Dagster UI:
    dagster dev
- Access the web interface: open http://localhost:3000 in your browser
- Add FASTQ files: place your Illumina FASTQ files (R1/R2 pairs) in data/illumina_workflow/input/
- Monitor processing: the sensor automatically detects new samples and creates processing jobs
- illumina_samples_in_folder: Discovers FASTQ files using bactopia-prepare
- new_illumina_samples: Identifies new samples by comparing against the database
- unprocessed_illumina_samples: Retrieves samples that have not been processed yet
- run_unprocessed_illumina_sample: Processes individual samples (partitioned)
- illumina_samples_plot: Generates processing status reports
The pipeline uses dynamic partitions to process each sample independently:
- Each unprocessed sample gets its own partition
- Samples can be processed in parallel
- Failed samples don't block others
The unprocessed_illumina_samples_sensor automatically:
- Detects new unprocessed samples
- Creates dynamic partitions
- Triggers processing jobs
By default, the pipeline looks for FASTQ files in ./data/illumina_workflow/input. You can change this path in the asset configuration.
The pipeline runs Bactopia with the Docker profile. The command can be customized in the run_bactopia function in illumina_workflow.py.
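For orientation, a per-sample Bactopia invocation with the Docker profile might be assembled roughly as below. This is a sketch under assumptions: the helper name `build_bactopia_command` and the example paths are hypothetical, and the actual run_bactopia function may pass different or additional options.

```python
import shlex
from pathlib import Path


def build_bactopia_command(sample, r1, r2, outdir, profile="docker"):
    """Build the argument list for one paired-end Bactopia run."""
    return [
        "bactopia",
        "--sample", sample,
        "--r1", str(r1),
        "--r2", str(r2),
        "--outdir", str(outdir),
        "-profile", profile,  # Nextflow profile selecting Docker execution
    ]


cmd = build_bactopia_command(
    "sample01",
    Path("data/illumina_workflow/input/sample01_R1.fastq.gz"),
    Path("data/illumina_workflow/input/sample01_R2.fastq.gz"),
    Path("data/illumina_workflow/output"),
)
# Print a copy-pasteable shell form of the command for inspection.
print(shlex.join(cmd))
```

Keeping the command as a list means it can be handed to `subprocess.run(cmd, check=True)` without shell quoting issues.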
Sample metadata is stored in DuckDB at ./data/database/seqsample.duckdb. The database tracks:
- Sample names and file paths
- Species and genome size information
- Processing status
- Metadata
Add new dependencies to pixi.toml:
[dependencies]
new-package = ">=1.0.0"
Then run:
pixi install
Run the tests with:
pytest ubio_autobox_tests
The project uses standard Python linting. Run checks with:
pixi run lint # if configured
- Copy FASTQ files to the input directory:
    cp /path/to/sample_R1.fastq.gz /path/to/sample_R2.fastq.gz data/illumina_workflow/input/
- The sensor automatically detects and processes them within the sensor interval
- Monitor progress in the Dagster UI at http://localhost:3000
You can also manually trigger processing:
- In the Dagster UI, go to the "Assets" tab
- Materialize illumina_samples_in_folder to discover new samples
- Materialize new_illumina_samples to update the database
- Use the "Jobs" tab to run individual sample processing jobs
- Bactopia not found: Ensure Bactopia is installed and accessible in the Pixi environment
- Docker issues: Make sure Docker is running for Bactopia execution
- Permission errors: Check that the pipeline has write access to output directories
- Database errors: Ensure the DuckDB database directory (data/database/) exists and is writable
Check Dagster logs in the UI or run with verbose logging:
dagster dev --log-level DEBUG
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.