ubio_autobox

An automated bioinformatics pipeline for processing Illumina sequencing data using Bactopia. This project automatically detects new samples in an input directory, tracks their processing status, and runs Bactopia analysis on unprocessed samples using Dagster orchestration.

Overview

The pipeline works by:

Sample Discovery: Scanning the input directory for new FASTQ files
Sample Tracking: Storing sample metadata in a DuckDB database
Dynamic Processing: Creating dynamic partitions for each unprocessed sample
Automated Analysis: Running Bactopia on each sample via a sensor-triggered job
Progress Monitoring: Providing visual reports of processing status

Tech Stack

Bactopia: Bacterial genome analysis pipeline
Dagster: Data orchestration platform
DuckDB: Embedded analytical database
Pixi: Package and environment management

Project Structure

ubio_autobox/
├── data/
│   ├── database/           # DuckDB databases
│   └── illumina_workflow/
│       ├── input/          # Place FASTQ files here
│       └── output/         # Bactopia results
├── ubio_autobox/
│   ├── assets/
│   │   └── illumina_workflow.py  # Main pipeline assets
│   └── definitions.py      # Dagster definitions
├── pixi.toml              # Environment and dependencies
└── pyproject.toml         # Python project configuration

Getting Started

Prerequisites

Pixi for environment management
Docker (for Bactopia execution)

Installation

Clone the repository:

```
git clone https://github.com/ssi-dk/ubio_autobox/
cd ubio_autobox
```

Set up the environment with Pixi:
```
pixi install
```
Activate the Pixi environment:
```
pixi shell
```
Install the package in development mode:
```
pip install -e ".[dev]"
```

Running the Pipeline

Start the Dagster UI:
```
dagster dev
```
Access the web interface: Open http://localhost:3000 in your browser
Add FASTQ files: Place your Illumina FASTQ files (R1/R2 pairs) in data/illumina_workflow/input/
Monitor processing: The sensor will automatically detect new samples and create processing jobs
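
To illustrate what an "R1/R2 pair" means for files in the input directory, here is a minimal sketch that groups files by sample name. It is not how the pipeline discovers samples (that is done with bactopia-prepare); the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming pattern and the `find_fastq_pairs` helper are assumptions made for this example.

```python
import re
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def find_fastq_pairs(input_dir):
    """Return {sample: {"r1": Path, "r2": Path}} for complete pairs only."""
    found = {}
    for path in sorted(Path(input_dir).iterdir()):
        match = FASTQ_PATTERN.match(path.name)
        if match:
            # Group both mates of a sample under the same key
            found.setdefault(match["sample"], {})["r" + match["read"]] = path
    # A sample is complete only when both R1 and R2 were seen
    return {sample: reads for sample, reads in found.items() if len(reads) == 2}
```

Calling `find_fastq_pairs("data/illumina_workflow/input")` would list the samples that have both mates present. Files that follow a different naming scheme are ignored by this sketch.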

Pipeline Assets

Core Assets

illumina_samples_in_folder: Discovers FASTQ files using bactopia-prepare
new_illumina_samples: Identifies new samples by comparing with database
unprocessed_illumina_samples: Retrieves samples that haven't been processed
run_unprocessed_illumina_sample: Processes individual samples (partitioned)
illumina_samples_plot: Generates processing status reports

Dynamic Partitioning

The pipeline uses dynamic partitions to process each sample independently:

Each unprocessed sample gets its own partition
Samples can be processed in parallel
Failed samples don't block others
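
The properties listed above (one unit of work per sample, parallel execution, isolated failures) can be illustrated with a plain-Python analogy. This is not Dagster code and not the project's implementation; `process_sample` is a hypothetical stand-in for the real Bactopia run, and in the actual pipeline Dagster's dynamic partitions provide this behaviour.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample):
    """Hypothetical stand-in for the real per-sample Bactopia run."""
    if sample == "bad_sample":
        raise RuntimeError(f"analysis failed for {sample}")
    return f"{sample}: done"


def process_independently(samples, max_workers=4):
    """Run each sample as its own unit of work and collect per-sample outcomes."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every sample up front so they can run in parallel
        futures = {sample: pool.submit(process_sample, sample) for sample in samples}
        for sample, future in futures.items():
            try:
                outcomes[sample] = ("success", future.result())
            except Exception as exc:
                # A failure is recorded for this sample only; the others continue
                outcomes[sample] = ("failed", str(exc))
    return outcomes
```

With the input `["s1", "bad_sample", "s2"]`, the failing sample is reported as failed while `s1` and `s2` still complete.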

Sensor

The unprocessed_illumina_samples_sensor automatically:

Detects new unprocessed samples
Creates dynamic partitions
Triggers processing jobs
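
The three steps above can be sketched as a single sensor evaluation written as a pure function. This is a simplified model without Dagster imports; `TickResult`, `sensor_tick`, and the dictionary shape of a run request are hypothetical names chosen for this example, and the real sensor may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TickResult:
    """What one sensor evaluation asks the orchestrator to do."""
    new_partitions: list = field(default_factory=list)
    run_requests: list = field(default_factory=list)


def sensor_tick(unprocessed_samples, existing_partitions):
    """One evaluation of a hypothetical sensor over the current state."""
    result = TickResult()
    for sample in sorted(unprocessed_samples):
        if sample not in existing_partitions:
            # Register a partition for a sample seen for the first time
            result.new_partitions.append(sample)
        # Request one run per unprocessed sample, keyed by the sample name
        result.run_requests.append({"partition_key": sample, "run_key": sample})
    return result
```

In Dagster, a run key on a run request prevents the same key from launching more than one run across sensor evaluations, which is why the sketch keys each request by sample name. Whether the project's sensor uses run keys this way is an assumption.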

Configuration

Input Directory

By default, the pipeline looks for FASTQ files in ./data/illumina_workflow/input. You can configure this in the asset configuration.

Bactopia Settings

The pipeline runs Bactopia with the Docker profile. To customize the command, edit the run_bactopia function in illumina_workflow.py.
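
As a rough idea of what a single-sample Bactopia command with the Docker profile looks like, the sketch below assembles the argument list without running it. The helper name `build_bactopia_command` is hypothetical, and the flags follow Bactopia's single-sample command-line usage as commonly documented; check them against your installed Bactopia version and against the actual run_bactopia function before relying on them.

```python
import shlex


def build_bactopia_command(sample, r1, r2, outdir="data/illumina_workflow/output"):
    """Assemble (but do not run) a single-sample Bactopia command line."""
    return [
        "bactopia",
        "--sample", sample,
        "--r1", str(r1),
        "--r2", str(r2),
        "--outdir", str(outdir),
        "-profile", "docker",  # Nextflow option, hence the single dash
    ]


cmd = build_bactopia_command(
    "sample",
    "data/illumina_workflow/input/sample_R1.fastq.gz",
    "data/illumina_workflow/input/sample_R2.fastq.gz",
)
# Print a shell-quoted version for inspection or logging
print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces intact if it is later passed to `subprocess.run`.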

Database

Sample metadata is stored in DuckDB at ./data/database/seqsample.duckdb. The database tracks:

Sample names and file paths
Species and genome size information
Processing status
Metadata
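
A minimal sketch of how such a tracking table could behave is shown below. It uses Python's built-in sqlite3 module purely so the example runs without extra packages; the project itself uses DuckDB, and the table name, column names, and status values here are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in so the sketch has no dependencies; the project uses DuckDB.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE samples (          -- hypothetical table and column names
        sample      TEXT PRIMARY KEY,
        r1          TEXT,
        r2          TEXT,
        species     TEXT,
        genome_size INTEGER,
        status      TEXT DEFAULT 'unprocessed'
    )
    """
)


def register_samples(con, rows):
    """Insert samples that are not yet tracked; already-known names are skipped."""
    con.executemany(
        "INSERT OR IGNORE INTO samples (sample, r1, r2) VALUES (?, ?, ?)", rows
    )


def unprocessed(con):
    """Return the names of samples still waiting for analysis."""
    query = "SELECT sample FROM samples WHERE status = 'unprocessed' ORDER BY sample"
    return [row[0] for row in con.execute(query)]


register_samples(con, [("s1", "s1_R1.fastq.gz", "s1_R2.fastq.gz"),
                       ("s2", "s2_R1.fastq.gz", "s2_R2.fastq.gz")])
# Registering s1 again has no effect because the name is the primary key
register_samples(con, [("s1", "s1_R1.fastq.gz", "s1_R2.fastq.gz")])
con.execute("UPDATE samples SET status = 'processed' WHERE sample = 's1'")
print(unprocessed(con))
```

Using the sample name as the primary key is one simple way to make repeated discovery runs idempotent; the real schema may handle this differently.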

Development

Adding Dependencies

Add new dependencies to pixi.toml:

```
[dependencies]
new-package = ">=1.0.0"
```

Then run:

```
pixi install
```

Running Tests

```
pytest ubio_autobox_tests
```

Code Quality

The repository includes a ruff.toml, which suggests Ruff is used for linting. If a lint task is defined in pixi.toml, run checks with:

pixi run lint  # if configured

Usage Examples

Processing New Samples

Copy FASTQ files to the input directory:

```
cp /path/to/sample_R1.fastq.gz /path/to/sample_R2.fastq.gz data/illumina_workflow/input/
```

The sensor will automatically detect and process them within the sensor interval
Monitor progress in the Dagster UI at http://localhost:3000

Manual Processing

You can also manually trigger processing:

In the Dagster UI, go to the "Assets" tab
Materialize illumina_samples_in_folder to discover new samples
Materialize new_illumina_samples to update the database
Use the "Jobs" tab to run individual sample processing jobs

Troubleshooting

Common Issues

Bactopia not found: Ensure Bactopia is installed and accessible in the Pixi environment
Docker issues: Make sure Docker is running for Bactopia execution
Permission errors: Check that the pipeline has write access to output directories
Database errors: Ensure the DuckDB database directory (./data/database/) exists and is writable
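
Several of these issues can be checked before starting the pipeline. The sketch below is a hypothetical helper (`preflight_problems` is not part of the project) that looks for the executables on PATH and verifies that the output and database directories exist and are writable. It only checks that the docker command is present, not that the Docker daemon is running.

```python
import os
import shutil
from pathlib import Path


def preflight_problems(
    executables=("bactopia", "docker"),
    writable_dirs=("data/illumina_workflow/output", "data/database"),
):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for exe in executables:
        # shutil.which returns None when the command is not on PATH
        if shutil.which(exe) is None:
            problems.append(f"{exe} not found on PATH")
    for directory in writable_dirs:
        path = Path(directory)
        if not path.is_dir():
            problems.append(f"{path} does not exist")
        elif not os.access(path, os.W_OK):
            problems.append(f"{path} is not writable")
    return problems
```

Run it from the repository root inside the Pixi environment, for example `print(preflight_problems())`, so that the relative paths and PATH match what the pipeline sees.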

Logs

Check Dagster logs in the UI or run with verbose logging:

```
dagster dev --log-level DEBUG
```

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.
