asf-tools

A comprehensive Python toolkit for Advanced Sequencing Facility (ASF) operations at the Francis Crick Institute.

Overview

ASF Tools is a Python-based command-line application that streamlines and automates repetitive tasks within ASF operations. It provides utilities for:

  • Sequencing Data Management: Processing and organizing Illumina and Oxford Nanopore (ONT) sequencing data
  • LIMS Integration: Interfacing with Clarity LIMS for sample metadata and barcode information
  • Pipeline Automation: Creating and managing Nextflow pipeline runs for demultiplexing and analysis
  • Data Delivery: Automated symlink creation and data delivery to researchers
  • Infrastructure Management: SLURM job monitoring and SSH-based operations on Nemo

Authors: Chris Cheshire, Areda Elezi
Repository: github.com/FrancisCrickInstitute/asf-tools


User Guide

Production Usage

ASF Tools is deployed as a containerized application on Nemo. The recommended approach for production use is via the automation scripts in asf-automation-scripts.

Running via Automation Scripts

All operations must be run from the scripts folder where the config.sh file is located:

cd asf-automation-scripts/scripts
./asf_tools.sh [COMMAND] [OPTIONS]

Direct CLI Usage

For development or direct access:

# Activate the environment and sync dependencies
source .venv/bin/activate && uv sync --group dev

# Run commands
asf-tools pipeline [COMMAND] [OPTIONS]

CLI Commands

All pipeline commands are accessed via the pipeline subcommand:

asf-tools pipeline [COMMAND] [OPTIONS]

Data Pipeline Management

gen-demux-run

Creates run directories and SLURM batch scripts for demultiplexing pipelines. Supports both ONT and Illumina modes.

asf-tools pipeline gen-demux-run \
  --source_dir /path/to/raw/data \
  --target_dir /path/to/pipeline/runs \
  --mode ont \
  --pipeline_dir /path/to/nextflow/pipeline \
  --nextflow_cache /path/to/nf/cache \
  --nextflow_work /path/to/nf/work \
  --container_cache /path/to/singularity/cache \
  --runs_dir /host/path/to/runs

Required Options:

  • --source_dir: Directory containing raw sequencing data
  • --target_dir: Directory where pipeline runs will be created
  • --mode: Data type (ont, illumina, or general)
  • --pipeline_dir: Path to Nextflow pipeline code
  • --nextflow_cache: Nextflow cache directory
  • --nextflow_work: Nextflow work directory
  • --container_cache: Singularity container cache directory
  • --runs_dir: Host path for runs folder (for containerized environments)

Optional Flags:

  • --use_api: Generate samplesheets using the Clarity LIMS API
  • --contains TEXT: Filter runs by substring in folder name
  • --samplesheet_only: Only update samplesheets, don't create new runs
  • --nextflow_version VERSION: Override default Nextflow version in SLURM header

Example - ONT demultiplexing with LIMS integration:

asf-tools pipeline gen-demux-run \
  --source_dir /data/ont/raw \
  --target_dir /data/ont/demux \
  --mode ont \
  --pipeline_dir /pipelines/nanopore_demux \
  --nextflow_cache /cache/nextflow \
  --nextflow_work /work/nextflow \
  --container_cache /cache/singularity \
  --runs_dir /mnt/data/runs \
  --use_api \
  --contains "PAK"

deliver-to-targets

Creates symlinks to deliver processed data to researcher directories.

asf-tools pipeline deliver-to-targets \
  --source_dir /path/to/processed/data \
  --target_dir /path/to/delivery/area

Required Options:

  • --source_dir: Source directory (run directory for non-interactive, parent directory for interactive)
  • --target_dir: Target delivery directory

Optional Flags:

  • --host_delivery_folder: Host path for delivery when running in container
  • --interactive: Run in interactive mode to manually select runs

Example - Interactive delivery:

asf-tools pipeline deliver-to-targets \
  --source_dir /data/ont/demux \
  --target_dir /delivery/ont \
  --interactive
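
Delivery is symlink-based: the processed data stays in place and only links are created in the delivery area. A minimal sketch of that idea in Python, assuming a flat run layout; the function below is illustrative, not the tool's actual implementation:

from pathlib import Path

def deliver_run(source_dir: str, target_dir: str) -> None:
    # Symlink each item in a processed run into the delivery area
    source = Path(source_dir).resolve()
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for item in source.iterdir():
        link = target / item.name
        if not link.exists():
            link.symlink_to(item)  # data stays in place; only the link is delivered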

scan-run-state

Monitors the status of sequencing and pipeline runs, checking completion states and SLURM job status.

asf-tools pipeline scan-run-state \
  --raw_dir /path/to/raw/data \
  --run_dir /path/to/pipeline/runs \
  --target_dir /path/to/delivery/area \
  --mode ont

Required Options:

  • --raw_dir: Directory containing raw sequencing data
  • --run_dir: Directory containing pipeline runs
  • --target_dir: Data delivery directory
  • --mode: Data type (ont, illumina, or general)

Optional Flags:

  • --slurm_user: SLURM username for job status checking
  • --job_prefix: SLURM job name prefix for filtering
  • --slurm_file: Path to SLURM job output file

Samplesheet Generation

gen-viral-genomics-samplesheet

Generates samplesheets for viral genomics pipelines from FASTQ file directories.

asf-tools pipeline gen-viral-genomics-samplesheet \
  --source_dir /path/to/fastq/files \
  --target_dir /path/to/output \
  --curr-prefix /old/path/prefix \
  --new-prefix /new/path/prefix

Required Options:

  • --source_dir: Directory containing FASTQ files
  • --target_dir: Directory to write the samplesheet

Optional Flags:

  • --curr-prefix: Current path prefix to replace in FASTQ file paths
  • --new-prefix: New path prefix to substitute

Behavior (see the sketch below):

  • Creates CSV samplesheet with sample metadata
  • Each (sample_id, lane) pair becomes a row
  • Automatically detects paired-end reads
  • Sorts output by sample ID and read paths for consistency
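
A minimal sketch of the pairing logic described above. The Illumina-style filename convention and the column set are assumptions for illustration, not the pipeline's exact format:

import csv
import re
from pathlib import Path

# Assumed naming convention: <sample>_L<lane>_R<read>_001.fastq.gz
PATTERN = re.compile(r"(?P<sample>.+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$")

def build_rows(source_dir: str) -> list[dict]:
    rows: dict[tuple, dict] = {}
    for fq in sorted(Path(source_dir).glob("*.fastq.gz")):
        m = PATTERN.match(fq.name)
        if not m:
            continue
        key = (m["sample"], m["lane"])  # one row per (sample_id, lane) pair
        row = rows.setdefault(key, {"sample": m["sample"], "lane": m["lane"], "fastq_1": "", "fastq_2": ""})
        row["fastq_1" if m["read"] == "1" else "fastq_2"] = str(fq)  # paired-end detection
    # Sort by sample ID and read path for consistent output
    return sorted(rows.values(), key=lambda r: (r["sample"], r["fastq_1"]))

def write_samplesheet(rows: list[dict], out_path: str) -> None:
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["sample", "lane", "fastq_1", "fastq_2"])
        writer.writeheader()
        writer.writerows(rows)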

Data Upload

upload-report

Uploads analysis reports and metadata to database tables.

asf-tools pipeline upload-report \
  --data-file /path/to/report.pkl \
  --run-id RUN123 \
  --report-type quality_metrics \
  --upload-table reports_table

Required Options:

  • --data-file: Path to pickle file containing report data (see the sketch below)
  • --run-id: Unique run identifier
  • --report-type: Type of report being uploaded
  • --upload-table: Target database table

Optional Flags:

  • --table_override: Override default table suffix
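
The --data-file argument expects a Python pickle. A minimal sketch of producing one; the dictionary layout here is an assumption for illustration, not the required schema:

import pickle

# Illustrative payload only; the real schema is defined by the upload code
report_data = {
    "total_reads": 1_250_000,
    "mean_quality": 32.4,
}

with open("report.pkl", "wb") as handle:
    pickle.dump(report_data, handle)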

Developer Guide

Installation & Setup

Requirements

  • Python: 3.13+ (managed via asdf or pyenv)
  • UV: For fast dependency management
  • Just: For task automation
  • Operating System: Linux or macOS

Quick Setup

# Clone the repository
git clone https://github.com/FrancisCrickInstitute/asf-tools.git
cd asf-tools

# Set up development environment (creates .venv automatically)
just dev

The just dev command will:

  1. Create a .venv virtual environment if it doesn't exist
  2. Install all dependencies including development tools
  3. Activate the environment and spawn a new shell

Manual Setup

# Create virtual environment
uv venv .venv
source .venv/bin/activate

# Install dependencies
uv sync --group dev

# Verify installation
python -c "import asf_tools; print('Installation successful')"

Available Just Commands

just dev          # Set up development environment
just test         # Run pytest suite
just test-cli     # Run tests with CLI output
just lint         # Run ruff linting
just python-upgrade  # Upgrade Python version

Development Workflow

Test-Driven Development

This project follows strict TDD practices:

  1. Write tests first - Before implementing any feature
  2. Run tests frequently - Use just test after each change
  3. Maintain 100% coverage - All new code must be tested
  4. Use descriptive test names - Tests should document behavior

Code Quality Standards

Formatting & Linting:

# Format code
black .
isort .

# Check linting
ruff check .

# All checks
just lint

Testing:

# Run all tests
pytest

# Run with coverage
pytest --cov=asf_tools

# Run specific test file
pytest tests/test_specific_module.py

# Run tests with CLI output
just test-cli

Pre-commit Checklist:

  • All tests pass (just test)
  • Code is formatted (black ., isort .)
  • No linting errors (just lint)
  • New functionality has tests
  • Documentation updated if needed

Architecture

Module Overview

asf_tools/
├── api/                 # External API integrations
│   └── clarity/         # Clarity LIMS interface
├── config/              # Configuration management
├── database/            # Database models and operations
├── illumina/            # Illumina-specific data processing
├── io/                  # File I/O and data management
├── nextflow/            # Nextflow pipeline generation
├── slack/               # Slack webhook notifications
├── slurm/               # SLURM job management
└── ssh/                 # SSH connections and remote operations

Core Components

API Layer (asf_tools.api)

Clarity LIMS Integration:

  • clarity_lims.py: Direct API client for Clarity LIMS
  • clarity_helper_lims.py: High-level wrapper with domain logic
  • models.py: Pydantic models for data validation

Key functions (hypothetical usage sketch below):

  • Sample metadata retrieval
  • Barcode information extraction
  • Project and user information
  • Custom field parsing
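
A hypothetical usage sketch of the high-level wrapper. The import path, class, method, and field names below are illustrative assumptions, not the module's actual interface:

# All names here are assumptions for illustration
from asf_tools.api.clarity.clarity_helper_lims import ClarityHelperLims

lims = ClarityHelperLims()  # credentials typically come from config or environment
samples = lims.get_samples_for_run("RUN123")  # hypothetical method name
for sample in samples:
    print(sample.id, sample.barcode)  # hypothetical fields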

Data Management (asf_tools.io)

Storage Interface (sketched below):

  • Abstraction layer for local and remote file operations
  • Supports SSH-based remote operations
  • Handles permissions and directory creation
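
The local/remote split can be pictured as a shared interface. The Protocol below is an illustrative sketch of such an abstraction, not the module's actual class hierarchy:

from typing import Protocol

class Storage(Protocol):
    # Illustrative interface; asf_tools.io defines its own abstraction
    def exists(self, path: str) -> bool: ...
    def mkdir(self, path: str) -> None: ...
    def symlink(self, source: str, link: str) -> None: ...

# A local implementation would wrap pathlib/os calls; an SSH-backed one
# would run the equivalent commands remotely on Nemo.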

Data Management:

  • Pipeline state monitoring
  • Run completion detection
  • Data delivery automation
  • Directory cleanup utilities

Pipeline Integration (asf_tools.nextflow)

Pipeline Generators:

  • gen_ont_demux_run.py: ONT demultiplexing pipeline setup
  • gen_illumina_demux_run.py: Illumina demultiplexing pipeline setup
  • gen_viral_genomics_run.py: Viral genomics samplesheet generation

Features:

  • SLURM batch script generation (sketched after this list)
  • Nextflow parameter management
  • Container cache handling
  • Module loading for HPC environments
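
A minimal sketch of how a SLURM header might be templated, as referenced above. The directives, default values, and module name are placeholders, not the generator's real output:

# All directive values below are illustrative placeholders
SLURM_HEADER = """\
#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --time={time_limit}
#SBATCH --mem={memory}

module load Nextflow/{nextflow_version}
"""

def render_header(job_name: str, nextflow_version: str) -> str:
    # Defaults here are illustrative, not the generator's real ones
    return SLURM_HEADER.format(
        job_name=job_name,
        time_limit="24:00:00",
        memory="8G",
        nextflow_version=nextflow_version,
    )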

Infrastructure (asf_tools.slurm, asf_tools.ssh)

SLURM Integration:

  • Job status monitoring
  • Queue management
  • Resource allocation

SSH Operations (sketched below):

  • Remote file operations on Nemo
  • Secure file transfer
  • Remote command execution
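
A minimal sketch of a remote SLURM status check over SSH, using paramiko as an illustrative client library; whether the ssh module actually uses paramiko, and the host and user names below, are assumptions:

import paramiko

# Host and username are placeholders; real connection details come from config
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("nemo.example.org", username="asf")

# e.g. check SLURM queue state remotely
_, stdout, _ = client.exec_command("squeue --me")
print(stdout.read().decode())
client.close()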

Data Flow

  1. Raw Data Ingestion: Monitor sequencing instrument output
  2. Metadata Retrieval: Query Clarity LIMS for sample information
  3. Pipeline Setup: Generate Nextflow configurations and SLURM scripts
  4. Execution Monitoring: Track pipeline progress and job status
  5. Data Delivery: Create symlinks and deliver results to researchers
  6. Cleanup: Archive and clean up temporary files

Configuration Management

Configuration is managed through:

  • Environment variables
  • TOML configuration files (asf_tools.config.toml_loader; loading sketched below)
  • Command-line arguments with Click
  • Container environment setup
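
TOML loading follows the standard-library pattern. A minimal sketch, assuming a file named config.toml; the keys are examples, not the tool's actual schema:

import tomllib  # standard library on Python 3.11+

with open("config.toml", "rb") as handle:  # tomllib requires binary mode
    config = tomllib.load(handle)

# Example keys only; the real schema lives in asf_tools.config
runs_dir = config["paths"]["runs_dir"]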

Testing

Test Organization

Tests are organized in a flat structure mirroring the source code:

tests/
├── test_api_clarity_lims.py           # API functionality
├── test_io_data_management.py         # Data management
├── test_nextflow_gen_ont_demux_run.py # Pipeline generation
└── ...

Testing Guidelines

Test Structure:

class TestModuleName:
    def setup_method(self):
        # Create shared fixtures before each test
        ...

    def test_specific_functionality(self):
        # Arrange inputs, exercise the code, then assert using assert_that()
        ...

Assertions: Use assertpy for readable assertions:

from assertpy import assert_that

assert_that(result).is_equal_to(expected)
assert_that(file_path).exists()
assert_that(response.status_code).is_equal_to(200)

Mocking:

import pytest
from assertpy import assert_that

# Use pytest fixtures and mocks; ClarityHelperLimsMock and process_data
# are illustrative stand-ins for your own test doubles and code under test
@pytest.fixture
def mock_api():
    return ClarityHelperLimsMock()

def test_with_mock(mock_api):
    result = process_data(mock_api)
    assert_that(result).is_not_none()

Running Tests

# All tests
pytest

# Specific module
pytest tests/test_io_data_management.py

# With coverage
pytest --cov=asf_tools --cov-report=html

# Verbose output
pytest -v

# Stop on first failure
pytest -x

Project Structure

asf-tools/
├── asf_tools/              # Main source code
│   ├── __main__.py         # CLI entry point
│   ├── api/                # External integrations
│   ├── config/             # Configuration
│   ├── database/           # Database operations
│   ├── illumina/           # Illumina processing
│   ├── io/                 # I/O operations
│   ├── nextflow/           # Pipeline generation
│   ├── slack/              # Notifications
│   ├── slurm/              # HPC integration
│   └── ssh/                # Remote operations
├── tests/                  # Test suite (flat structure)
├── docs/                   # Documentation
├── output/                 # Runtime output (gitignored)
├── pyproject.toml          # Project configuration
├── justfile                # Task automation
├── Dockerfile              # Container build
├── pytest.ini              # Test configuration
├── uv.lock                 # Dependency lock file
└── README.md               # This file

Key Files

  • pyproject.toml: Project metadata, dependencies, and tool configuration
  • justfile: Development task automation (replaces Makefile)
  • asf_tools/__main__.py: CLI entry point with Click framework
  • tests/: Comprehensive test suite with pytest
  • Dockerfile: Production container image

License

See LICENSE for details.

Contact

Issues & Contributions: Please use the GitHub repository for bug reports, feature requests, and contributions.
