A Python wrapper for GNU parallel that naturally embeds shell code into Python scripts or Jupyter notebooks (no delimiter hell). Features parameter substitution, environment substitution, and cross-product generation to eliminate shell loops. Ideal for seamlessly integrating third-party Unix programs into Python environments for bioinformatics and data science pipelines.
Author: C. Bryan Daniels (quendor_at_nandor.net)
Before - a Python script depending upon the Jupyter magic !-command or the Python subprocess module to execute shell or Unix code
%%time
in_paths = ['dedup_MT', 'dedup_2', 'dedup_human']
out_paths = ['conv_unconv3n_MT', 'conv_unconv3n_2', 'conv_unconv3n_human']
samples = ['E1', 'E2', 'E3', 'Z1', 'Z2', 'Z3', 'U1', 'U2', 'U3']
threads = 6
for in_path, out_path in zip(in_paths, out_paths):
    ref = str(in_path).split('_')[-1]
    for sample in samples:
        !samtools view -e "rlen<100000" -h {fname(in_path,sample,'bam')} |\
        hisat-3n-table -p {threads} --unique-only --alignments - --ref {get_ref(ref,'fa')} --output-name /dev/stdout --base-change C,T |\
        bgzip -@ {threads} -c > {fname(out_path,sample,'tsv.gz')}
CPU times: user 1min 22s, sys: 20.5 s, total: 1min 43s
Wall time: 4h 33min 14s
After - parallel_zip using native shell code, portable between Jupyter and Python, with parallelism built in for speed
%%time
from parallel_zip import parallel_zip, Cross
parallel_zip(
"""
samtools view -e "rlen<100000" -h {in_path}/{sample}.bam | \
hisat-3n-table -p 6 --unique-only --alignments - --ref {get_ref(ref,'fa')} --output-name /dev/stdout --base-change C,T | \
bgzip -@ 6 -c > {out_path}/{sample}.tsv.gz
""",
in_path=['dedup_MT', 'dedup_2', 'dedup_human'],
out_path=['conv_unconv3n_MT', 'conv_unconv3n_2', 'conv_unconv3n_human'],
ref=['MT', '2', 'human'],
cross=Cross(sample=['E1', 'E2', 'E3', 'Z1', 'Z2', 'Z3', 'U1', 'U2', 'U3']))
CPU times: user 103 ms, sys: 12.4 ms, total: 116 ms
Wall time: 41min 27s
What is executed - using dry_run=True
parallel_zip(
"""
samtools view -e "rlen<100000" -h {in_path}/{sample}.bam | \
hisat-3n-table -p 6 --unique-only --alignments - --ref {get_ref(ref,'fa')} --output-name /dev/stdout --base-change C,T | \
bgzip -@ 6 -c > {out_path}/{sample}.tsv.gz
""",
in_path=['dedup_MT', 'dedup_2', 'dedup_human'],
out_path=['conv_unconv3n_MT', 'conv_unconv3n_2', 'conv_unconv3n_human'],
ref=['MT', '2', 'human'],
cross=Cross(sample=['E1', 'E2', 'E3', 'Z1', 'Z2', 'Z3', 'U1', 'U2', 'U3']),
dry_run=True)
samtools view -e "rlen<100000" -h dedup_MT/E1.bam | hisat-3n-table -p 6 --unique-only --alignments - --ref ../../reference/fasta/human.fa --output-name /dev/stdout --base-change C,T | bgzip -@ 6 -c > conv_unconv3n_MT/E1.tsv.gz
samtools view -e "rlen<100000" -h dedup_MT/E2.bam | hisat-3n-table -p 6 --unique-only --alignments - --ref ../../reference/fasta/human.fa --output-name /dev/stdout --base-change C,T | bgzip -@ 6 -c > conv_unconv3n_MT/E2.tsv.gz
samtools view -e "rlen<100000" -h dedup_MT/E3.bam | hisat-3n-table -p 6 --unique-only --alignments - --ref ../../reference/fasta/human.fa --output-name /dev/stdout --base-change C,T | bgzip -@ 6 -c > conv_unconv3n_MT/E3.tsv.gz
.
. . . [27 Total] . . .
.
samtools view -e "rlen<100000" -h dedup_human/U3.bam | hisat-3n-table -p 6 --unique-only --alignments - --ref ../../reference/fasta/human.fa --output-name /dev/stdout --base-change C,T | bgzip -@ 6 -c > conv_unconv3n_human/U3.tsv.gz
Quick shell commands - with the pz() function:
from parallel_zip import pz
# Get file sizes
pz("ls -la *.txt")
# Returns: ['total 48', '-rw-r--r-- 1 user staff 156 Jan 15 data1.txt', ...]
# Quick data inspection
pz("head -3 data.csv")
# Returns: ['id,name,value', '1,apple,3.5', '2,banana,2.7']
# Complex shell command with natural shell syntax (no delimiting hell)
pz("""ls -l {os.getcwd()} | cut -f1 -d' ' | sed s/--// | sed s/^-// | awk '$1 ~ /x/ {split($1, parts, "-"); print parts[1]}' """)
# Returns: ['drwxrwxr', 'drwxrwxr', 'drwxrwxr', 'drwxrwxr']
- Easy shell command execution from Python: Run shell commands with near-native syntax from within Python
- True Python integration: Pass Python expressions, variables, and environment directly into command templates
- Multi-line bash script support: Write complex workflows as readable templates with automatic line-by-line execution
- Built-in parallelism: Inherits GNU parallel's battle-tested performance, load balancing, and resource management
- One-liner shell commands: Execute any shell command and get clean results with pz("command")
- Automatic output formatting: Returns lists of lines by default, or raw strings when needed
- Useful when loops and passing variables are not required: Perfect for simple commands and quick data inspection
- No ceremony required: Direct shell execution without parameter setup or parallel processing overhead
- No more nested loops: Transform complex for-loop combinations into single, readable function calls
- Clean parameter substitution: Use intuitive {param} syntax instead of error-prone string concatenation
- Eliminate boilerplate: Skip tedious setup code for parallel execution and parameter management
- Automatic parameter combinations: Generate all combinations of multiple parameter sets
- No manual combinatorics: Skip the nested loop math - just specify what parameters to vary
- Broadcasting magic: Single values automatically expand across parameter lists when needed
- Dry run everything: See exactly what commands will execute before running them
- Flexible output control: Get results as strings, lists, or run silently as your workflow needs
- Interactive-friendly: Perfect for Jupyter notebooks, iterative analysis, and experimental workflows
- Immediate feedback: Quick pz() commands for one-off tasks and data exploration
- Shell quoting protection: Proper handling of $, quotes, and special characters that are complicated within Python
- Smart error handling: Choose strict mode for critical pipelines or continue-on-failure for exploratory work
- Python expression evaluation: Embed calculations and string manipulation directly in command templates
- vs. Raw GNU parallel: No complex option syntax, confusing escape sequences, or command-line juggling
- vs. Python multiprocessing: Purpose-built for shell commands with parameter substitution, not generic Python functions
- vs. Shell scripting: Get parameter combinations and parallel execution without variable explosion or syntax nightmares
Perfect for bioinformatics pipelines, data processing workflows, file operations, and any scenario where you need the power of shell tools with the convenience of Python.
GNU parallel is required and must be installed on your system:
Linux (Ubuntu/Debian)
sudo apt-get install parallel
Linux (CentOS/RHEL/Fedora)
sudo yum install parallel # CentOS/RHEL
sudo dnf install parallel # Fedora
macOS
brew install parallel
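To verify the installation:
parallel --version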
This is a single-file module. The simplest way to install:
# Clone and copy the module to your project
git clone https://github.com/prairie-guy/parallel_zip.git
cp parallel_zip/parallel_zip.py /path/to/your/project/
Optional: Install Locally with pip
git clone https://github.com/prairie-guy/parallel_zip.git
cd parallel_zip
pip install .
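As a quick smoke test after installing (this assumes GNU parallel is already on your PATH and uses only the documented pz() helper):
python -c "from parallel_zip import pz; print(pz('echo hello'))"
# Expected: ['hello']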
from parallel_zip import parallel_zip
# Process multiple files in parallel
parallel_zip("""wc -l sample_data/{file}""",
file=["data1.txt", "data2.txt", "data3.txt"],
verbose=True, lines=True)
# Returns:
['2 sample_data/data1.txt',
'14 sample_data/data2.txt',
'45 sample_data/data3.txt']
# Same files, different parameters - extract different numbers of lines
parallel_zip("""head -{num_lines} sample_data/{file}""",
file=["data1.txt", "data2.txt"],
num_lines=[2, 5],
verbose=True, lines=True)
# Returns:
['Sample data line one',
'This is line two',
'Line 1: Introduction',
'Line 2: Methods overview',
'Line 3: Data collection started',
'Line 4: Quality control passed',
'Line 5: Processing pipeline initialized ']
from parallel_zip import parallel_zip, Cross
# Test every file with every search pattern - 6 total combinations
parallel_zip("""grep -c '{pattern}' sample_data/{file}""",
file=["data1.txt", "data2.txt"],
cross=Cross(pattern=["line", "data", "Line"]),
verbose=True, lines=True)
# Returns: ['3', '1', '0', '1', '0', '15']
from parallel_zip import pz
# One-liner for immediate results
pz("ls sample_data")
# Returns:
['data1.txt',
'data2.txt',
'data3.txt',
'data.csv',
'numbers.txt',
'sample1.txt',
'sample2.txt',
'server_logs.txt']
pz("wc -w sample_data/*.txt")
# Returns:
[' 11 sample_data/data1.txt',
' 70 sample_data/data2.txt',
' 148 sample_data/data3.txt',
' 8 sample_data/numbers.txt',
' 16 sample_data/sample1.txt',
' 16 sample_data/sample2.txt',
' 31 sample_data/server_logs.txt',
' 300 total']
pz("pwd")
# Returns: ['/home/quendor/stuff/parallel_zip']
The heart of parallel_zip is intuitive parameter substitution using {parameter} syntax in command templates.
Warning: The following parameter names are reserved and should not be used as named parameters to parallel_zip: command, cross, verbose, lines, dry_run, strict, java_memory. This is a known issue that will be fixed in a future version.
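If a natural name collides with one of these - say you want a parameter called lines - the workaround is simply to pick a different name in both the template and the keyword. A minimal sketch (n_lines is an arbitrary stand-in):
parallel_zip("""head -{n_lines} sample_data/{file}""",
file="data1.txt", n_lines=3,
verbose=True, lines=True)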
# Single parameter
parallel_zip("""cat sample_data/{filename}""", filename="data1.txt", verbose=True, lines=True)
# Returns: ['Sample data line one', 'This is line two', 'Final line three']
# Multiple parameters
parallel_zip("""head -{num} sample_data/{file} | tail -{last}""",
file="data2.txt", num=10, last=3, verbose=True, lines=True)
# Returns:
['Line 8: Statistical analysis begun',
'Line 9: Results compilation phase',
'Line 10: Visualization generated']
# Parameters are zipped together (like Python's zip function)
parallel_zip("""echo 'File {file} has {line_count} lines'""",
file=["data1.txt", "data2.txt", "data3.txt"],
line_count=[3, 15, 42], verbose=True, lines=True)
# Returns:
['File data1.txt has 3 lines',
'File data2.txt has 15 lines',
'File data3.txt has 42 lines']
# Real file processing - multi-step workflow
parallel_zip("""
wc -w sample_data/{input} > sample_data/{output}
cat sample_data/{output}
rm sample_data/{output}
""",
input=["data1.txt", "data2.txt"],
output=["count1.txt", "count2.txt"],
verbose=True, lines=True)
# Returns: ['11 sample_data/data1.txt', '70 sample_data/data2.txt']
# Single values automatically broadcast to match list length
parallel_zip("""grep '{pattern}' sample_data/{file}""",
file=["data1.txt", "data2.txt", "server_logs.txt"],
pattern="line", dry_run=True)
# Returns:
["grep 'line' sample_data/data1.txt",
"grep 'line' sample_data/data2.txt",
"grep 'line' sample_data/server_logs.txt"]
# Embed Python expressions directly in commands
parallel_zip("""echo 'File {file} - uppercase: {file.upper()}'""",
file=["data1.txt", "sample1.txt"], verbose=True, lines=True)
# Returns:
['File data1.txt - uppercase: DATA1.TXT',
'File sample1.txt - uppercase: SAMPLE1.TXT']
# Mathematical operations
parallel_zip("""echo 'Number {num} doubled is {int(num) * 2}'""",
num=[5, 10, 15], verbose=True, lines=True)
# Returns:
['Number 5 doubled is 10',
'Number 10 doubled is 20',
'Number 15 doubled is 30']
# Access Python environment
import os
parallel_zip("""echo 'Working in {os.getcwd()}/sample_data'""", verbose=True, lines=True)
# Returns: ['Working in /home/quendor/stuff/parallel_zip/sample_data']
When you need literal curly braces in your output (for JSON, shell scripts, or code generation), use {{ }}
to prevent them from being interpreted as parameter placeholders.
# Generate valid JSON with literal braces
parallel_zip("""echo '{file}: {{"{key}": "{value}"}}' > {file}.json
cat {file}.json
rm {file}.json""",
file=["config", "settings", "params"],
key=["version", "mode", "level"],
value=["1.0", "production", "debug"],
verbose=True, lines=True)
# Returns:
['config: {"version": "1.0"}',
'settings: {"mode": "production"}',
'params: {"level": "debug"}']
# Generate shell scripts with literal braces for command grouping
parallel_zip("""echo 'if [ -f {file} ]; then {{ echo "Found {file}"; process_{action}; }}' > check_{file}.sh
echo '> cat' check_{file}.sh && cat check_{file}.sh
rm check_{file}.sh
""",
file=["data1.txt", "data2.txt"],
action=["validate", "transform"],
verbose=True, lines=True)
# Returns:
['> cat check_data1.txt.sh',
'if [ -f data1.txt ]; then { echo "Found data1.txt"; process_validate; }',
'> cat check_data2.txt.sh',
'if [ -f data2.txt ]; then { echo "Found data2.txt"; process_transform; }']
Key Points:
- {{ becomes { in the output
- }} becomes } in the output
- Parameters like {file} inside {{ }} are still substituted
- Perfect for generating JSON, shell scripts, or any text that needs literal braces
Cross products let you run every combination of parameters automatically. Use the Cross() helper for clean, readable syntax.
# Run every file with every pattern - 6 total combinations
parallel_zip("""grep -c '{pattern}' sample_data/{file}""",
file=["data1.txt", "data2.txt"],
cross=Cross(pattern=["line", "data", "INFO"]),
verbose=True, lines=True)
# Returns: ['3', '1', '0', '1', '0', '0']
# (line in data1, data in data1, INFO in data1, line in data2, data in data2, INFO in data2)
# Every file × every tool × every option = 8 combinations
parallel_zip("""echo 'Processing {file} with {tool} using {option}'""",
file=["data1.txt"],
cross=Cross(
tool=["grep", "awk"],
option=["fast", "thorough", "detailed", "quick"]
),
verbose=True, lines=True)
# Returns:
['Processing data1.txt with grep using fast',
'Processing data1.txt with grep using thorough',
'Processing data1.txt with grep using detailed',
'Processing data1.txt with grep using quick',
'Processing data1.txt with awk using fast',
'Processing data1.txt with awk using thorough',
'Processing data1.txt with awk using detailed',
'Processing data1.txt with awk using quick']
# Zipped parameters stay together, cross parameters expand
parallel_zip("""echo 'File {input} -> {output}, method: {method}, quality: {quality}'""",
input=["data1.txt", "data2.txt"], # Zipped together
output=["result1.txt", "result2.txt"], # Zipped together
cross=Cross(
method=["standard", "advanced"], # Cross product
quality=["low", "high"] # Cross product
),
verbose=True, lines=True)
# Returns: 2 file pairs × 2 methods × 2 qualities = 8 combinations
['File data1.txt -> result1.txt, method: standard, quality: low',
'File data1.txt -> result1.txt, method: standard, quality: high',
'File data1.txt -> result1.txt, method: advanced, quality: low',
'File data1.txt -> result1.txt, method: advanced, quality: high',
'File data2.txt -> result2.txt, method: standard, quality: low',
'File data2.txt -> result2.txt, method: standard, quality: high',
'File data2.txt -> result2.txt, method: advanced, quality: low',
'File data2.txt -> result2.txt, method: advanced, quality: high']
# These are equivalent - use Cross() for readability
# Cross() format (recommended)
parallel_zip("""echo '{tool} on {file}'""",
file=["data1.txt"],
cross=Cross(tool=["grep", "awk"]),
verbose=True, lines=True)
# Returns: ['grep on data1.txt', 'awk on data1.txt']
# List format (also works)
parallel_zip("""echo '{tool} on {file}'""",
file=["data1.txt"],
cross=[{"tool": ["grep", "awk"]}],
verbose=True, lines=True)
# Returns: ['grep on data1.txt', 'awk on data1.txt']
# Every combination across 3 parameter sets
parallel_zip("""echo 'Processing {file} with {tool} at {quality} quality and {mode} mode'""",
file=["data1.txt"],
cross=Cross(
tool=["grep", "awk"],
quality=["low", "high"],
mode=["fast", "thorough"]
),
dry_run=True)
# Returns: 8 combinations (2×2×2)
["echo 'Processing data1.txt with grep at low quality and fast mode'",
"echo 'Processing data1.txt with grep at low quality and thorough mode'",
"echo 'Processing data1.txt with grep at high quality and fast mode'",
"echo 'Processing data1.txt with grep at high quality and thorough mode'",
"echo 'Processing data1.txt with awk at low quality and fast mode'",
"echo 'Processing data1.txt with awk at low quality and thorough mode'",
"echo 'Processing data1.txt with awk at high quality and fast mode'",
"echo 'Processing data1.txt with awk at high quality and thorough mode'"]
When you need quick shell commands without parameter substitution or parallelization, pz() is your friend. Perfect for data exploration, one-off commands, and simple shell operations.
from parallel_zip import pz
# Simple file listing
pz("ls sample_data")
# Returns:
['data1.txt', 'data2.txt', 'data3.txt', 'data.csv', 'numbers.txt', 'sample1.txt', 'sample2.txt', 'server_logs.txt']
# Get current directory
pz("pwd")
# Returns: ['/home/quendor/stuff/parallel_zip']
# Quick file inspection
pz("head -3 sample_data/data2.txt")
# Returns:
['Line 1: Introduction', 'Line 2: Methods overview', 'Line 3: Data collection started']
# Extract specific fields
pz("""awk '{print $1, $3}' sample_data/sample1.txt""")
# Returns:
['product_1 299.99', 'product_2 19.95', 'product_3 149.50', 'product_4 79.99']
# Count and sum - use single quotes to protect $ from shell expansion
pz("""awk '{sum += $3; count++} END {print "Total items:", count, "Sum:", sum}' sample_data/sample1.txt""")
# Returns: ['Total items: 4 Sum: 549.43']
# Pattern matching with AWK
pz("""awk '/product_[13]/ {print $2, $4}' sample_data/sample1.txt""")
# Returns: ['electronics in_stock', 'electronics in_stock']
# Complex shell pipeline in one line
pz("""sort sample_data/numbers.txt | head -3""")
# Returns: ['10', '18', '2']
# Multiple command chaining
pz("""cat sample_data/numbers.txt | sort -n | tail -3""")
# Returns: ['29', '33', '41']
# String manipulation
pz("""echo 'hello world' | tr 'a-z' 'A-Z'""")
# Returns: ['HELLO WORLD']
# Access Python environment in shell commands
import os
pz("""ls -la {os.getcwd()}/sample_data | head -5""")
# Returns:
['total 40', 'drwxrwxr-x 2 quendor quendor 4096 Jun 26 15:37 .', 'drwxrwxr-x 8 quendor quendor 4096 Jun 26 14:54 ..', '-rw-r--r-- 1 quendor quendor 54 Jun 26 14:55 data1.txt', '-rw-r--r-- 1 quendor quendor 501 Jun 26 14:55 data2.txt']
# Mathematical calculations
pz("""echo 'Result: {2 + 3 * 4}'""")
# Returns: ['Result: 14']
# Get output as list of lines (default)
result_lines = pz("""cat sample_data/data1.txt""")
# Returns: ['Sample data line one', 'This is line two', 'Final line three']
# Get output as single string
result_string = pz("""cat sample_data/data1.txt""", lines=False)
# Returns: 'Sample data line one\nThis is line two\nFinal line three'
# CSV analysis with AWK
pz("""awk -F',' 'NR>1 {print $2, $3}' sample_data/data.csv""")
# Returns:
['apple 3.5', 'banana 2.7', 'carrot 1.8', 'cherry 4.2', 'potato 2.1']
# Log file analysis
pz("""awk '/ERROR/ {print "ERROR at", $2 ":", substr($0, index($0,$4))}' sample_data/server_logs.txt""")
# Returns: ['ERROR at 09:00:32: File not found: config.xml']
# Field counting and statistics
pz("""awk '{print NF, $0}' sample_data/sample2.txt""")
# Returns:
['4 order_1001 2024-01-15 customer_a 450.00', '4 order_1002 2024-01-15 customer_b 125.75', '4 order_1003 2024-01-16 customer_c 299.99', '4 order_1004 2024-01-16 customer_a 89.50']
# Use pz() for simple, one-off commands
file_count = pz("""ls sample_data | wc -l""")
disk_usage = pz("""du -sh sample_data""")
current_user = pz("""whoami""")
# Use parallel_zip() for parameter substitution and parallelization
parallel_zip("""wc -l sample_data/{file}""",
file=["data1.txt", "data2.txt", "data3.txt"])
Key Points:
- Use pz() when you don't need parameter substitution or parallel execution
- Perfect for quick data exploration and shell command testing
- Automatically returns clean output as a list of lines
- Supports Python expressions for dynamic shell commands
- Great for AWK, sed, grep, and other text processing tools
Move beyond basic examples to see how parallel_zip handles more complex processing tasks.
# Check which files exist before processing
parallel_zip("""test -f sample_data/{file} && echo '{file} exists' || echo '{file} missing'""",
file=["data1.txt", "missing.txt", "data2.txt"],
verbose=True, lines=True)
# Returns:
['data1.txt exists', 'missing.txt missing', 'data2.txt exists']
# Process only existing files with error handling
parallel_zip("""[ -f sample_data/{file} ] && wc -l sample_data/{file} || echo 'SKIP: {file}'""",
file=["data1.txt", "missing.txt", "data3.txt"],
verbose=True, lines=True)
# Returns:
['2 sample_data/data1.txt', 'SKIP: missing.txt', '45 sample_data/data3.txt']
# Get file statistics
parallel_zip("""echo '{file}:' && wc -l sample_data/{file} && wc -c sample_data/{file}""",
file=["data1.txt", "data2.txt"],
verbose=True, lines=True)
# Returns:
['data1.txt:',
'3 sample_data/data1.txt',
'55 sample_data/data1.txt',
'data2.txt:',
'15 sample_data/data2.txt',
'502 sample_data/data2.txt']
# Find patterns across multiple files
parallel_zip("""echo 'File = {file}:' && grep -n '{pattern}' sample_data/{file} || echo 'No matches'""",
file=["data1.txt", "data2.txt", "server_logs.txt"],
cross=Cross(pattern=["line", "ERROR", "data"]),
verbose=True, lines=True)
# Returns:
['File = data1.txt:',
'1:Sample data line one',
'2:This is line two',
'3:Final line three',
'File = data1.txt:',
'No matches',
'File = data1.txt:',
'1:Sample data line one',
'File = data2.txt:',
'5:Line 5: Processing pipeline initialized ',
'File = data2.txt:',
'No matches',
'File = data2.txt:',
'No matches',
'File = server_logs.txt:',
'No matches',
'File = server_logs.txt:',
'3:2024-01-15 09:00:32 ERROR File not found: config.xml',
'File = server_logs.txt:',
'No matches']
# Extract and reformat CSV data
pz("""awk -F',' 'NR>1 {printf "%-10s %8.2f %s\\n", $2, $3, $4}' sample_data/data.csv""")
# Returns:
['apple 3.50 fruit',
'banana 2.70 fruit',
'carrot 1.80 vegetable',
'cherry 4.20 fruit',
'potato 2.10 vegetable']
# Calculate statistics across categories
pz("""awk -F',' 'NR>1 {sum[$4] += $3; count[$4]++} END {for(cat in sum) printf "%s: avg=%.2f (n=%d)\\n", cat, sum[cat]/count[cat], count[cat]}' sample_data/data.csv""")
# Returns:
['vegetable: avg=1.95 (n=2)', 'fruit: avg=3.47 (n=3)']
# Analyze server logs by time patterns
logs = pz("""awk '{print $2}' sample_data/server_logs.txt | sort | uniq -c""")
[l.strip() for l in logs]
# Returns:
['1 09:00:01', '1 09:00:15', '1 09:00:32', '1 09:01:05', '1 09:01:45']
# Extract errors with context
pz("""awk '/ERROR/ {print "ERROR at", $2 ":", substr($0, index($0,$4))}' sample_data/server_logs.txt""")
# Returns: ['ERROR at 09:00:32: File not found: config.xml']
# Generate summary report with single-line AWK
pz("""awk '{level = $3; count[level]++; if (level == "ERROR") errors[NR] = $0} END {print "=== LOG SUMMARY ==="; for (l in count) print l ": " count[l]; print "\\n=== ERROR DETAILS ==="; for (e in errors) print errors[e]}' sample_data/server_logs.txt""")
# Returns:
['=== LOG SUMMARY ===',
'WARNING: 1',
'ERROR: 1',
'INFO: 3',
'',
'=== ERROR DETAILS ===',
'2024-01-15 09:00:32 ERROR File not found: config.xml']
# Process different file types with appropriate tools
parallel_zip("""echo 'Processing = {file}:' && {processor} sample_data/{file}""",
file=["data.csv", "numbers.txt", "server_logs.txt"],
processor=["awk -F',' 'NR>1 {print $2, $3}'",
"sort -n",
"grep ERROR"],
verbose=True, lines=True)
# Returns:
['Processing = data.csv:',
'apple 3.5',
'banana 2.7',
'carrot 1.8',
'cherry 4.2',
'potato 2.1',
'Processing = numbers.txt:',
'2',
'7',
'10',
'18',
'25',
'29',
'33',
'41',
'Processing = server_logs.txt:',
'2024-01-15 09:00:32 ERROR File not found: config.xml']
# Cross-reference data between files
parallel_zip("""
echo "=== Analysis of {file} ==="
wc -l sample_data/{file}
echo "Top 3 lines:"
head -3 sample_data/{file}
echo "Contains 'data': $(grep -c data sample_data/{file} || echo 0)"
""",
file=["data1.txt", "data2.txt", "sample1.txt"],
verbose=True, lines=True)
# Returns:
['=== Analysis of data1.txt ===',
'3 sample_data/data1.txt',
'Top 3 lines:',
'Sample data line one',
'This is line two',
'Final line three',
"Contains 'data': 1",
'=== Analysis of data2.txt ===',
'15 sample_data/data2.txt',
'Top 3 lines:',
'Line 1: Introduction',
'Line 2: Methods overview',
'Line 3: Data collection started',
"Contains 'data': 0",
'0',
'=== Analysis of sample1.txt ===',
'4 sample_data/sample1.txt',
'Top 3 lines:',
'product_1 electronics 299.99 in_stock',
'product_2 books 19.95 out_of_stock',
'product_3 electronics 149.50 in_stock',
"Contains 'data': 0",
'0']
# Validate data quality across files
parallel_zip("""echo 'Quality check for {file}:' && awk 'END {print "Lines:", NR, "Fields per line:", NF}' sample_data/{file}""",
file=["data.csv", "sample1.txt", "sample2.txt"],
verbose=True, lines=True)
# Returns:
['Quality check for data.csv:',
'Lines: 6 Fields per line: 1',
'Quality check for sample1.txt:',
'Lines: 4 Fields per line: 4',
'Quality check for sample2.txt:',
'Lines: 4 Fields per line: 4']
# Compare processing approaches
parallel_zip("""echo '=== cat {file} | {filter.split(' ')[0]} | {extract.split(' ')[0]} ===' && cat sample_data/{file} | {filter} | {extract}""",
file=["data2.txt"],
cross=Cross(
filter=["head -3", "tail -2", "grep started"],
extract=["cut -d' ' -f2", "sed 's/^[^ ]* [^ ]* //'", "awk '{print $1, $2}'"]
),
verbose=True, lines=True)
# Returns:
['=== cat data2.txt | head | cut ===',
'1:',
'2:',
'3:',
'=== cat data2.txt | head | sed ===',
'Introduction',
'Methods overview',
'Data collection started',
'=== cat data2.txt | head | awk ===',
'Line 1:',
'Line 2:',
'Line 3:',
'=== cat data2.txt | tail | cut ===',
'14:',
'15:',
'=== cat data2.txt | tail | sed ===',
'Documentation updated',
'Analysis finished successfully',
'=== cat data2.txt | tail | awk ===',
'Line 14:',
'Line 15:',
'=== cat data2.txt | grep | cut ===',
'3:',
'13:',
'=== cat data2.txt | grep | sed ===',
'Data collection started',
'Report generation started',
'=== cat data2.txt | grep | awk ===',
'Line 3:',
'Line 13:']
Key Workflow Principles:
- AWK on single lines: Use semicolons (;) to separate AWK statements instead of newlines
- Self-contained commands: Each line in multi-line blocks must be a complete shell command
- Validation first: Check file existence and properties before processing
- Error handling: Use shell conditionals and || echo for graceful failure handling
- Quality control: Validate data structure and content before complex operations
- Performance awareness: Monitor command execution time and resource usage
Master the tools for previewing, controlling output, and handling errors in your parallel_zip workflows.
Always preview commands before running them, especially with cross products or complex parameter combinations.
# See exactly what commands will be generated
parallel_zip("""echo 'Processing {file} with {tool}'""",
file=["data1.txt", "data2.txt"],
cross=Cross(tool=["grep", "awk"]),
dry_run=True)
# Returns:
["echo 'Processing data1.txt with grep'",
"echo 'Processing data1.txt with awk'",
"echo 'Processing data2.txt with grep'",
"echo 'Processing data2.txt with awk'"]
# Debug complex find operations
parallel_zip("""find sample_data -name '*{pattern}*' -type f""",
cross=Cross(pattern=["data", "sample", "server"]),
dry_run=True)
# Returns:
["find sample_data -name '*data*' -type f",
"find sample_data -name '*sample*' -type f",
"find sample_data -name '*server*' -type f"]
Choose the right output format for your workflow needs.
# Run commands without capturing output (fastest)
parallel_zip("""echo 'This runs silently for {file}'""",
file=["data1.txt", "data2.txt"],
verbose=False)
# Returns: None
# Get output as single string (good for single commands)
parallel_zip("""echo 'Line 1 for {file}'; echo 'Line 2 for {file}'""",
file=["data1.txt"],
verbose=True, lines=False)
# Returns: 'Line 1 for data1.txt\nLine 2 for data1.txt\n'
# Get output as list of lines (best for processing)
parallel_zip("""echo 'Line 1 for {file}'; echo 'Line 2 for {file}'""",
file=["data1.txt"],
verbose=True, lines=True)
# Returns: ['Line 1 for data1.txt', 'Line 2 for data1.txt']
Control how parallel_zip responds to command failures.
# Handle expected failures gracefully
parallel_zip("""grep 'nonexistent_pattern' sample_data/{file} || echo 'No match in {file}'""",
file=["data1.txt", "data2.txt"],
strict=False, verbose=True, lines=True)
# Returns: ['No match in data1.txt', 'No match in data2.txt']
# Commands that naturally return non-zero (like grep with no matches)
parallel_zip("""grep 'NOTFOUND' sample_data/{file}""",
file=["data1.txt", "data2.txt"],
strict=False, verbose=True, lines=True)
# Returns: [] # Empty results, but no error
# Stop processing if any command fails
parallel_zip("""test -f sample_data/{file} && echo '{file} exists'""",
file=["data1.txt", "missing.txt", "data2.txt"],
strict=True, verbose=True, lines=True)
# Output:
# parallel_zip: error with return code 1
# Error details:
# test -f sample_data/data1.txt && echo 'data1.txt exists'
# test -f sample_data/missing.txt && echo 'missing.txt exists'
# test -f sample_data/data2.txt && echo 'data2.txt exists'
# Returns: None
# Handle mixed scenarios with proper shell logic
parallel_zip("""echo 'Testing {file}' && test -f sample_data/{file} && echo 'EXISTS' || echo 'MISSING'""",
file=["data1.txt", "missing.txt", "data2.txt"],
strict=False, verbose=True, lines=True)
# Returns:
['Testing data1.txt', 'EXISTS', 'Testing missing.txt', 'MISSING', 'Testing data2.txt', 'EXISTS']
# Step 1: Preview commands
commands = parallel_zip("""find sample_data -name '*{pattern}*' -type f""",
cross=Cross(pattern=["data", "sample", "server"]),
dry_run=True)
print("Will execute:", commands)
# Step 2: Execute after verification
result = parallel_zip("""find sample_data -name '*{pattern}*' -type f""",
cross=Cross(pattern=["data", "sample", "server"]),
verbose=True, lines=True)
# Returns:
['sample_data/data3.txt', 'sample_data/data1.txt', 'sample_data/data.csv', 'sample_data/data2.txt', 'sample_data/sample1.txt', 'sample_data/sample2.txt', 'sample_data/server_logs.txt']
# Add echo statements to track pipeline progress
parallel_zip("""echo 'Starting {operation}' && {cmd} sample_data/{file} && echo 'Completed {operation}'""",
file=["data1.txt", "numbers.txt"],
operation=["count_lines", "sort_content"],
cmd=["wc -l", "sort"],
verbose=True, lines=True)
# Returns:
['Starting count_lines', '2 sample_data/data1.txt', 'Completed count_lines', 'Starting sort_content', '10', '18', '2', '25', '29', '33', '41', '7', 'Completed sort_content']
# Debug what parameters will be substituted
parallel_zip("""echo 'File: {file}, Pattern: {pattern}, Command will be: grep {pattern} sample_data/{file}'""",
file=["data1.txt"],
cross=Cross(pattern=["line", "data"]),
dry_run=True)
# Returns:
["echo 'File: data1.txt, Pattern: line, Command will be: grep line sample_data/data1.txt'",
"echo 'File: data1.txt, Pattern: data, Command will be: grep data sample_data/data1.txt'"]
# pz() also supports strict mode
pz("""grep 'line' sample_data/data1.txt""", strict=False)
# Returns: ['Sample data line one', 'This is line two', 'Final line three']
pz("""grep 'NOTFOUND' sample_data/data1.txt""", strict=False)
# Returns: [] # No error, just empty results
- Always dry run first for complex parameter combinations
- Use verbose mode during development and testing
- Choose strict=False for exploratory work (grep, find, test commands)
- Choose strict=True for critical pipelines where failures should stop processing
- Add echo statements to track progress in multi-step workflows
- Test with small datasets first before scaling up
- Use shell error handling (|| echo "fallback") for expected failures
Common Command Exit Codes:
- grep: Returns 1 when no matches are found (not an error)
- test / [: Returns 1 when the condition is false (not an error)
- find: Returns 0 even when no files are found
- diff: Returns 1 when files differ (not an error)
Push parallel_zip to its limits with complex workflows, advanced parameter handling, and sophisticated command patterns.
Break complex workflows into readable, multi-step processes.
# Complex multi-step workflow
parallel_zip("""
echo "=== Processing {file} ==="
cp sample_data/{file} temp_{file}
wc -l temp_{file}
rm temp_{file}
echo "=== Completed {file} ==="
""",
file=["data1.txt", "data2.txt"],
verbose=True, lines=True)
# Returns:
['=== Processing data1.txt ===',
'3 temp_data1.txt',
'=== Completed data1.txt ===',
'=== Processing data2.txt ===',
'15 temp_data2.txt',
'=== Completed data2.txt ===']
Embed sophisticated Python logic directly in command templates.
# Advanced string operations
parallel_zip("""echo 'File: {file}, File Name Length: {len(file)}, Extension: {file.split(".")[-1]}'""",
file=["data1.txt", "sample2.txt", "server_logs.txt"],
verbose=True, lines=True)
# Returns:
['File: data1.txt, File Name Length: 9, Extension: txt',
'File: sample2.txt, File Name Length: 11, Extension: txt',
'File: server_logs.txt, File Name Length: 15, Extension: txt']
# String manipulation showcase
parallel_zip("""echo 'Original: {name}, Upper: {name.upper()}, Reversed: {name[::-1]}, First 3: {name[:3]}'""",
name=["alice", "bob", "charlie"],
verbose=True, lines=True)
# Returns:
['Original: alice, Upper: ALICE, Reversed: ecila, First 3: ali',
'Original: bob, Upper: BOB, Reversed: bob, First 3: bob',
'Original: charlie, Upper: CHARLIE, Reversed: eilrahc, First 3: cha']
# Complex calculations (note: {int(num)**2 - 1} is n squared minus one, not a factorial)
parallel_zip("""echo 'Number {num}: squared={int(num)**2}, squared minus one={int(num)**2 - 1}'""",
num=[3, 4, 5],
verbose=True, lines=True)
# Returns:
['Number 3: squared=9, squared minus one=8',
'Number 4: squared=16, squared minus one=15',
'Number 5: squared=25, squared minus one=24']
# Access Python environment and modules
from parallel_zip import parallel_zip
import os
import datetime
user = os.environ.get('USER', 'unknown')
parallel_zip("""echo 'User {user} processing {file} at {datetime.datetime.now().strftime("%H:%M:%S")}'""",
file=["data1.txt", "data2.txt"],
user=user,
verbose=True, lines=True)
# Returns:
['User quendor processing data1.txt at 16:16:45',
'User quendor processing data2.txt at 16:16:45']
Handle complex shell syntax safely and correctly.
# Use single quotes to protect AWK syntax - no double braces needed
parallel_zip("""awk '{print $1, $3}' sample_data/{file}""",
file=["sample1.txt", "sample2.txt"],
verbose=True, lines=True)
# Returns:
['product_1 299.99', 'product_2 19.95', 'product_3 149.50', 'product_4 79.99', 'order_1001 customer_a', 'order_1002 customer_b', 'order_1003 customer_c', 'order_1004 customer_a']
# CRITICAL: Use single quotes to protect $ from shell expansion
pz("""awk '{sum += $3; count++} END {print "Total items:", count, "Sum:", sum}' sample_data/sample1.txt""")
# Returns: ['Total items: 4 Sum: 549.43']
# WRONG - $ gets expanded by shell before AWK sees it:
# pz('awk "{sum += $3; count++} END {print \"Total:\", sum}" file.txt')
# Complex regex patterns with proper escaping
parallel_zip("""grep -E '{pattern}' sample_data/{file} || echo 'No matches'""",
file=["data1.txt", "server_logs.txt"],
cross=Cross(pattern=["^[A-Z]", "[0-9]+", "line$"]),
verbose=True, lines=True)
# Returns:
['Sample data line one',
'This is line two',
'Final line three',
'No matches',
'No matches',
'No matches',
'2024-01-15 09:00:01 INFO User login successful',
'2024-01-15 09:00:15 WARNING Database connection slow ',
'2024-01-15 09:00:32 ERROR File not found: config.xml',
'2024-01-15 09:01:05 INFO Backup process started',
'2024-01-15 09:01:45 INFO Backup process completed',
'No matches']
Advanced Best Practices:
- Multi-line commands: Each line must be self-contained; use && for dependencies
- AWK and $: Always use single quotes: awk '{print $1}' not awk "{print $1}"
- Parameter substitution: Single quotes protect AWK syntax while allowing {file} substitution
- Python expressions: Can access any Python module imported in the calling scope
- Cross products: Scale exponentially; use dry_run=True to verify combinations
- Shell variables: Use shell assignment and conditionals within single-line commands
- Error handling: Combine with || echo "fallback" for robust pipelines
Avoid common pitfalls and optimize your parallel_zip workflows with these proven patterns and solutions.
# WRONG - Using reserved parameter names
parallel_zip("""echo '{command}'""",
command="test", # 'command' is reserved!
dry_run=True)
# This will cause unexpected behavior
# CORRECT - Use different parameter names
parallel_zip("""echo '{cmd}'""",
cmd="test",
dry_run=True)
Reserved parameters: command, cross, verbose, lines, dry_run, strict, java_memory
# WRONG - Lists of different lengths
try:
parallel_zip("""echo '{input} -> {output}'""",
input=["a", "b", "c"], # Length 3
output=["x", "y"], # Length 2 - will fail
dry_run=True)
except ValueError as e:
print("Error:", e)
# Error: All named parameters must have the same length or be single values for broadcasting
# CORRECT - Use broadcasting or matching lengths
parallel_zip("""echo '{input} -> {output}'""",
input=["a", "b", "c"], # Length 3
output="default", # Single value broadcasts
dry_run=True)
# WRONG - Double quotes let shell expand $NF (becomes empty)
pz("""echo "Field count: $NF" """)
# Returns: ['Field count: ']
# CORRECT - Single quotes protect $ from shell expansion
pz("""echo 'Field count: $NF'""")
# Returns: ['Field count: $NF']
- pz(): Simple commands, one-off operations, data exploration
- parallel_zip(): Parameter substitution, multiple files, cross products
- Raw shell: When neither adds value
- Zipped parameters: Related values that should stay together
- Broadcast parameters: Common values applied to all iterations
- Cross parameters: Every combination needed for testing/analysis
- Each line self-contained: No multi-line shell constructs
- Use semicolons: Separate shell statements on the same line
- Single quotes for AWK/sed: Protect $ and special characters
- Error handling: Use || echo "fallback" for expected failures
- Start with dry_run=True to see generated commands
- Test with small datasets before scaling up
- Use verbose=True during development
- Check parameter list lengths match or use broadcasting
- Verify shell quoting for $ and special characters
- Test error conditions with missing files/tools
- Monitor cross product sizes to avoid command explosion
- Parallel execution scales with CPU cores - more cores = better performance
- I/O-bound tasks may not see linear speedup
- Network operations have natural bottlenecks
- Memory usage scales with output capturing - use verbose=False for large outputs
- Cross products grow exponentially - 10×10×10 = 1000 commands! (see the sketch below)
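Because dry_run=True returns the generated commands as a list, you can check the size of a cross product cheaply before committing to it. A minimal sketch (the parameter values are illustrative):
from parallel_zip import parallel_zip, Cross

# Count commands before executing a large cross product
cmds = parallel_zip("""echo '{tool} {mode} {level}'""",
cross=Cross(tool=["grep", "awk"],
mode=["fast", "slow"],
level=["1", "2", "3"]),
dry_run=True)
print(len(cmds))  # 2 x 2 x 3 = 12 commands; run for real only if this is acceptable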
Rule of thumb: If you find yourself writing complex shell logic or deeply nested parameters, consider breaking the problem into simpler parallel_zip calls or using traditional Python with subprocess for that portion.
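If you do reach for subprocess, a minimal sketch of what "that portion" might look like (the file names are illustrative; capture_output requires Python 3.7+):
import subprocess

for sample in ["data1.txt", "missing.txt"]:
    # Full Python control flow that would be awkward in a command template
    result = subprocess.run(["wc", "-l", f"sample_data/{sample}"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(f"skipping {sample}: {result.stderr.strip()}")
    else:
        print(result.stdout.strip())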
- Python 3.6+ (tested up to 3.12)
- Standard library only (no external Python packages required)
- GNU parallel (required)
- Standard Unix tools (bash, etc.)
- command: Command template string with {parameter} placeholders
- dry_run: If True, return a list of commands instead of executing them
- verbose: If True, print commands as they're executed
- cross: List of dictionaries for cross-product parameter generation
Note: As of v1.1.0, commands without parameters are supported. Running parallel_zip("ls", verbose=True) will execute the command once.
- Cross(**kwargs): Create a cross-product parameter structure
- pz(command, lines=True): Quick shell command execution
- zipper(): Lower-level interface for more control
- parse_command(): Parse command templates
- execute_command(): Execute individual commands
parallel_zip inherits GNU parallel's robust error handling:
- Failed commands don't stop the entire job
- Exit codes are preserved
- stderr/stdout are captured separately
- Partial results are available even if some jobs fail
This project is licensed under the GNU General Public License v3.0 (GPL-3.0), consistent with GNU parallel which this tool depends upon and extends.
When using parallel_zip for academic publications, please cite GNU parallel as requested:
O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014
You can also get the citation notice by running:
parallel --citation
To run the test suite, first install ward:
pip install ward
Then run tests using:
# Run all tests
ward
# Run specific test file
ward --path test_parallel_zip.py
ward --path test_pz.py
Future tests will follow the naming convention test_<module>.py.
Q: Why not just use GNU parallel directly?
A: While GNU parallel is excellent, parallel_zip provides a more intuitive Python interface with parameter substitution, cross-products, and integration with Jupyter notebooks. It's designed specifically for the iterative, experimental nature of bioinformatics workflows.
Q: Can I use this outside of bioinformatics?
A: Absolutely! While developed for bioinformatics, parallel_zip works with any shell commands and is useful for data processing, file manipulation, or any scenario where you need to run commands with varying parameters.
Q: Does this work on Windows?
A: parallel_zip requires GNU parallel, which is primarily designed for Unix-like systems. It may work under WSL (Windows Subsystem for Linux) or Cygwin, but this is not officially supported.
Q: How does this compare to Python's multiprocessing?
A: parallel_zip is designed for shell command execution and provides higher-level abstractions for parameter handling. Use Python's multiprocessing for pure Python code parallelization.
Q: Can I control the number of parallel jobs?
A: Yes! You can set GNU parallel options by setting the PARALLEL environment variable or by modifying the underlying command. Full control over GNU parallel options will be added in a future version.
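For example, GNU parallel reads default options from the PARALLEL environment variable, so a sketch like the following should cap concurrency at 4 jobs (-j is GNU parallel's job-count flag; the exact effect depends on how your environment is inherited):
import os
from parallel_zip import parallel_zip

os.environ["PARALLEL"] = "-j 4"  # GNU parallel picks up default options from $PARALLEL
parallel_zip("""wc -l sample_data/{file}""",
file=["data1.txt", "data2.txt", "data3.txt"])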
Q: When should I use pz() vs parallel_zip()?
A: Use pz() for simple commands or when you need the output for further Python processing. Use parallel_zip() when you need parallelization, parameter substitution across multiple values, or cross-products.
Q: How do I handle AWK commands with $ variables?
A: Always use single quotes around AWK commands: pz("awk '{print $1}' file.txt"). Double quotes allow shell expansion of $ variables.
Q: My cross product is generating too many commands, what should I do?
A: Cross products multiply: 10×10×10 = 1000 commands! Use dry_run=True first to see how many will be generated. Consider breaking large cross products into smaller chunks or using broadcasting for some parameters.
- Changed quick shell function from sh() to pz() for clarity
- Added dedicated test suite for the pz() function (pz_test.sh)
- All previous v1.1.0 changes included
- Added pz() function for simple shell command execution
- Fixed: Commands without parameters now work (e.g., parallel_zip("ls", verbose=True))
- Removed misleading "warning" messages in verbose mode
- Added comprehensive documentation about shell quoting with $
- Enhanced test suite with edge cases
- Initial release
- Core parameter substitution and cross-product functionality
- GNU parallel integration
- Comprehensive test suite
- Cross helper function
- PyPI package release
- Conda-forge package
- Direct GNU parallel options control
- Progress bars and job monitoring
- Integration with Slurm and other job schedulers
- Advanced error handling and retry mechanisms
Developed with ❤️ for the bioinformatics community