cudaq: A GPU-Aware Distributed Job Dispatcher

A Python-based job dispatcher that assigns compute tasks to GPUs based on real-time free memory, with persistent job tracking and state recovery after crashes or restarts.

🚀 Features

🎯 GPU-aware job scheduling based on free memory
📝 Persistent job tracking using JSONL files
🔄 Automatic state recovery after crashes/restarts
📊 Real-time job status monitoring
📁 Per-job log files with stdout/stderr capture
⚙️ Configurable thresholds and GPU selection

📦 Installation

Install prerequisites:
```
pip install pynvml pyyaml
```
Clone/download the dispatcher script:
```
git clone [your-repo-url-here]
```

⚡ Quick Start

Create a commands file (commands.txt):

```
python train.py --batch-size 128
python inference.py --input-dir ./data
```

Run the dispatcher:

```
python cudaq.py run --commands-file commands.txt --min-free-mem-mb 8000
```
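To make the `--min-free-mem-mb` threshold concrete, here is a minimal sketch of how a GPU could be chosen from free-memory readings. It is not taken from `cudaq.py`: the function names `query_free_mem_mb` and `pick_gpu` are hypothetical, and the "most free memory wins" policy is an assumption. The NVML calls are the standard `pynvml` bindings.

```python
from typing import Dict, Iterable, Optional


def query_free_mem_mb(gpu_ids: Iterable[int]) -> Dict[int, int]:
    """Return free memory per GPU index, read through NVML."""
    import pynvml  # imported lazily so pick_gpu() can be used without a GPU

    pynvml.nvmlInit()
    try:
        free = {}
        for gpu_id in gpu_ids:
            handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
            # NVML reports bytes; this sketch treats 1 MB as 1024 * 1024 bytes.
            free[gpu_id] = pynvml.nvmlDeviceGetMemoryInfo(handle).free // (1024 * 1024)
        return free
    finally:
        pynvml.nvmlShutdown()


def pick_gpu(free_mb: Dict[int, int], min_free_mem_mb: int) -> Optional[int]:
    """Pick the eligible GPU with the most free memory (assumed policy)."""
    eligible = {gpu: mem for gpu, mem in free_mb.items() if mem >= min_free_mem_mb}
    if not eligible:
        return None  # nothing meets the threshold, so the job stays pending
    return max(eligible, key=eligible.get)
```

With readings of `{0: 4000, 1: 12000, 2: 9000}` and a threshold of 8000, this sketch selects GPU 1. On a machine with GPUs, `pick_gpu(query_free_mem_mb([0, 1, 2]), 8000)` combines the two steps.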

Check status:
```
python cudaq.py status
```

⚙️ Configuration

Create config.yaml for persistent settings:

```
gpu_ids: [0, 1, 2]          # Which GPUs to use
min_free_mem_mb: 10000      # Minimum free memory required
poll_interval: 30           # Check interval in seconds
commands_file: jobs.txt     # Job commands source
log_dir: ./job_logs         # Log storage location
jobs_file: ./queue.jsonl    # Job tracking file
```
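For illustration, a config file like this could be loaded and merged over defaults roughly as follows. This is a sketch, not the dispatcher's actual loader: `DEFAULTS`, `merge_config`, and `load_config` are hypothetical names, and the default values are assumptions chosen to mirror the examples in this README.

```python
from typing import Any, Dict

# Illustrative defaults only; cudaq.py may use different values.
DEFAULTS: Dict[str, Any] = {
    "gpu_ids": [0],
    "min_free_mem_mb": 8000,
    "poll_interval": 30,
    "commands_file": "commands.txt",
    "log_dir": "./job_logs",
    "jobs_file": "./queue.jsonl",
}


def merge_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user settings on the defaults and reject obvious mistakes."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    config = {**DEFAULTS, **raw}
    if config["min_free_mem_mb"] < 0:
        raise ValueError("min_free_mem_mb must not be negative")
    if config["poll_interval"] <= 0:
        raise ValueError("poll_interval must be positive")
    return config


def load_config(path: str) -> Dict[str, Any]:
    """Read a YAML file with PyYAML and merge it over the defaults."""
    import yaml  # PyYAML, listed under the prerequisites

    with open(path, encoding="utf-8") as fh:
        raw = yaml.safe_load(fh) or {}  # an empty file parses to None
    return merge_config(raw)
```

Rejecting unknown keys is a design choice in this sketch: it turns a typo such as `min_free_mem` into an immediate error instead of a silently ignored setting.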

Start dispatcher with config:

```
python cudaq.py run --config config.yaml
```
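The interaction between `poll_interval` and `min_free_mem_mb` can be pictured as a polling loop. The sketch below is an assumption about the general shape of such a loop, not the code in `cudaq.py`; `dispatch_loop` and its callable parameters are hypothetical and exist so the loop can be exercised without real GPUs.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List


def dispatch_loop(
    commands: List[str],
    query_free_mb: Callable[[], Dict[int, int]],
    launch: Callable[[str, int], None],
    min_free_mem_mb: int,
    poll_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Start pending commands on GPUs as they meet the free-memory threshold."""
    pending: Deque[str] = deque(commands)
    while pending:
        free = query_free_mb()
        # At most one job per GPU per poll, so the next reading
        # reflects the memory the new job has claimed.
        for gpu_id, free_mb in sorted(free.items()):
            if not pending:
                break
            if free_mb >= min_free_mem_mb:
                launch(pending.popleft(), gpu_id)
        if pending:
            sleep(poll_interval)  # wait before checking the GPUs again
```

Starting only one job per GPU per poll is deliberate in this sketch: a freshly launched process has usually not allocated its memory yet, so a second reading taken immediately would still show the GPU as free.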

📋 Job Management

Command File Format

Plain text file with one command per line:

```
# comments start with #
python train_resnet.py
python process_data.py --workers 4
```
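A parser for this format can be very small. The sketch below assumes that blank lines are skipped and that only whole-line comments are recognised (a `#` inside a command is left alone); `parse_commands` is a hypothetical name, not a function exported by `cudaq.py`.

```python
from typing import Iterable, List


def parse_commands(lines: Iterable[str]) -> List[str]:
    """Return the commands in order, skipping blank lines and # comments."""
    commands = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        commands.append(stripped)
    return commands
```

Because it accepts any iterable of strings, it can be called directly on an open file: `parse_commands(open("commands.txt", encoding="utf-8"))`.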

Job Lifecycle

Pending: Waiting for GPU resources
Running: Actively executing on GPU
Completed: Finished successfully
Failed: Exited with error/crash
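The four states map naturally onto a process's exit code. As a sketch, assuming the dispatcher observes jobs the way `subprocess.Popen.poll()` reports them (`None` while running, `0` on success, anything else on failure); `JobStatus` and `status_from_returncode` are hypothetical names:

```python
from enum import Enum
from typing import Optional


class JobStatus(str, Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


def status_from_returncode(started: bool, returncode: Optional[int]) -> JobStatus:
    """Derive the lifecycle state from whether the job started and how it exited."""
    if not started:
        return JobStatus.PENDING
    if returncode is None:  # Popen.poll() returns None while the process is alive
        return JobStatus.RUNNING
    # On POSIX a negative code means the process was killed by a signal.
    return JobStatus.COMPLETED if returncode == 0 else JobStatus.FAILED
```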

💾 Persistent Tracking

Jobs are tracked in JSONL format with:

PID and start time
Assigned GPU ID
Status history
Log file path
Full command string
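One way to get both a status history and crash recovery from a JSONL file is to append a record on every state change and replay the file on startup. The sketch below illustrates that idea under assumptions: the field names (`job_id`, `status`, and so on) and the helpers `append_record` and `load_latest` are hypothetical, and the real file layout may differ.

```python
import json
from typing import Any, Dict


def append_record(jobs_file: str, record: Dict[str, Any]) -> None:
    """Append one state change as a single JSON line."""
    with open(jobs_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_latest(jobs_file: str) -> Dict[str, Dict[str, Any]]:
    """Replay the file and keep the most recent record for each job."""
    latest: Dict[str, Dict[str, Any]] = {}
    try:
        with open(jobs_file, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate a partial line left by a crash mid-write
                latest[record["job_id"]] = record
    except FileNotFoundError:
        pass  # first run: nothing to recover
    return latest
```

After a restart, jobs whose latest record still says `Running` are the ones to re-check, for example by testing whether the stored PID is still alive.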

📁 Log Files

Automatically created in log_dir
Format: job_YYYYMMDD_HHMMSS.log
Contains full stdout/stderr output
Path stored in job tracking file
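As a sketch of how a job could be started with its output captured under this naming scheme: `log_path_for` and `launch` are hypothetical helpers, and pinning the job to a GPU through the `CUDA_VISIBLE_DEVICES` environment variable is an assumption about how the dispatcher assigns devices.

```python
import os
import subprocess
from datetime import datetime
from typing import List, Tuple


def log_path_for(log_dir: str, now: datetime) -> str:
    """Build a job_YYYYMMDD_HHMMSS.log path inside log_dir."""
    return os.path.join(log_dir, now.strftime("job_%Y%m%d_%H%M%S.log"))


def launch(argv: List[str], gpu_id: int, log_dir: str) -> Tuple[subprocess.Popen, str]:
    """Start argv with stdout and stderr written to a fresh log file."""
    os.makedirs(log_dir, exist_ok=True)
    path = log_path_for(log_dir, datetime.now())
    # Assumption: the job sees only its assigned GPU via CUDA_VISIBLE_DEVICES.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    with open(path, "w", encoding="utf-8") as log:
        # The child keeps its own copy of the file descriptor,
        # so closing the file here does not interrupt logging.
        proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT, env=env)
    return proc, path
```

Note that a timestamp with one-second resolution gives two jobs started in the same second the same file name, so a real implementation would likely need a suffix or a check for existing files.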

🔍 Status Monitoring

View current job states:

```
python cudaq.py status
```

Sample output:

```
[→] python train.py     → GPU 0 → Running (PID 1234)
[✓] python infer.py     → GPU 1 → Completed
[ ] python eval.py      → GPU N/A → Pending
```
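Lines of this shape could be rendered from tracking records roughly as follows. This is a sketch under assumptions: the record keys (`command`, `status`, `gpu_id`, `pid`) and the `format_status` helper are hypothetical, the `✗` marker for failed jobs is a guess since the sample output does not show one, and column padding is omitted for brevity.

```python
from typing import Any, Dict

# Markers for Running, Completed, and Pending follow the sample output;
# the Failed marker is an assumption.
MARKERS = {"Running": "→", "Completed": "✓", "Failed": "✗", "Pending": " "}


def format_status(job: Dict[str, Any]) -> str:
    """Render one tracking record as a single status line."""
    marker = MARKERS.get(job["status"], "?")
    gpu = job.get("gpu_id")
    gpu_text = "N/A" if gpu is None else str(gpu)
    line = f"[{marker}] {job['command']} → GPU {gpu_text} → {job['status']}"
    if job["status"] == "Running" and job.get("pid") is not None:
        line += f" (PID {job['pid']})"
    return line
```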
