A Python-based job dispatcher that intelligently assigns compute tasks to GPUs based on real-time memory availability, with persistent job tracking and recovery capabilities.
- 🎯 GPU-aware job scheduling based on free memory
- 📝 Persistent job tracking using JSONL files
- 🔄 Automatic state recovery after crashes/restarts
- 📊 Real-time job status monitoring
- 📁 Per-job log files with stdout/stderr capture
- ⚙️ Configurable thresholds and GPU selection
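The core of "GPU-aware scheduling based on free memory" can be sketched with pynvml. The helper name `pick_gpu` and its signature are illustrative, not part of `cudaq.py`; it returns `None` when no listed GPU has enough free memory (or when no NVIDIA driver is present):

```python
def pick_gpu(gpu_ids, min_free_mb):
    """Return the first GPU in gpu_ids with at least min_free_mb MiB free, or None."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        return None  # pynvml missing or no NVIDIA driver available
    try:
        for i in gpu_ids:
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # nvmlDeviceGetMemoryInfo reports bytes; convert to MiB
            free_mb = pynvml.nvmlDeviceGetMemoryInfo(handle).free // (1024 * 1024)
            if free_mb >= min_free_mb:
                return i
        return None
    finally:
        pynvml.nvmlShutdown()
```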
1. Install prerequisites:

   ```bash
   pip install pynvml pyyaml
   ```

2. Clone/download the dispatcher script:

   ```bash
   git clone [your-repo-url-here]
   ```

3. Create a commands file (`commands.txt`):

   ```
   python train.py --batch-size 128
   python inference.py --input-dir ./data
   ```

4. Run the dispatcher:

   ```bash
   python cudaq.py run --commands-file commands.txt --min-free-mem-mb 8000
   ```

5. Check status:

   ```bash
   python cudaq.py status
   ```
Create `config.yaml` for persistent settings:

```yaml
gpu_ids: [0, 1, 2]       # Which GPUs to use
min_free_mem_mb: 10000   # Minimum free memory required
poll_interval: 30        # Check interval in seconds
commands_file: jobs.txt  # Job commands source
log_dir: ./job_logs      # Log storage location
jobs_file: ./queue.jsonl # Job tracking file
```

Start the dispatcher with the config:

```bash
python cudaq.py run --config config.yaml
```
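Loading such a config is straightforward with PyYAML. This is a minimal sketch, assuming file values are merged over built-in defaults that mirror the example above (the `load_config` helper and the merge behavior are assumptions, not documented `cudaq.py` internals):

```python
import yaml  # provided by the pyyaml prerequisite

# Defaults mirroring the example config.yaml values above (assumed).
DEFAULTS = {
    "gpu_ids": [0, 1, 2],
    "min_free_mem_mb": 10000,
    "poll_interval": 30,
    "commands_file": "jobs.txt",
    "log_dir": "./job_logs",
    "jobs_file": "./queue.jsonl",
}

def load_config(path=None):
    """Merge a YAML config file over the built-in defaults."""
    cfg = dict(DEFAULTS)
    if path:
        with open(path) as f:
            cfg.update(yaml.safe_load(f) or {})
    return cfg
```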
Plain text file with one command per line:

```
# comments start with #
python train_resnet.py
python process_data.py --workers 4
```
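Parsing this format reduces to skipping blank lines and `#` comments. A minimal sketch (the `load_commands` name is illustrative):

```python
def load_commands(path):
    """Read a commands file: one shell command per line.

    Blank lines and lines starting with '#' are skipped.
    """
    commands = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                commands.append(line)
    return commands
```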
- Pending: Waiting for GPU resources
- Running: Actively executing on GPU
- Completed: Finished successfully
- Failed: Exited with error/crash
Jobs are tracked in JSONL format with:
- PID and start time
- Assigned GPU ID
- Status history
- Log file path
- Full command string
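Because JSONL is append-only, crash recovery can simply replay the file and keep the latest record per job. A sketch under that assumption (the field names `id` and `status` and the last-write-wins rule are illustrative, not the documented `cudaq.py` schema):

```python
import json

def append_job(jobs_file, record):
    """Append one job record as a single JSON line (JSONL)."""
    with open(jobs_file, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_jobs(jobs_file):
    """Rebuild job state after a restart: later lines supersede earlier ones per job id."""
    jobs = {}
    with open(jobs_file) as f:
        for line in f:
            rec = json.loads(line)
            jobs[rec["id"]] = rec  # last-write-wins per job id
    return jobs
```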
- Automatically created in `log_dir`
- Format: `job_YYYYMMDD_HHMMSS.log`
- Contains full stdout/stderr output
- Path stored in the job tracking file
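Launching a job with this log capture typically means redirecting stderr into stdout and pinning the process to one GPU via `CUDA_VISIBLE_DEVICES`. A minimal sketch (the `launch` helper is illustrative, not the dispatcher's actual entry point):

```python
import os
import subprocess
from datetime import datetime

def launch(command, gpu_id, log_dir="./job_logs"):
    """Start a job pinned to one GPU, capturing stdout/stderr to a timestamped log."""
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, datetime.now().strftime("job_%Y%m%d_%H%M%S.log"))
    # Restrict the child process to the assigned GPU.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    log = open(log_path, "w")
    proc = subprocess.Popen(
        command, shell=True, stdout=log, stderr=subprocess.STDOUT, env=env
    )
    return proc, log_path
```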
View current job states:

```bash
python cudaq.py status
```

Sample output:

```
[→] python train.py → GPU 0 → Running (PID 1234)
[✓] python infer.py → GPU 1 → Completed
[ ] python eval.py  → GPU N/A → Pending
```