Current Status
Our Rodan server instances (both production and staging) are currently running on Arbutus Cloud's vGPU infrastructure. PACO training jobs are still affected by a runtime error that we have not been able to fix for months, but other GPU-accelerated workloads are functioning normally on production, and all GPU jobs are working as expected on staging.
Upcoming Changes
vGPU License Expiration
- Critical Date: July 31, 2025. The current vGPU license on Arbutus expires on this date; as Arbutus users, we have no control over this
- Impact: After this date, GPU accelerator functionality will be unavailable
- Instance Status: Instances will continue running but without GPU acceleration
Infrastructure Upgrade
- Current: vGPU flavors (software-based time-slicing)
- Future: Multi-Instance GPU (MIG) flavors with larger VRAM per virtualized GPU, in early September 2025
- Key Difference: MIG provides hardware-level isolation vs. the current vGPU's software-based approach
Expected Benefits After Migration
- Larger VRAM allocation per GPU instance
- Better hardware-level isolation
- Improved performance and reliability
What This Means for Us
Expected Impact During Gap Period
During the gap between the license expiration on July 31 and the arrival of the MIG flavors in early September, all our GPU jobs should, in theory, still be able to run. However, without GPU acceleration they will either run very slowly or fail due to insufficient memory or other resource constraints. Please be prepared for this inconvenience.
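For anyone who wants to check whether an instance still sees a GPU before launching a long job, a minimal sketch in Python follows. It is not part of Rodan; the function names `gpu_available` and `choose_device` are hypothetical, and it assumes the NVIDIA driver's `nvidia-smi` tool is installed on the instance. It only tests whether the driver lists a device, so it may not detect a GPU that is still visible but restricted after the license expires.

```python
import shutil
import subprocess


def gpu_available(smi_command: str = "nvidia-smi", timeout_s: float = 10.0) -> bool:
    """Return True if the NVIDIA driver lists at least one GPU.

    Hypothetical helper: it checks visibility only, not licensing or
    actual performance.
    """
    path = shutil.which(smi_command)
    if path is None:
        # The tool is not installed or not on PATH, so assume no GPU.
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            [path, "-L"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def choose_device() -> str:
    """Pick "gpu" when a device is visible, otherwise fall back to "cpu"."""
    return "gpu" if gpu_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {choose_device()}")
```

Running the script on an instance prints which device the check would select; a result of "cpu" suggests that GPU jobs on that instance will be slow or may fail.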
Current Challenges & Mitigation Efforts
What makes this situation more challenging is that our lab machines, which are equipped with Apple M-series chips, are not compatible with the current Docker images for GPU jobs. We are actively working to rewrite them and find a way to use the GPUs on our lab machines first, in the hope that these machines can help handle GPU jobs during the transition period.
Additional Concerns
Because we have limited information about the new Arbutus GPU infrastructure, and many of our GPU-related dependencies are very old, other unforeseen technical problems may arise during the transition to the new MIG-based system. Please be patient with all of this. We will continue monitoring the situation and will update this post as soon as more information becomes available from Arbutus Cloud.
Updated July 10, 2025.