
Critical: Rodan GPU Services Affected by Arbutus Cloud Infrastructure Changes #1315

@homework36

Current Status

Our Rodan server instances (both production and staging) currently run on Arbutus Cloud's vGPU infrastructure. On production, PACO training jobs are affected by a runtime error that we have been unable to fix for months, but other GPU-accelerated workloads are functioning normally. On staging, all GPU jobs are working as expected.

Upcoming Changes

vGPU License Expiration

  • Critical Date: July 31, 2025. The current vGPU license on Arbutus expires on this date; as Arbutus users, we have no control over this.
  • Impact: After this date, GPU acceleration will be unavailable
  • Instance Status: Instances will continue running but without GPU acceleration

Infrastructure Upgrade

  • Current: vGPU flavors (software-based time-slicing)
  • Future: Multi-Instance GPU (MIG) flavors with larger VRAM per virtualized GPU, expected in early September 2025
  • Key Difference: MIG provides hardware-level isolation vs. the current vGPU's software-based approach

Expected Benefits After Migration

  • Larger VRAM allocation per GPU instance
  • Better hardware-level isolation
  • Improved performance and reliability

What This Means for Us

Expected Impact During Gap Period

In theory, all our GPU jobs should still be able to run after July 31. Without GPU acceleration, however, they will either run very slowly or fail due to insufficient memory or other resource constraints. Please be prepared for this inconvenience.
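To tell which situation an instance is in before submitting a long job, a pre-flight check along these lines may help. This is only a sketch: it assumes the NVIDIA driver tools are installed on the instance and relies on `nvidia-smi -L`, which lists the devices the driver reports. The helper names `list_gpus` and `preflight` are hypothetical and not part of Rodan. It only checks whether a device is reported; whether an unlicensed vGPU still appears in this list is something we do not know yet.

```python
import shutil
import subprocess


def list_gpus(smi_path=None):
    """Return the device lines printed by `nvidia-smi -L`, or [] if none are visible.

    `smi_path` is an optional explicit path to the tool (hypothetical parameter,
    mainly useful for testing); by default the tool is looked up on PATH.
    """
    smi_path = smi_path or shutil.which("nvidia-smi")
    if smi_path is None:
        return []  # driver tools not installed or not on PATH
    try:
        result = subprocess.run(
            [smi_path, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []  # tool missing, not executable, or hung
    if result.returncode != 0:
        return []  # driver could not report any device
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def preflight(gpus):
    """Hypothetical policy: label how a job would run given the visible devices."""
    return "gpu" if gpus else "cpu-fallback"


if __name__ == "__main__":
    devices = list_gpus()
    print(preflight(devices))
    for device in devices:
        print(device)
```

A job wrapper could refuse to start, or start with a warning, when the result is `cpu-fallback`, rather than running for hours on CPU or failing partway through.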

Current Challenges & Mitigation Efforts

This situation is more challenging because our lab machines, which are equipped with Apple M chips, are not compatible with the current Docker images for GPU jobs. We are actively working on a rewrite and on a way to use the GPUs on our lab machines first, in the hope that these machines can handle GPU jobs during the transition period.
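As an illustration of what using the lab machines' GPUs could look like, here is a minimal device-selection sketch. It assumes a job built on PyTorch, which may not hold for every Rodan GPU job, and the function names `pick_device` and `detect_device` are hypothetical. The idea is to prefer CUDA when it is present (Arbutus), fall back to Apple's Metal backend (MPS) on M-chip machines, and use the CPU otherwise.

```python
def pick_device(cuda_available, mps_available):
    """Choose a device name from availability flags, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Probe the local machine; returns "cpu" if PyTorch is not installed."""
    try:
        import torch  # assumption: the job uses PyTorch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

Keeping the choice in one small function means the same job code could run on Arbutus and on a lab machine without edits, though whether each job's dependencies support MPS would still need to be checked case by case.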

Additional Concerns

We have limited information about the new Arbutus GPU infrastructure, and many of our GPU-related dependencies are very old, so other unforeseen technical problems may arise when we transition to the new MIG-based system. Please be patient with all of this. We will continue monitoring the situation and will update this post as soon as more information becomes available from Arbutus Cloud.

Updated July 10, 2025.
