An unofficial example of creating Multi-Instance GPU (MIG) instances with the NVIDIA Management Library (NVML) Go bindings.
Prerequisites:
Take the A30 as an example:

- Clone this repo and `cd` into it:

  ```sh
  git clone https://github.com/j3soon/go-nvml-mig-create-instance.git
  cd go-nvml-mig-create-instance
  ```

- Launch a Docker container for Go:

  ```sh
  docker run --rm -it --gpus all \
    -v $(pwd):/workspace \
    --cap-add=SYS_ADMIN \
    -e NVIDIA_MIG_CONFIG_DEVICES=all \
    golang
  # in the container
  cd /workspace
  ```

  Note: `--runtime=nvidia`, `-e NVIDIA_VISIBLE_DEVICES=all`, and `-e NVIDIA_DRIVER_CAPABILITIES=all` may be required depending on your environment and use cases.

  Alternatively, you can install Go on your host machine and skip this step.

- Run the example and observe the results:

  ```sh
  go run main.go
  # List the available CIs and GIs
  nvidia-smi mig -lgi; nvidia-smi mig -lci;
  # Destroy all the CIs and GIs
  nvidia-smi mig -dci; nvidia-smi mig -dgi;
  ```

This should also work on A100/H100/H200 by substituting the MIG profile with a supported one.
To create MIG GIs and CIs, we first retrieve the instance profile information and then create instances based on that profile:

```go
// Assuming the 0-th device is MIG-enabled
device, ret := nvml.DeviceGetHandleByIndex(0)
// Create GPU Instance
giProfileInfo, ret := device.GetGpuInstanceProfileInfo(nvml.GPU_INSTANCE_PROFILE_4_SLICE)
gi, ret := device.CreateGpuInstance(&giProfileInfo)
// Create Compute Instance
ciProfileInfo, ret := gi.GetComputeInstanceProfileInfo(nvml.COMPUTE_INSTANCE_PROFILE_2_SLICE, nvml.COMPUTE_INSTANCE_ENGINE_PROFILE_SHARED)
_, ret = gi.CreateComputeInstance(&ciProfileInfo)
```
The following source code references are based on go-nvml v0.12.4-1:
- After getting the device handle of the 0-th GPU, we want to create a GI with `CreateGpuInstance`. Take a look at its Go binding (ref: cpp, go, src):

  ```go
  CreateGpuInstance(*GpuInstanceProfileInfo) (GpuInstance, Return)
  ```
- We can see that it takes a reference to `GpuInstanceProfileInfo` as its argument. Take a look at its source (ref: cpp, go, src):

  ```c
  /**
   * GPU instance profile information.
   */
  typedef struct nvmlGpuInstanceProfileInfo_st
  {
      unsigned int id;                  //!< Unique profile ID within the device
      unsigned int isP2pSupported;      //!< Peer-to-Peer support
      unsigned int sliceCount;          //!< GPU Slice count
      unsigned int instanceCount;       //!< GPU instance count
      unsigned int multiprocessorCount; //!< Streaming Multiprocessor count
      unsigned int copyEngineCount;     //!< Copy Engine count
      unsigned int decoderCount;        //!< Decoder Engine count
      unsigned int encoderCount;        //!< Encoder Engine count
      unsigned int jpegCount;           //!< JPEG Engine count
      unsigned int ofaCount;            //!< OFA Engine count
      unsigned long long memorySizeMB;  //!< Memory size in MBytes
  } nvmlGpuInstanceProfileInfo_t;
  ```
- We suspect that this information isn't meant to be filled in by hand. Let's check the source of the `GetGpuInstanceProfileInfo` API, which retrieves this information (ref: cpp, go, src):

  ```c
  /**
   * Get GPU instance profile information
   *
   * Information provided by this API is immutable throughout the lifetime of a MIG mode.
   *
   * For Ampere &tm; or newer fully supported devices.
   * Supported on Linux only.
   *
   * @param device                               The identifier of the target device
   * @param profile                              One of the NVML_GPU_INSTANCE_PROFILE_*
   * @param info                                 Returns detailed profile information
   *
   * @return
   *         - \ref NVML_SUCCESS                 Upon success
   *         - \ref NVML_ERROR_UNINITIALIZED    If library has not been successfully initialized
   *         - \ref NVML_ERROR_INVALID_ARGUMENT If \a device, \a profile or \a info are invalid
   *         - \ref NVML_ERROR_NOT_SUPPORTED    If \a device doesn't support MIG or \a profile isn't supported
   *         - \ref NVML_ERROR_NO_PERMISSION    If user doesn't have permission to perform the operation
   */
  nvmlReturn_t DECLDIR nvmlDeviceGetGpuInstanceProfileInfo(nvmlDevice_t device, unsigned int profile, nvmlGpuInstanceProfileInfo_t *info);
  ```
- It seems we need to pass one of `NVML_GPU_INSTANCE_PROFILE_*` as the `profile` argument. Let's view the source (ref: cpp, go, src):

  ```c
  /**
   * GPU instance profiles.
   *
   * These macros should be passed to \ref nvmlDeviceGetGpuInstanceProfileInfo to retrieve the
   * detailed information about a GPU instance such as profile ID, engine counts.
   */
  #define NVML_GPU_INSTANCE_PROFILE_1_SLICE      0x0
  #define NVML_GPU_INSTANCE_PROFILE_2_SLICE      0x1
  #define NVML_GPU_INSTANCE_PROFILE_3_SLICE      0x2
  #define NVML_GPU_INSTANCE_PROFILE_4_SLICE      0x3
  #define NVML_GPU_INSTANCE_PROFILE_7_SLICE      0x4
  #define NVML_GPU_INSTANCE_PROFILE_8_SLICE      0x5
  #define NVML_GPU_INSTANCE_PROFILE_6_SLICE      0x6
  #define NVML_GPU_INSTANCE_PROFILE_1_SLICE_REV1 0x7
  #define NVML_GPU_INSTANCE_PROFILE_2_SLICE_REV1 0x8
  #define NVML_GPU_INSTANCE_PROFILE_1_SLICE_REV2 0x9
  #define NVML_GPU_INSTANCE_PROFILE_COUNT        0xA
  ```

  Please note that `NVML_GPU_INSTANCE_PROFILE_COUNT` here is only a trick to get the number of profiles; it is not meant to be used as a profile.

- Based on the comments, we can see that our hypothesis is correct. We use `NVML_GPU_INSTANCE_PROFILE_4_SLICE` in our example.
- After creating a GI, we want to create a CI with `CreateComputeInstance`. Take a look at its Go binding (ref: cpp, go, src):

  ```go
  GpuInstanceCreateComputeInstance(GpuInstance, *ComputeInstanceProfileInfo) (ComputeInstance, Return)
  ```
- Similar to the case of creating GIs, we'll need a `ComputeInstanceProfileInfo`. Let's look at its source (ref: cpp, go, src):

  ```c
  /**
   * Compute instance profile information.
   */
  typedef struct nvmlComputeInstanceProfileInfo_st
  {
      unsigned int id;                    //!< Unique profile ID within the GPU instance
      unsigned int sliceCount;            //!< GPU Slice count
      unsigned int instanceCount;         //!< Compute instance count
      unsigned int multiprocessorCount;   //!< Streaming Multiprocessor count
      unsigned int sharedCopyEngineCount; //!< Shared Copy Engine count
      unsigned int sharedDecoderCount;    //!< Shared Decoder Engine count
      unsigned int sharedEncoderCount;    //!< Shared Encoder Engine count
      unsigned int sharedJpegCount;       //!< Shared JPEG Engine count
      unsigned int sharedOfaCount;        //!< Shared OFA Engine count
  } nvmlComputeInstanceProfileInfo_t;
  ```
- Similarly, let's check the source of the `GetComputeInstanceProfileInfo` API (ref: cpp, go, src):

  ```c
  /**
   * Get compute instance profile information.
   *
   * Information provided by this API is immutable throughout the lifetime of a MIG mode.
   *
   * For Ampere &tm; or newer fully supported devices.
   * Supported on Linux only.
   *
   * @param gpuInstance                          The identifier of the target GPU instance
   * @param profile                              One of the NVML_COMPUTE_INSTANCE_PROFILE_*
   * @param engProfile                           One of the NVML_COMPUTE_INSTANCE_ENGINE_PROFILE_*
   * @param info                                 Returns detailed profile information
   *
   * @return
   *         - \ref NVML_SUCCESS                 Upon success
   *         - \ref NVML_ERROR_UNINITIALIZED    If library has not been successfully initialized
   *         - \ref NVML_ERROR_INVALID_ARGUMENT If \a gpuInstance, \a profile, \a engProfile or \a info are invalid
   *         - \ref NVML_ERROR_NOT_SUPPORTED    If \a profile isn't supported
   *         - \ref NVML_ERROR_NO_PERMISSION    If user doesn't have permission to perform the operation
   */
  nvmlReturn_t DECLDIR nvmlGpuInstanceGetComputeInstanceProfileInfo(nvmlGpuInstance_t gpuInstance, unsigned int profile, unsigned int engProfile, nvmlComputeInstanceProfileInfo_t *info);
  ```
- We should pass one of `NVML_COMPUTE_INSTANCE_PROFILE_*` as the first (`profile`) argument. Let's view the source (ref: cpp, go, src):

  ```c
  /**
   * Compute instance profiles.
   *
   * These macros should be passed to \ref nvmlGpuInstanceGetComputeInstanceProfileInfo to retrieve the
   * detailed information about a compute instance such as profile ID, engine counts
   */
  #define NVML_COMPUTE_INSTANCE_PROFILE_1_SLICE      0x0
  #define NVML_COMPUTE_INSTANCE_PROFILE_2_SLICE      0x1
  #define NVML_COMPUTE_INSTANCE_PROFILE_3_SLICE      0x2
  #define NVML_COMPUTE_INSTANCE_PROFILE_4_SLICE      0x3
  #define NVML_COMPUTE_INSTANCE_PROFILE_7_SLICE      0x4
  #define NVML_COMPUTE_INSTANCE_PROFILE_8_SLICE      0x5
  #define NVML_COMPUTE_INSTANCE_PROFILE_6_SLICE      0x6
  #define NVML_COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1 0x7
  #define NVML_COMPUTE_INSTANCE_PROFILE_COUNT        0x8
  ```
- We use `COMPUTE_INSTANCE_PROFILE_2_SLICE` for the first argument in our example. As for the second argument (`engProfile`), let's also look at the source (ref: cpp, go, src):

  ```c
  #define NVML_COMPUTE_INSTANCE_ENGINE_PROFILE_SHARED 0x0 //!< All the engines except multiprocessors would be shared
  #define NVML_COMPUTE_INSTANCE_ENGINE_PROFILE_COUNT  0x1
  ```

  We can only use `COMPUTE_INSTANCE_ENGINE_PROFILE_SHARED` for the second argument in our example. Although we can currently only share GPU engines (Copy Engine (CE), NVENC, NVDEC, NVJPEG, Optical Flow Accelerator (OFA), etc.) between CIs within the same GI, this struct may be extended in the future to support isolating these engines for each CI within the same GI.
Some references I found useful during the investigation:
API References (Useful for searching API definitions):
Thanks to Hsu-Tzu Ting for the discussions.