Getting started with MIG partitioning
Multi-Instance GPU (MIG) mode is supported only by NVIDIA GPUs based on the Ampere, Hopper, and newer architectures.
To enable Dynamic MIG Partitioning on a node, the following prerequisites must be met:
- if the node has multiple GPUs, all the GPUs must be of the same model
- all the GPUs of the node must have MIG mode enabled
Enable MIG mode
By default, MIG is not enabled on GPUs. To enable it, SSH into the node and run the following command for each GPU on which you want to enable MIG, where <index> is the index of the GPU:
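A minimal sketch of this step, assuming the standard `nvidia-smi` flags (`-i` to select a GPU by index, `-mig 1` to enable MIG mode). The `GPU_INDEXES` list and the `DRY_RUN` switch are hypothetical helpers introduced for illustration; with the default `DRY_RUN=1` the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: enable MIG mode on a set of GPUs of the current node.
# GPU_INDEXES is a hypothetical list; adjust it to the GPUs on your node.
GPU_INDEXES="${GPU_INDEXES:-0 1}"
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

enable_mig() {
  # $1 is the GPU index: -i selects the GPU, -mig 1 enables MIG mode on it
  if [ "$DRY_RUN" = "1" ]; then
    echo "sudo nvidia-smi -i $1 -mig 1"
  else
    sudo nvidia-smi -i "$1" -mig 1
  fi
}

# Word splitting on GPU_INDEXES is intentional: one call per index
for index in $GPU_INDEXES; do
  enable_mig "$index"
done
```

Running it with `DRY_RUN=0 GPU_INDEXES="0" sh enable-mig.sh` (the file name is arbitrary) would enable MIG on GPU 0 only.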
Depending on the kind of machine you are using, it may be necessary to reboot the node after enabling MIG mode for one of its GPUs.
You can check whether MIG mode has been successfully enabled by running the following command and checking if you get a similar output:
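One possible way to run this check, assuming the `--query-gpu` and `--format=csv` options of `nvidia-smi`. The `GPU_INDEX` variable and the `DRY_RUN` switch are hypothetical; by default the command is only printed, so the sketch can be tried on a machine without a GPU.

```shell
#!/bin/sh
# Sketch: query the current MIG mode of one GPU.
GPU_INDEX="${GPU_INDEX:-0}"   # hypothetical index, adjust to your node
DRY_RUN="${DRY_RUN:-1}"       # 1 = print the command, 0 = execute it

query_mig_mode() {
  # $1 is the GPU index; the query returns the PCI bus id and the MIG mode as CSV
  if [ "$DRY_RUN" = "1" ]; then
    echo "nvidia-smi -i $1 --query-gpu=pci.bus_id,mig.mode.current --format=csv"
  else
    nvidia-smi -i "$1" --query-gpu=pci.bus_id,mig.mode.current --format=csv
  fi
}

query_mig_mode "$GPU_INDEX"
```

When the command is actually executed on the node, the `mig.mode.current` column should read `Enabled` for a GPU on which MIG mode is active.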
For more information and troubleshooting, refer to the NVIDIA documentation.
Enable automatic partitioning
You can enable automatic MIG partitioning on a node by adding the following label to it:
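A sketch of the labeling step with `kubectl label`. The label key and value used here, `nos.nebuly.com/gpu-partitioning=mig`, are an assumption and should be checked against the documentation of your installed nos version; `NODE_NAME` and `DRY_RUN` are hypothetical helpers for illustration.

```shell
#!/bin/sh
# Sketch: label a node so that nos manages its MIG partitioning.
NODE_NAME="${NODE_NAME:-my-gpu-node}"          # hypothetical node name
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print the command, 0 = execute it
# Assumed label; verify it against the nos documentation before use
LABEL="nos.nebuly.com/gpu-partitioning=mig"

label_node() {
  # $1 is the name of the node to label
  if [ "$DRY_RUN" = "1" ]; then
    echo "kubectl label nodes $1 $LABEL"
  else
    kubectl label nodes "$1" "$LABEL"
  fi
}

label_node "$NODE_NAME"
```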
The label delegates the management of the MIG resources of all the GPUs of that node to nos, so you no longer have to configure the MIG geometry of the GPUs manually: nos dynamically creates and deletes MIG profiles according to the resources requested by the pods submitted to the cluster, within the limits of the MIG geometries supported by each GPU model.
The MIG geometries supported by each GPU model are defined in a ConfigMap, which by default contains the supported geometries of the most popular GPU models. You can override or extend the values of this ConfigMap by editing the field gpuPartitioner.knownMigGeometries of the installation chart.
Create pods requesting MIG resources
There is no need to manually create and manage MIG configurations. You can simply submit your Pods to the cluster and the requested MIG devices are automatically provisioned.
You can make your pods request slices of a GPU by specifying MIG devices in the resource requests of their containers:
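A minimal sketch of such a pod, written to a file from the shell. It assumes that MIG devices are exposed as extended resources named `nvidia.com/mig-<profile>` and requests a single `1g.10gb` device; the pod name, container name, image, command and file name are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: write a Pod manifest that requests one 1g.10gb MIG device.
cat > mig-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mig-partitioning-example   # placeholder name
spec:
  containers:
    - name: sleepy                 # placeholder container
      image: busybox:latest
      command: ["sleep", "120"]
      resources:
        limits:
          # Assumed resource name format: nvidia.com/mig-<profile>
          nvidia.com/mig-1g.10gb: 1
EOF

# Submit it to the cluster (requires kubectl access to the cluster):
#   kubectl apply -f mig-pod.yaml
```

The device is listed under `limits` because Kubernetes requires extended resources to be set there; a matching `requests` entry is optional.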
In the example above, the pod requests a GPU slice with 10 GB of memory, which is the smallest unit available on NVIDIA-A100-80GB-PCIe GPUs. If your cluster has different GPU models, nos might not be able to create the specified MIG resource. You can find the MIG profiles supported by each GPU model in the NVIDIA documentation.
Each container should request at most one MIG device. If a container needs more resources, it should request a single larger device rather than multiple smaller ones.