nos allows you to schedule Pods requesting fractions of GPUs. The GPUs are automatically
partitioned into slices that can be requested by individual containers. In this way,
GPUs are shared among multiple Pods increasing the overall utilization.
The GPUs partitioning is performed automatically in real-time based on the requests of
the Pods in your cluster.
nos constantly watches the pending Pods and finds the best possible GPU partitioning configuration
to schedule the highest number of the ones requesting fractions of GPUs.
You can think of
nos as a Cluster Autoscaler for GPUs: instead of
adjusting the number of nodes and GPUs, it dynamically partitions them to maximize their utilization, leading to spare
GPU capacity. Then, you can schedule more Pods or reduce the number of GPU nodes needed, reducing infrastructure costs.
The GPU partitioning is performed either using Multi-instance GPU (MIG) or Multi-Process Service (MPS), depending on the partitioning mode you choose for each node.