Using GPU Utilization to Effectively Scale Inference Servers

 

Use smarter autoscaling to reduce GPU costs for your inference workloads on GKE.


While LLMs deliver value across a growing range of use cases, running LLM inference workloads can be costly. If you're taking advantage of the latest open models and infrastructure, autoscaling can help you optimize costs by ensuring you meet customer demand while paying only for the AI accelerators you need.

Google Kubernetes Engine (GKE) is a managed container orchestration service that makes it simple to deploy, scale, and manage your LLM inference workloads. When you set up inference workloads on GKE, the Horizontal Pod Autoscaler (HPA) is a quick and simple way to ensure that your model servers scale with load. By tuning the HPA configuration to balance provisioned hardware costs against incoming traffic demands, you can hit your inference server's performance targets.
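
To make that concrete, here is a minimal sketch, written as a Python dict dumped to YAML, of an autoscaling/v2 HorizontalPodAutoscaler that scales a model-server Deployment on a pod-level custom metric. The Deployment name tgi-server, the metric name model_server_metric, the target value, and the replica bounds are placeholders, and the sketch assumes the metric is already exposed through the custom metrics API (for example via the custom metrics Stackdriver adapter).

import yaml  # PyYAML, used only to print the manifest

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "model-server-hpa"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "tgi-server",  # placeholder Deployment name
        },
        "minReplicas": 1,
        "maxReplicas": 16,
        "metrics": [{
            "type": "Pods",
            "pods": {
                "metric": {"name": "model_server_metric"},  # placeholder metric name
                "target": {"type": "AverageValue", "averageValue": "10"},  # placeholder target
            },
        }],
    },
}

# Print as YAML so it can be saved and applied with `kubectl apply -f`.
print(yaml.safe_dump(hpa, sort_keys=False))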

Because configuring autoscaling for LLM inference workloads can be challenging, Google used ai-on-gke/benchmarks to compare multiple autoscaling metrics on GPUs and derive best practices. The setup uses HPA with the Text Generation Inference (TGI) model server. Note that these experiments carry over to other inference servers that expose similar metrics, such as vLLM.

Choosing the right metric

Here are some metric-comparison experiments, shown with Cloud Monitoring dashboards. For each experiment, Google ran TGI with Llama 2 7b on a single L4 GPU (a g2-standard-16 machine) using the HPA custom metrics Stackdriver adapter, and generated traffic with the ai-on-gke locust-load-generation tool using varying request sizes. The same traffic load was used for each of the experiments shown below, and the thresholds were determined empirically.

Note that the mean-time-per-token graph shows TGI's metric for the total time spent on prefill and decode, divided by the number of output tokens generated, per request. This metric makes it possible to compare how autoscaling on each candidate metric affects latency.
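
As a quick illustration of that formula, the snippet below computes mean time per token for a single hypothetical request; the numbers are made up for the example and are not taken from these experiments.

# Mean time per token for one request: (prefill time + decode time) / output tokens.
prefill_seconds = 0.35   # hypothetical time spent on prefill
decode_seconds = 5.25    # hypothetical time spent on decode
output_tokens = 128      # hypothetical number of output tokens generated

mean_time_per_token = (prefill_seconds + decode_seconds) / output_tokens
print(f"mean time per token: {mean_time_per_token:.4f} s")  # prints ~0.0437 s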

GPU utilization

By default, HPA autoscales on CPU or memory usage. This works well for CPU-bound workloads, but inference servers depend heavily on GPUs, so those metrics alone no longer reflect a job's resource usage. The GPU-side equivalent is GPU utilization, which measures the GPU duty cycle, i.e. the amount of time the GPU is active.

What does GPU utilization mean?

GPU utilization is the proportion of a graphics processing unit's (GPU) processing capacity that is currently in use. GPUs are specialized hardware components that handle complex mathematical operations for graphics rendering and parallel computing.

The GPU utilization graph has no clear relationship to the request mean-time-per-token graph: GPU utilization keeps climbing and HPA keeps scaling up even while request mean-time-per-token is decreasing. For LLM autoscaling, GPU utilization is not an effective metric. It is hard to relate this metric to the traffic the inference server is currently handling. Because GPU duty cycle does not measure FLOP utilization, it cannot tell us how much work the accelerator is doing or when it is running at maximum capacity. Financially, GPU utilization is also inefficient, since it tends to overprovision compared with the other metrics below.

In conclusion, Google does not recommend using GPU utilization to autoscale LLM inference workloads.

Batch size

Given the limitations of the GPU utilization metric, Google also examined the LLM server metrics exposed by TGI. The metrics examined here are already available on the most popular inference servers.

One of the metrics selected was batch size (tgi_batch_current_size), which represents the number of requests processed in each inference iteration.

There is a direct correlation between the current batch size graph and the request mean-time-per-token graph: smaller batch sizes yield lower latencies. Batch size is a great metric for optimizing for low latency because it gives a direct view of how much traffic the inference server is processing at any given moment. One drawback of the current batch size metric is that batch size can vary slightly with different incoming request sizes, which made it difficult to trigger scale-up while targeting the maximum batch size and, therefore, maximum throughput. HPA must use a target value slightly below the maximum batch size to ensure a scale-up actually occurs.

Google recommends using the current batch size metric if you are targeting a specific tail latency.
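
As a sketch of what that could look like in practice, the HPA metrics entry below targets tgi_batch_current_size with an average value slightly below the server's maximum batch size. The value 28 is a placeholder, not a recommendation, and the exact metric name and type seen by HPA depend on how TGI's metrics are exported to the custom or external metrics API.

# HPA metrics entry targeting TGI's current batch size; slot this into
# spec.metrics of the HorizontalPodAutoscaler sketched earlier.
batch_size_metric = {
    "type": "Pods",
    "pods": {
        "metric": {"name": "tgi_batch_current_size"},
        # Set slightly below the maximum batch size so HPA reliably scales up;
        # 28 is a placeholder value, not a recommendation.
        "target": {"type": "AverageValue", "averageValue": "28"},
    },
}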

Queue size

The other TGI LLM server metric used was queue size (tgi_queue_size). Queue size is the number of requests waiting in the inference server's queue before they are added to the current batch.


*Note that the pod count dropped because HPA initiated a scale-down once the default five-minute stabilization period ended. You can easily tune this stabilization window, along with other default HPA configuration options, to match your traffic requirements.

We observe a direct correlation between the queue size graph and the request mean-time-per-token graph: larger queue sizes lead to higher latencies. Queue size proved to be a great metric for autoscaling inference workloads because it gives a direct view of how much traffic the inference server is waiting to process; a growing queue indicates that the batch is full. Because queue size reflects only the number of requests sitting in the queue, not the number currently being processed, autoscaling on it cannot achieve latencies as low as autoscaling on batch size.

Google recommends using queue size to maximize throughput while still controlling tail latency.
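
A similar sketch for the queue size metric is shown below, this time also setting the scale-down stabilization window explicitly (300 seconds is the HPA default mentioned in the note above). The metric name is assumed to be exposed through the custom metrics API, and the target value of 10 is a placeholder.

# HPA spec fragment: autoscale on TGI's queue size and pin the scale-down
# stabilization window to the 300s default (tune it to your traffic patterns).
queue_size_spec = {
    "metrics": [{
        "type": "Pods",
        "pods": {
            "metric": {"name": "tgi_queue_size"},
            "target": {"type": "AverageValue", "averageValue": "10"},  # placeholder target
        },
    }],
    "behavior": {
        "scaleDown": {"stabilizationWindowSeconds": 300},
    },
}
# Merge these fields into the spec of the HorizontalPodAutoscaler sketched earlier.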

Determining the target value thresholds

To further demonstrate the strength of the queue size and batch size metrics, Google used the profile-generator in ai-on-gke/benchmarks to determine suitable thresholds for these experiments (see the sketch after the list below). With that in mind, the following thresholds were chosen:

  • To represent a throughput-optimized workload, the queue size was taken at the point where only latency was still rising and throughput was no longer growing.
  • To represent a latency-sensitive workload, autoscaling was done on a batch size at a latency threshold of roughly 80% of the optimal throughput.
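
The snippet below is a minimal sketch of how thresholds like these could be read off a latency/throughput profile. The profile points are hypothetical and the real numbers come from the profile-generator output; the code only illustrates the selection logic described in the two bullets above.

# Hypothetical profile points: (throughput tokens/s, mean latency s/token, queue size, batch size)
profile = [
    (400,  0.15,  2,  8),
    (700,  0.20,  5, 16),
    (900,  0.28, 12, 28),
    (950,  0.35, 25, 35),
    (955,  0.55, 60, 37),   # throughput has plateaued; only latency keeps rising
]

max_throughput = max(t for t, *_ in profile)

# Throughput-optimized workload: queue size at the point where throughput stops growing.
PLATEAU_TOLERANCE = 0.01  # within 1% of max counts as "no longer growing"
queue_threshold = next(q for t, _, q, _ in profile
                       if t >= max_throughput * (1 - PLATEAU_TOLERANCE))

# Latency-sensitive workload: batch size at roughly 80% of the optimal throughput.
batch_threshold = max(b for t, _, _, b in profile if t <= 0.8 * max_throughput)

print(f"queue size target: {queue_threshold}, batch size target: {batch_threshold}")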

Each experiment ran TGI with Llama 2 7b, one replica per L4 GPU, on two g2-standard-96 machines, with autoscaling enabled between 1 and 16 replicas using the HPA custom metrics Stackdriver adapter. Traffic was generated with the ai-on-gke locust-load-generation tool using varying request sizes. After finding a load at which the replica count stabilized at roughly ten, the load was increased by 150% to simulate traffic spikes.

We find that the queue size target threshold keeps the mean time per token below ~0.4s, even in the face of the 150% traffic spikes.
Batch size

Note that the average batch size has decreased by nearly 60%, which corresponds to the roughly 60% drop in traffic.

We find that the batch size target threshold keeps the mean time per token near or below ~0.3s, even with the 150% traffic spikes.

The batch size threshold, chosen at roughly 80% of maximum throughput, maintains the mean time per token at less than 80% of that achieved with the queue size threshold chosen at maximum throughput.

Toward better autoscaling

Autoscaling on GPU utilization can overprovision LLM workloads, driving up the cost of meeting your performance goals.

By autoscaling on LLM server metrics, you can reach your latency or throughput goals while spending as little as possible on accelerators. Batch size lets you target a specific tail latency; queue size lets you optimize for throughput.
