Tanay Tummalapalli

K8s Vertical Pod Autoscaler's Algorithm

I tried to understand how the Vertical Pod Autoscaler (VPA) works and found a comment[1] by @yashbutwala explaining briefly how it works:

recommendations are calculated using decaying histogram of weighted samples from the metrics server, where the newer samples are assigned higher weights; older samples are decaying and hence affect less and less w.r.t. to the recommendations. CPU is calculated using the 90th percentile of all cpu samples, and memory is calculated using the 90th percentile peak over the 8 day window

I wanted to dig deeper to get at the essence of the recommender’s algorithm. This is my attempt to document my digging.

Note: This post is current as of commit 402ea4176.

Background

The VPA solves the problem of optimal resource allocation for CPU and Memory. The solutions to this problem take a few different approaches[2]:

  1. Threshold-based rules
  2. Reinforcement Learning
  3. Queueing Theory
  4. Time-series analysis
  5. Control Theory

The VPA takes the Time-series analysis approach to solving the problem.

I don’t go into more high-level detail about the VPA in this post.
TL;DR: It reads metrics from Kubernetes’ metrics server and analyzes the resulting time series of samples to recommend resources.
This talk is a really good introduction to the VPA:

I highly recommend reading Google’s paper on autoscalers for their clusters[3] as it seems to form the basis for the VPA’s design. I refer to it as the “reference paper” in this post.

The algorithm

In essence, the vertical pod autoscaler’s algorithm aggregates the collected sample data to compute a recommendation. This section presents the individual abstractions the algorithm is composed of, in their order of application.

Raw Signal

The raw signal is composed of resource usage samples captured at regular intervals.
For CPU, each usage sample is used directly; it corresponds to the metric container_cpu_usage_seconds_total. For memory, only the maximum sample value within a given aggregation interval (default: 24h) is kept; it corresponds to the metric container_memory_working_set_bytes.
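
A minimal sketch of this difference, assuming an illustrative Sample type and memoryPeaks helper (not the VPA’s actual types): CPU samples are fed onwards as-is, while memory samples are first reduced to one peak per aggregation interval.

// Illustrative sketch: CPU usage samples are used as-is, while memory samples
// are first reduced to one peak per aggregation interval (default 24h).
package sketch

import "time"

type Sample struct {
	Time  time.Time
	Value float64 // CPU cores or memory bytes
}

// memoryPeaks keeps only the maximum memory usage observed in each
// aggregation interval, so the histogram later stores peaks rather than
// every raw sample.
func memoryPeaks(raw []Sample, interval time.Duration) []Sample {
	peaks := map[int64]Sample{}
	for _, s := range raw {
		key := s.Time.UnixNano() / int64(interval)
		if p, ok := peaks[key]; !ok || s.Value > p.Value {
			peaks[key] = s
		}
	}
	out := make([]Sample, 0, len(peaks))
	for _, p := range peaks {
		out = append(out, p)
	}
	return out
}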

The Histogram

The raw resource usage signal is then aggregated into a histogram to save space. The histogram partitions the input metric (resource usage) into buckets, and each bucket’s value is the total weight of the samples that fall into it. (We get into weights in detail in a later section.) To keep things simple, consider the weight of a sample to be its number of occurrences for now.

Exponential bucketing scheme

Generally, the buckets of a histogram are of equal width. Eg: for the range \([1, 10]\) with \(5\) buckets, each bucket would span \(2\) values. However, for a large range of values, many buckets may end up sparsely populated. To avoid that, the VPA uses a histogram with an exponential bucketing scheme, where the bucket sizes grow exponentially. This gives low granularity for extremely large values while keeping high granularity for the smaller, more likely values.

The exponential histogram also has a config option named epsilon, but we don’t need to know about it for the purposes of this post.

Default ratio: 1.05, i.e. each bucket is 5% larger than the previous one.
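
Here is a minimal sketch of such a bucketing scheme; the function names and the first-bucket size are illustrative, not the VPA’s actual histogram options.

// Sketch of exponentially growing buckets: each bucket is `ratio` times wider
// than the previous one, so bucket boundaries follow a geometric series.
// With the default ratio of 1.05, doubling a value only moves it about
// log(2)/log(1.05) ≈ 14 buckets further, keeping large values coarse while
// small values stay finely resolved.
package sketch

import "math"

// bucketStart returns the lower boundary of bucket n:
// firstBucketSize * (ratio^n - 1) / (ratio - 1).
func bucketStart(n int, firstBucketSize, ratio float64) float64 {
	return firstBucketSize * (math.Pow(ratio, float64(n)) - 1) / (ratio - 1)
}

// findBucket inverts bucketStart to locate the bucket a value falls into.
func findBucket(value, firstBucketSize, ratio float64) int {
	if value < firstBucketSize {
		return 0
	}
	return int(math.Log(value*(ratio-1)/firstBucketSize+1) / math.Log(ratio))
}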

Percentiles

The \(p\)-th percentile of the histogram is approximated by scanning the buckets in increasing order of value and taking the boundary of the first bucket at which the cumulative weight reaches a fraction \(p\) of the total weight. This gives us a way to compute an approximate percentile cheaply from the aggregated samples.
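
A minimal sketch of that computation (the names here are illustrative, not the VPA’s actual histogram API):

// Sketch: walk the buckets in increasing order of value and return the upper
// edge of the first bucket at which the accumulated weight reaches the
// requested fraction of the total weight.
package sketch

// percentile takes per-bucket weights and the corresponding bucket start
// values (e.g. as produced by the exponential bucketing scheme above).
func percentile(p float64, weights, bucketStarts []float64) float64 {
	total := 0.0
	for _, w := range weights {
		total += w
	}
	if total == 0 {
		return 0
	}
	partial := 0.0
	for i, w := range weights {
		partial += w
		if partial >= p*total {
			if i+1 < len(bucketStarts) {
				return bucketStarts[i+1] // upper edge of bucket i
			}
			return bucketStarts[i] // last bucket: fall back to its start
		}
	}
	return 0
}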

Default percentile: 90% for target.

The weight of a sample

In a bare histogram, the weight of a bucket is its occurrence count. For more sophisticated use cases, the assigned weight can be tweaked to introduce a “bias”. The recommender algorithm introduces the following biases:

  1. Exponential Decay multiplier
  2. Load-Adjusted CPU usage multiplier

Exponential Decay Multiplier

The Google Autopilot paper mentions the following rationale behind introducing an exponential decay multiplier:

We want the limits to increase swiftly in response to rising usage, but reduce slowly after the load decreases to avoid a too-rapid response to temporary downward workload fluctuations.

To “bias” the weight of the histogram’s samples towards more recent samples, the weight of a sample is multiplied by an exponentially decaying multiplier:

$$2^{\Large {\frac{t - t_0}{\lambda}}}$$

where \(t - t_0\) is the time elapsed between the sample and a reference timestamp, and \(\lambda\) is the half-life (24 hours by default). Newer samples get a larger multiplier, which is equivalent to letting older samples decay relative to them.
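
A small sketch of this multiplier, directly implementing the formula above (names are illustrative):

// Sketch of the decay multiplier applied to a sample's weight. With the
// default 24h half-life, a sample taken 24h after the reference timestamp
// carries twice the weight of one taken at the reference timestamp, which is
// equivalent to decaying the older sample.
package sketch

import (
	"math"
	"time"
)

func decayMultiplier(sampleTime, referenceTime time.Time, halfLife time.Duration) float64 {
	return math.Pow(2, float64(sampleTime.Sub(referenceTime))/float64(halfLife))
}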

Load-adjusted CPU usage multiplier

The reference paper[3] mentions the following rationale for using load-adjusted weights for CPU:

In many cases, we want to ensure that a given percentile of the offered load can be served when the limit is set to accommodate the offered load, rather than simply a count of times that instantaneous observed load can be handled – i.e, we want to weight the calculation by the load, not the sample count.

A figure in the reference paper, “Load-adjusted CPU usage”, illustrates this point visually.
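
As far as I can read the code, a CPU sample’s weight is derived from the container’s CPU request (with a small minimum), rather than a plain count of one. The sketch below illustrates that idea; the interface and constant are assumptions for illustration, not the VPA’s actual API.

// Illustrative sketch of load-adjusted weighting: a CPU usage sample is added
// with a weight derived from the container's CPU request (floored by a small
// minimum weight), instead of simply counting each sample once.
package sketch

import (
	"math"
	"time"
)

// weightedHistogram is a stand-in for the decaying histogram described above;
// the decay multiplier is assumed to be applied internally based on sampleTime.
type weightedHistogram interface {
	AddSample(value, weight float64, sampleTime time.Time)
}

const minSampleWeight = 0.1 // illustrative floor so tiny requests still count

func addCPUSample(h weightedHistogram, usageCores, requestCores float64, t time.Time) {
	h.AddSample(usageCores, math.Max(requestCores, minSampleWeight), t)
}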

Safety margin

This is simply a percentage margin by which the recommended request is scaled up for safety.

$$recommendation = (1 + safetyMargin) \times recommendation$$

Default value: 15%
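
As a quick worked example with the default margin, a raw target of 0.1 vCPU becomes:

$$(1 + 0.15) \times 0.1\,\text{vCPU} = 0.115\,\text{vCPU}$$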

Confidence Multiplier

As the VPA recommends resources based on historical resource usage, a recommendation computed from very little data is unlikely to be accurate. The confidence multiplier was therefore introduced to account for this. From a comment in the code:

// ...  This means
// that the updater will be less eager to evict pods with short history
// in order to reclaim unused resources.

The confidence multiplier is a heuristic defined as:

$$\Biggl[{1 + {\frac{multiplier}{confidence}}} \Biggr] ^ {exponent}$$

\(multiplier\) and \(exponent\) are heuristic values specified statically in code.

\(confidence\) is a measure of the available historical data. It is defined as:

$$confidence = min(lifespan, samplesAmount)$$

where \(lifespan = t_n - t_0\) measured in days, and \(samplesAmount = {\large \frac{numSamples}{60 \times 24} }\) such that it represents the number of days for which samples are available assuming a rate of 1 sample/minute.
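
A small sketch of these two formulas; the specific \(multiplier\)/\(exponent\) constants used by each estimator are omitted, and the names are illustrative.

// Sketch of the confidence measure and multiplier. The VPA's actual estimators
// plug in fixed multiplier/exponent constants (e.g. a negative exponent pulls
// the lower bound down when history is short); those constants are omitted here.
package sketch

import "math"

// confidence is the amount of history, in days, limited both by how long the
// aggregation has existed and by how many samples it holds (at 1 sample/minute).
func confidence(lifespanDays float64, numSamples int) float64 {
	samplesDays := float64(numSamples) / (60 * 24)
	return math.Min(lifespanDays, samplesDays)
}

func confidenceMultiplier(multiplier, exponent, confidenceDays float64) float64 {
	return math.Pow(1+multiplier/confidenceDays, exponent)
}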

Upper bound and Lower bound

The VPA recommender computes three recommendations with different settings - Lower bound, Target, and Upper bound. The updater component of the VPA uses these to update a pod if its current resource request falls outside the range \((lowerBound, upperBound)\).
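
In other words, the check the updater effectively performs per resource looks roughly like this sketch (names are illustrative):

// Sketch: a pod whose current request falls outside [lowerBound, upperBound]
// becomes a candidate for an update.
package sketch

func needsUpdate(currentRequest, lowerBound, upperBound float64) bool {
	return currentRequest < lowerBound || currentRequest > upperBound
}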

Minimum Resources

The recommender imposes a minimum value for both resources - CPU and Memory.

Default values:

  1. CPU - 0.025 vCPU
  2. Memory - 250 MB
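
A sketch of how this floor can be applied to a recommendation; the constants mirror the defaults listed above, and the function name is illustrative.

// Sketch: clamp a recommendation to the minimum allowed resources.
package sketch

import "math"

const (
	minCPUCores    = 0.025             // 0.025 vCPU
	minMemoryBytes = 250 * 1024 * 1024 // 250 MB
)

func applyMinimums(cpuCores, memoryBytes float64) (float64, float64) {
	return math.Max(cpuCores, minCPUCores), math.Max(memoryBytes, minMemoryBytes)
}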

Conclusion

To understand the algorithm, I think the final piece of the puzzle is the set of considerations the reference paper gives for choosing a suitable statistic. For memory, the risk of an incorrect recommendation is an OOM kill. For critical workloads, the maximum peak memory usage is therefore used to avoid any disruption due to an OOM; for less critical ones, \(P_{98}\) or even lower percentiles can be used. For CPU, the risk is much less grave, since an incorrect recommendation results at worst in CPU throttling. Thus a percentile (eg: \(P_{95}\)) or even the average can be used if the workload is not CPU-bound.

References

  1. https://github.com/kubernetes/autoscaler/issues/2747#issuecomment-616037197
  2. https://pdfs.semanticscholar.org/74d8/8b2c2bbba11c42439ca9184c8f90f53c43fe.pdf
  3. https://research.google/pubs/autopilot-workload-autoscaling-at-google-scale/

Thanks to Shlok for the initial discussion that sparked my interest in the VPA, Ashu for encouraging me to explore the VPA, Suraj and Gaurav for reviewing this post.

#Algorithms #Kubernetes #Vertical-Pod-Autoscaler