
[VPA] Incorrect OOM handling #3078

Closed
krzysied opened this issue Apr 22, 2020 · 7 comments · Fixed by #3079

@krzysied
Contributor

It often happens that OOM memory values are not recorded.
The logs contain many entries like:

W0421 16:47:20.131118 1 cluster_feeder.go:406] Failed to record OOM {Timestamp:2020-04-21 16:46:24 +0000 UTC Memory:476450463 ContainerID:{PodID:{Namespace:vertical-pod-autoscaling-4228 PodName:hamster-6ddf9fb54d-wzzzd} ContainerName:hamster}}. Reason: error while recording OOM for {{vertical-pod-autoscaling-4228 hamster-6ddf9fb54d-wzzzd} hamster}, Reason: adding OOM sample failed

@krzysied
Contributor Author

/cc @bskiba @jbartosik
/assign

@krzysied
Contributor Author

Based on extensive debugging and deep code analysis, I found the following logic assumptions:

  • OOMs are treated as metric samples:

    if !container.addMemorySample(&oomMemorySample, true) {

  • If a sample is older than the newest processed one, the sample is skipped:

    if !sample.isValid(ResourceMemory) || ts.Before(container.lastMemorySampleStart) {
        return false // Discard invalid or outdated samples.
    }

  • Metrics are processed before OOMs (a standalone sketch of the channel-drain pattern used here follows this list):

    func (feeder *clusterStateFeeder) LoadRealTimeMetrics() {
        containersMetrics, err := feeder.metricsClient.GetContainersMetrics()
        if err != nil {
            klog.Errorf("Cannot get ContainerMetricsSnapshot from MetricsClient. Reason: %+v", err)
        }
        sampleCount := 0
        droppedSampleCount := 0
        for _, containerMetrics := range containersMetrics {
            for _, sample := range newContainerUsageSamplesWithKey(containerMetrics) {
                if err := feeder.clusterState.AddSample(sample); err != nil {
                    // Not all pod states are tracked in memory saver mode
                    if _, isKeyError := err.(model.KeyError); isKeyError && feeder.memorySaveMode {
                        continue
                    }
                    klog.Warningf("Error adding metric sample for container %v: %v", sample.Container, err)
                    droppedSampleCount++
                } else {
                    sampleCount++
                }
            }
        }
        klog.V(3).Infof("ClusterSpec fed with #%v ContainerUsageSamples for #%v containers. Dropped #%v samples.", sampleCount, len(containersMetrics), droppedSampleCount)
    Loop:
        for {
            select {
            case oomInfo := <-feeder.oomChan:
                klog.V(3).Infof("OOM detected %+v", oomInfo)
                if err = feeder.clusterState.RecordOOM(oomInfo.ContainerID, oomInfo.Timestamp, oomInfo.Memory); err != nil {
                    klog.Warningf("Failed to record OOM %+v. Reason: %+v", oomInfo, err)
                }
            default:
                break Loop
            }
        }
        metrics_recommender.RecordAggregateContainerStatesCount(feeder.clusterState.StateMapSize())
    }
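
A side note on the last snippet: the labeled break Loop is what lets the feeder drain every OOM event already queued on the channel, without blocking, only after all metric samples have been added. A minimal standalone illustration of that pattern (the string channel below is just a stand-in for feeder.oomChan, not the real type):

    package main

    import "fmt"

    func main() {
        // Buffered channel standing in for feeder.oomChan.
        oomChan := make(chan string, 3)
        oomChan <- "oom-1"
        oomChan <- "oom-2"

        // Drain everything currently queued without blocking.
        // The label matters: a plain "break" would only leave the
        // select statement, not the surrounding for loop.
    Loop:
        for {
            select {
            case oom := <-oomChan:
                fmt.Println("processing", oom)
            default:
                break Loop
            }
        }
        fmt.Println("channel drained, continuing")
    }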

@krzysied
Contributor Author

The problematic scenario looks like this:

  • There is an OOM at 00:00:44.
  • We have a metric sample covering 00:00:00 to 00:00:45.
  • Metrics are not as precise as OOM values.

The recommender processes the metrics first, setting lastMemorySampleStart to 00:00:45. It processes the OOM after the metrics, so the OOM sample is skipped because its measurement time is before lastMemorySampleStart.

This results in a lower memory value being stored.

Moreover, there is always a chance of this unwanted behavior in any given cycle, so there can be a scenario where the memory recommendation is never correctly bumped up.
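
To make this concrete, here is a minimal, self-contained sketch that reproduces the drop; containerState and addMemorySample below are simplified stand-ins for the VPA types, not the actual code:

    package main

    import (
        "fmt"
        "time"
    )

    // Simplified stand-in for the VPA container state: only the fields
    // relevant to the ordering check are kept.
    type containerState struct {
        lastMemorySampleStart time.Time
        memoryPeak            uint64
    }

    // addMemorySample mirrors the discard rule quoted above: any sample whose
    // timestamp is before the newest processed one is dropped.
    func (c *containerState) addMemorySample(ts time.Time, memory uint64) bool {
        if ts.Before(c.lastMemorySampleStart) {
            return false // Discard outdated samples.
        }
        c.lastMemorySampleStart = ts
        if memory > c.memoryPeak {
            c.memoryPeak = memory
        }
        return true
    }

    func main() {
        base := time.Date(2020, 4, 21, 16, 46, 0, 0, time.UTC)
        c := &containerState{}

        // The regular metric sample covering up to 00:00:45 is processed first...
        c.addMemorySample(base.Add(45*time.Second), 200000000)

        // ...so the OOM sample from 00:00:44 arrives "late" and is rejected,
        // even though it carries the higher (and more important) memory value.
        ok := c.addMemorySample(base.Add(44*time.Second), 476450463)
        fmt.Println("OOM sample recorded:", ok) // prints: OOM sample recorded: false
    }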

@krzysied
Contributor Author

Quick update on the impact of the change:

  • We are not losing OOM memory samples. This allows VPA to reach the correct memory recommendation a few cycles (minutes) faster.
  • VPA behaves more deterministically. Previously there was only a chance of recording the sample, so test duration varied, which often resulted in tests timing out.

Test duration now consistently stays below 5 minutes.
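
For reference, a minimal sketch of the general idea behind the fix, assuming the ordering check is simply relaxed for OOM samples; the types below are simplified stand-ins, not the actual #3079 diff:

    package main

    import (
        "fmt"
        "time"
    )

    // Simplified container state, as in the earlier sketch.
    type containerState struct {
        lastMemorySampleStart time.Time
        memoryPeak            uint64
    }

    // addMemorySample only applies the ordering check to regular metric
    // samples, so an OOM observed just before the latest metric sample is
    // still recorded.
    func (c *containerState) addMemorySample(ts time.Time, memory uint64, isOOM bool) bool {
        if !isOOM && ts.Before(c.lastMemorySampleStart) {
            return false // Discard outdated non-OOM samples.
        }
        if ts.After(c.lastMemorySampleStart) {
            c.lastMemorySampleStart = ts
        }
        if memory > c.memoryPeak {
            c.memoryPeak = memory
        }
        return true
    }

    func main() {
        base := time.Date(2020, 4, 21, 16, 46, 0, 0, time.UTC)
        c := &containerState{lastMemorySampleStart: base.Add(45 * time.Second)}

        // The OOM sample from 00:00:44 is no longer dropped.
        ok := c.addMemorySample(base.Add(44*time.Second), 476450463, true)
        fmt.Println("OOM sample recorded:", ok) // prints: OOM sample recorded: true
    }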

@QuentinBisson

@krzysied would it make sense to make a release for this fix? VPA has not been released since January

@krzysied
Contributor Author

krzysied commented Jun 7, 2021

@QuentinBisson The fix was picked up in newer versions of VPA. It's definitely available in VPA 0.8.1 and VPA 0.9.x.
I assume most people run a Kubernetes version above 1.13, so VPA 0.8.1 can be used: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#compatibility.
IMHO there is no strong need for a new 0.7 release.

@QuentinBisson

Oh I'm sorry you are right. I did not notice this was from 2020 🤦🏻
