
A deadlock in DNS response handler #5565

Closed
tnqn opened this issue Oct 10, 2023 · 3 comments · Fixed by #5566
Labels
  • area/network-policy: Issues or PRs related to network policies.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
  • reported-by/end-user: Issues reported by end users.

Comments

@tnqn
Member

tnqn commented Oct 10, 2023

Describe the bug

It was reported that Pods running on a particular Node have been unable to resolve any domain since a certain point in time. antrea-agent shows the following errors:

E1003 14:11:07.282523       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:09.283534       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:11.284233       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:13.284435       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:15.285072       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:17.285598       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:19.286775       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:21.287884       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:23.289140       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"

After investigation, I found a deadlock in fqdnController.syncDirtyRules:

if !addressUpdate {
    f.ruleSyncTracker.mutex.Lock()
    defer f.ruleSyncTracker.mutex.Unlock()
    // If there is no address update for this FQDN, and rules selecting this FQDN
    // were all previously realized successfully, then there will be no dirty rules
    // left to be synced. On the contrary, if some rules that select this FQDN are
    // still in the dirtyRules set of the ruleSyncTracker, then only those rules
    // should be retried for reconciliation, and packetOut shall be blocked.
    dirtyRules = f.ruleSyncTracker.dirtyRules.Intersection(dirtyRules)
}
if len(dirtyRules) > 0 {
    klog.V(4).InfoS("Dirty rules blocking packetOut", "dirtyRules", dirtyRules)
    f.ruleSyncTracker.subscribe(waitCh, dirtyRules)
    for r := range dirtyRules {
        f.dirtyRuleHandler(r)
    }
} else {
    // ...
If dirtyRules is not empty, f.ruleSyncTracker.subscribe tries to acquire the same lock a second time and never succeeds, so from then on no DNS response can be sent out.
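To make the failure mode concrete, here is a minimal, self-contained Go sketch of the same double-lock pattern. It is not the Antrea code: the types, fields, and signatures are simplified stand-ins, and only the locking structure mirrors the excerpt above. Running it trips Go's runtime deadlock detector, because sync.Mutex is not reentrant and the mutex held via defer in syncDirtyRules is requested again by subscribe on the same goroutine.

package main

import (
    "fmt"
    "sync"
)

type tracker struct {
    mutex      sync.Mutex
    dirtyRules map[string]struct{}
}

// subscribe locks the tracker's mutex, mirroring the real subscribe method's
// need to read and update shared state.
func (t *tracker) subscribe(rules map[string]struct{}) {
    t.mutex.Lock()
    defer t.mutex.Unlock()
    fmt.Println("subscribed to", len(rules), "dirty rules")
}

func syncDirtyRules(t *tracker, addressUpdate bool, dirtyRules map[string]struct{}) {
    if !addressUpdate {
        t.mutex.Lock()
        defer t.mutex.Unlock() // the mutex stays held until this function returns
        // (intersection with t.dirtyRules omitted in this sketch)
    }
    if len(dirtyRules) > 0 {
        // Deadlock: subscribe() tries to lock the mutex this goroutine already holds.
        t.subscribe(dirtyRules)
    }
}

func main() {
    t := &tracker{dirtyRules: map[string]struct{}{"rule-1": {}}}
    syncDirtyRules(t, false, map[string]struct{}{"rule-1": {}}) // blocks forever
}

In general the way out of this pattern is to avoid calling a method that re-acquires the lock while it is still held; the actual fix for Antrea is in #5566.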

To Reproduce

I haven't reproduced it myself, but it should be reproducible by constructing a case that makes the above dirtyRules set non-empty.

Versions:

  • Antrea version (Docker image tag).
    Found in v1.11.3, but it likely exists in all versions.
@tnqn tnqn added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/network-policy Issues or PRs related to network policies. labels Oct 10, 2023
@tnqn
Member Author

tnqn commented Oct 10, 2023

@Dyanngg @GraysonWu could one of you prioritize this one?

@tnqn
Member Author

tnqn commented Oct 12, 2023

I haven't figured out why f.ruleSyncTracker.dirtyRules is not empty when there is no NetworkPolicy realization error and packetin events are handled sequentially. But I found something suspicious and am sharing it to see if others can piece together a complete picture of what happened:

  • When NetworkPolicyController realizes an FQDN rule, its reconciler acquires priorityAssigner.mutex first, then calls fqdnController.addFQDNRule, which acquires fqdnSelectorMutex.
  • When the PacketIn handler processes a DNS packetin event, fqdnController acquires fqdnSelectorMutex first, then notifies the NetworkPolicy reconciler to realize dirty FQDN rules; it releases fqdnSelectorMutex after that and waits to be notified that the rules have been realized.
  • When making DNS requests, fqdnController acquires fqdnSelectorMutex and releases it; there are 4 such workers.

Could the following happen? (A simplified model of this sequence is sketched after the list.)

  1. The 1st DNS packet's handler enqueued a dirty rule.
  2. The fqdn workers made DNS requests and held fqdnSelectorMutex alternately.
  3. The rule reconciler tried to acquire fqdnSelectorMutex but didn't succeed within 2s.
  4. The 1st DNS packet's handler timed out, dropped the packet, and left a dirty rule behind.
  5. The 2nd DNS packet's handler acquired fqdnSelectorMutex and ruleSyncTracker.mutex, and then tried to acquire the latter mutex a second time due to the existing dirty rule.
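Below is a toy model of this hypothesized sequence, again not the Antrea implementation: the ruleSyncTracker here is a hypothetical stand-in whose subscribe never reports the rule as synced, representing a reconciler that cannot acquire fqdnSelectorMutex in time. It only shows how a 2s timeout that drops the packet without cleaning up would leave a leftover entry in dirtyRules, which is the precondition for the double-lock shown earlier.

package main

import (
    "fmt"
    "sync"
    "time"
)

type ruleSyncTracker struct {
    mutex      sync.Mutex
    dirtyRules map[string]struct{}
}

// subscribe marks the rules as dirty and returns a channel that would be closed
// once they are synced; here it is never closed, modelling a starved reconciler.
func (t *ruleSyncTracker) subscribe(rules []string) <-chan struct{} {
    t.mutex.Lock()
    defer t.mutex.Unlock()
    for _, r := range rules {
        t.dirtyRules[r] = struct{}{}
    }
    return make(chan struct{})
}

func main() {
    t := &ruleSyncTracker{dirtyRules: map[string]struct{}{}}

    // Steps 1, 3, 4: the first DNS reply subscribes a dirty rule; the reconciler
    // never syncs it within 2s, so the handler gives up and drops the packet,
    // leaving "rule-1" behind in dirtyRules.
    synced := t.subscribe([]string{"rule-1"})
    select {
    case <-synced:
        fmt.Println("rules synced, packet released")
    case <-time.After(2 * time.Second):
        fmt.Println("timeout: packet dropped, rule-1 still dirty")
    }

    // Step 5: the next DNS reply now finds a non-empty dirtyRules set, which is
    // exactly the precondition for the double-lock shown in the earlier sketch.
    t.mutex.Lock()
    fmt.Println("leftover dirty rules:", len(t.dirtyRules))
    t.mutex.Unlock()
}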

@Dyanngg
Contributor

Dyanngg commented Oct 12, 2023

I'm not too convinced by the above theory. The reason is that makeDNSRequest does not hold the fqdnSelectorMutex until a DNS response is received. After a response is received, it holds the fqdnSelectorMutex only to find the dirty rules and enqueue them, which should be non-blocking, unless of course a deadlock happens in one of the fqdn workers. So I don't think (3) will happen unless a deadlock already exists, and we can't use that assumption to arrive at (5), which is supposed to prove that some sequence of events led to the deadlock.
