
A deadlock in DNS response handler #5565

Closed
tnqn opened this issue Oct 10, 2023 · 3 comments · Fixed by #5566
Labels
  • area/network-policy: Issues or PRs related to network policies.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
  • reported-by/end-user: Issues reported by end users.

Comments

@tnqn
Member

tnqn commented Oct 10, 2023

Describe the bug

It was reported that Pods running on a particular Node have been unable to resolve any domain since a certain point in time. antrea-agent shows the following errors:

E1003 14:11:07.282523       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:09.283534       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:11.284233       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:13.284435       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:15.285072       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:17.285598       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:19.286775       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:21.287884       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"
E1003 14:11:23.289140       1 packetin.go:133] "PacketIn handler failed to process packet" err="rules not synced within 2s for DNS reply, dropping packet" handler="dnsresponse"

After investigation, I found a deadlock in fqdnController.syncDirtyRules:

if !addressUpdate {
    f.ruleSyncTracker.mutex.Lock()
    defer f.ruleSyncTracker.mutex.Unlock()
    // If there is no address update for this FQDN, and rules selecting this FQDN
    // were all previously realized successfully, then there will be no dirty rules
    // left to be synced. On the contrary, if some rules that select this FQDN are
    // still in the dirtyRules set of the ruleSyncTracker, then only those rules
    // should be retried for reconciliation, and packetOut shall be blocked.
    dirtyRules = f.ruleSyncTracker.dirtyRules.Intersection(dirtyRules)
}
if len(dirtyRules) > 0 {
    klog.V(4).InfoS("Dirty rules blocking packetOut", "dirtyRules", dirtyRules)
    f.ruleSyncTracker.subscribe(waitCh, dirtyRules)
    for r := range dirtyRules {
        f.dirtyRuleHandler(r)
    }
} else {
    // ...
If dirtyRules is not empty, f.ruleSyncTracker.subscribe tries to acquire the same lock a second time and never succeeds, so from then on no DNS response can be sent out.
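To make the failure mode concrete, here is a minimal, self-contained Go sketch of the same double-lock pattern. It is not the Antrea code: the types, fields, and signatures are simplified stand-ins, and only the locking structure mirrors the excerpt above. Running it trips Go's runtime deadlock detector, because sync.Mutex is not reentrant and the mutex held via defer in syncDirtyRules is requested again by subscribe on the same goroutine.

package main

import (
    "fmt"
    "sync"
)

type tracker struct {
    mutex      sync.Mutex
    dirtyRules map[string]struct{}
}

// subscribe locks the tracker's mutex, mirroring the real subscribe method's
// need to read and update shared state.
func (t *tracker) subscribe(rules map[string]struct{}) {
    t.mutex.Lock()
    defer t.mutex.Unlock()
    fmt.Println("subscribed to", len(rules), "dirty rules")
}

func syncDirtyRules(t *tracker, addressUpdate bool, dirtyRules map[string]struct{}) {
    if !addressUpdate {
        t.mutex.Lock()
        defer t.mutex.Unlock() // the mutex stays held until this function returns
        // (intersection with t.dirtyRules omitted in this sketch)
    }
    if len(dirtyRules) > 0 {
        // Deadlock: subscribe() tries to lock the mutex this goroutine already holds.
        t.subscribe(dirtyRules)
    }
}

func main() {
    t := &tracker{dirtyRules: map[string]struct{}{"rule-1": {}}}
    syncDirtyRules(t, false, map[string]struct{}{"rule-1": {}}) // blocks forever
}

In general the way out of this pattern is to avoid calling a method that re-acquires the lock while it is still held; the actual fix for Antrea is in #5566.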

To Reproduce

I haven't reproduced it myself, but it should be reproducible by constructing a case that makes the above dirtyRules set non-empty.

Versions:

  • Antrea version (Docker image tag).
    Found in v1.11.3, but it likely exists in all versions.
@tnqn tnqn added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/network-policy Issues or PRs related to network policies. labels Oct 10, 2023
@tnqn
Member Author

tnqn commented Oct 10, 2023

@Dyanngg @GraysonWu could one of you prioritize this one?

@tnqn
Member Author

tnqn commented Oct 12, 2023

I haven't figured out why f.ruleSyncTracker.dirtyRules is not empty when there is no NetworkPolicy realization error and packetin events are handled sequentially. But I found something suspicious and am sharing it to see if others can piece together a complete picture of what happened:

  • When NetworkPolicyController realizes an FQDN rule, its reconciler acquires priorityAssigner.mutex first, then calls fqdnController.addFQDNRule, which acquires fqdnSelectorMutex.
  • When the PacketIn handler processes a DNS packetin event, fqdnController acquires fqdnSelectorMutex first, then notifies the NetworkPolicy reconciler to realize dirty FQDN rules; it releases fqdnSelectorMutex after that and waits to be notified that the rules have been realized.
  • When making DNS requests, fqdnController acquires fqdnSelectorMutex and releases it; there are 4 such workers.

Could the following happen? (A simplified model of this sequence is sketched after the list.)

  1. The 1st DNS packet's handler enqueued a dirty rule.
  2. The fqdn workers made DNS requests and held fqdnSelectorMutex alternately.
  3. The rule reconciler tried to acquire fqdnSelectorMutex but didn't succeed within 2s.
  4. The 1st DNS packet's handler timed out, dropped the packet, and left a dirty rule behind.
  5. The 2nd DNS packet's handler acquired fqdnSelectorMutex and ruleSyncTracker.mutex, and then tried to acquire the latter mutex a second time due to the existing dirty rule.
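Below is a toy model of this hypothesized sequence, again not the Antrea implementation: the ruleSyncTracker here is a hypothetical stand-in whose subscribe never reports the rule as synced, representing a reconciler that cannot acquire fqdnSelectorMutex in time. It only shows how a 2s timeout that drops the packet without cleaning up would leave a leftover entry in dirtyRules, which is the precondition for the double-lock shown earlier.

package main

import (
    "fmt"
    "sync"
    "time"
)

type ruleSyncTracker struct {
    mutex      sync.Mutex
    dirtyRules map[string]struct{}
}

// subscribe marks the rules as dirty and returns a channel that would be closed
// once they are synced; here it is never closed, modelling a starved reconciler.
func (t *ruleSyncTracker) subscribe(rules []string) <-chan struct{} {
    t.mutex.Lock()
    defer t.mutex.Unlock()
    for _, r := range rules {
        t.dirtyRules[r] = struct{}{}
    }
    return make(chan struct{})
}

func main() {
    t := &ruleSyncTracker{dirtyRules: map[string]struct{}{}}

    // Steps 1, 3, 4: the first DNS reply subscribes a dirty rule; the reconciler
    // never syncs it within 2s, so the handler gives up and drops the packet,
    // leaving "rule-1" behind in dirtyRules.
    synced := t.subscribe([]string{"rule-1"})
    select {
    case <-synced:
        fmt.Println("rules synced, packet released")
    case <-time.After(2 * time.Second):
        fmt.Println("timeout: packet dropped, rule-1 still dirty")
    }

    // Step 5: the next DNS reply now finds a non-empty dirtyRules set, which is
    // exactly the precondition for the double-lock shown in the earlier sketch.
    t.mutex.Lock()
    fmt.Println("leftover dirty rules:", len(t.dirtyRules))
    t.mutex.Unlock()
}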

@Dyanngg
Contributor

Dyanngg commented Oct 12, 2023

I'm not too convinced by the above theory. The reason is that makeDNSRequest does not hold the fqdnSelectorMutex until a DNS response is received. After a response is received, it holds the fqdnSelectorMutex only to find the dirty rules and enqueue them, which should be non-blocking, unless of course a deadlock happens in one of the fqdn workers. So I don't think (3) will happen unless a deadlock already exists, and we can't use that assumption to arrive at (5), which is supposed to prove that some sequence of events led to the deadlock.
