
antrea-ovs container restarted multiple times causing antrea-agent container to be forever stuck in "Ready: False" state #871

Closed
alex-vmw opened this issue Jun 25, 2020 · 2 comments · Fixed by #873

Comments

@alex-vmw

Describe the bug
The antrea-ovs container restarted multiple times, causing the antrea-agent container to be stuck forever in the "Ready: False" state.

To Reproduce
Not sure how to reproduce, but here is what we know happened:

  1. On 06/25/2020 at 6:17am Pacific Time we received the first alert from Prometheus indicating that the antrea-ovs container on the node had restarted 6 times in the past hour.
  2. At 6:42am we started receiving alerts indicating that DNS tests were failing on the node, and upon review we realized that container networking wasn't working.
  3. When we looked at the antrea pod we saw that the antrea-agent container had also restarted, but got stuck in the "Ready: False" state:
antrea-agent-ghxvn                     1/2     Running   7          40h    10.170.168.35   10.170.168.35   <none>           <none>

  antrea-agent:
    State:          Running
      Started:      Thu, 25 Jun 2020 06:08:49 -0700
    Ready:          False
    Restart Count:  1

  antrea-ovs:
    State:          Running
      Started:      Thu, 25 Jun 2020 06:09:26 -0700
    Ready:          True
    Restart Count:  6

Events:
  Type     Reason     Age                     From                    Message
  ----     ------     ----                    ----                    -------
  Warning  Unhealthy  68s (x2499 over 6h57m)  kubelet, 10.170.168.35  Readiness probe failed: Get https://127.0.0.1:10350/healthz: dial tcp 127.0.0.1:10350: connect: connection refused

Expected
antrea-agent shouldn't get stuck in the "Ready: False" state upon restart; it should be able to recover.

Actual behavior
Sometimes, when the antrea-ovs container is restarted multiple times, the antrea-agent container gets stuck forever in the "Ready: False" state, which of course leads to Pod networking being down.

Please provide the following information:

  • Antrea version: v0.7.2
  • Kubernetes version: 1.15.4
  • Container runtime: Docker 18.0.6
  • Linux kernel version on the Kubernetes Nodes: 4.19.43-coreos

Additional context
wdc-prd-decc-002-md-dy-minion020-logs.zip

@alex-vmw added the bug label on Jun 25, 2020
@antoninbas self-assigned this on Jun 26, 2020
antoninbas added a commit to antoninbas/antrea that referenced this issue Jun 26, 2020
From 0.4.1 to 0.4.5.
In version 0.4.1, no error is returned by go-iptables when running
`iptables --version` or parsing its output fails (during
initialization). This leads to the library not being able to correctly
detect whether the iptables version supports `--wait`, which ultimately
can lead to a deadlock for the Antrea agent.

See coreos/go-iptables#69.

By updating the go-iptables version, we ensure that any such error will
be returned to Antrea, logged, and cause the Antrea agent to fail and
eventually restart.

It is unclear what can cause iptables version detection to fail but
because of the added logging, we will have a better shot at getting to
the root cause if it happens in production again.

Fixes antrea-io#871
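
For context, a minimal sketch (not Antrea's actual initialization code) of the behavior this fix relies on: with go-iptables 0.4.5, `iptables.New()` returns an error when running `iptables --version` or parsing its output fails, so the caller can log the error and exit instead of continuing with an incorrectly detected `--wait` capability. The `klog`-based exit path below is an assumption for illustration only.

```go
package main

import (
	"github.com/coreos/go-iptables/iptables"
	"k8s.io/klog"
)

func main() {
	// With go-iptables >= 0.4.5, New() surfaces failures to run or parse
	// `iptables --version`; in 0.4.1 these were silently ignored, leaving
	// the library unable to tell whether `iptables --wait` is supported.
	ipt, err := iptables.New()
	if err != nil {
		// Assumed handling for illustration: log and exit so the agent
		// container restarts, rather than hanging later when rules are
		// programmed without correct xtables lock handling.
		klog.Fatalf("error creating iptables client: %v", err)
	}

	// Subsequent rule operations also return errors that can be logged.
	if err := ipt.AppendUnique("filter", "FORWARD", "-j", "ACCEPT"); err != nil {
		klog.Errorf("error appending iptables rule: %v", err)
	}
}
```
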
@antoninbas added the kind/bug label on Jun 26, 2020
@alex-vmw (Author)

Attaching an antrea-agent process core file we collected during the troubleshooting session.
antrea-agent-core.zip

@antoninbas (Contributor)

@alex-vmw Thanks. I opened an issue with the fix we discussed today.

antoninbas added a commit that referenced this issue Jun 29, 2020
GraysonWu pushed a commit to GraysonWu/antrea that referenced this issue Sep 22, 2020