-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct/cg/sd: reconnect and retry on dbus connection error #2923
Conversation
@wzshiming PTAL |
I was wrong in my initial analysis that this is a regression caused by PR #2203. In fact it's not, it has always been working that way (see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1634092 from 2018). |
[@kolyshkin: documentation nits] Signed-off-by: Shiming Zhang <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Generalize isUnitExists as isDbusError, and use errors.As while at it (which can handle wrapped errors as well). Signed-off-by: Kir Kolyshkin <[email protected]>
[@kolyshkin: doc nits, use dbus.ErrClosed and isDbusError] Signed-off-by: Shiming Zhang <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Shiming Zhang <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Instead of reconnecting to dbus after some failed operations, and returning an error (so a caller has to retry), reconnect AND retry in place for all such operations. This should fix issues caused by a stale dbus connection after e.g. a dbus daemon restart. Signed-off-by: Kir Kolyshkin <[email protected]>
d.RLock() | ||
if conn := d.conn; conn != nil { | ||
d.RUnlock() | ||
return conn, nil | ||
} | ||
d.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be
d.RLock()
conn := d.conn
d.RUnlock()
if conn != nil { ... }
But it doesn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
PR #2203 started to reuse the same dbus connection. While this improves the performance,In case the dbus daemon is ever restarted, the connection is no longer valid and every operation
fails. This is a minor concern for short-lived runc, but much more of a issue in case there is
a long-running daemon (e.g.
cri-o
) is using runc's libcontainer, as the connection is neverretried and the only remedy is to restart the daemon.
The solution to the above is to check the errors returned for
dbus: connection closed by user
error, and try to re-connect on that. This is what PR #2862 does.
This is a carry of #2862, implementing the idea of retry-in-place (first described
at #2862 (comment) and #2862 (comment)) on top of what it does.
For more info, see commit messages as well as #2862.
Fixes:
Changelog entry