-
Notifications
You must be signed in to change notification settings - Fork 300
Fix reconnect logging and timing depending on error type #523
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Tested invalid haproxy config (based on the other #481 ), which looks like it behaves correctly. Don't have a test case for the timeout issue, but the logic all looks correct to me. If no feedback from @jdef or @GoelDeepak by Thursday evening, I'll merge (unless you wanna go ahead and merge now). |
Also, once this is merged, I'll cut a release (1.11.2) |
Also of note: I also fixed a bunch of flake8 problems introduced by the validation PR - not sure why they didn't fail for that. I think pyflakes got updated and now checks for additional things like "bare excepts." Another fix was adding |
marathon_lb.py
Outdated
logger.info(m.format(e_msg)) | ||
currentWaitSeconds = 0 | ||
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this raise
trigger a reconnect? I'm not familiar with python exception handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! This would actually exit the loop completely and not get caught by the following except
. The intended behavior would be to log the exception and then reconnect with backoff as normal.
marathon_lb.py
Outdated
# 'Operation too slow. Less than 1 bytes/sec transferred | ||
# the last 300 seconds' | ||
# In this case we should immediately reconnect | ||
# without a backoff. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just trying to understand the reasoning for reconnecting immediately. How will it help the issue? Why not wait for the backoff wait time to kick in to allow the congestion to dissipate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error number 28 happens when there is no activity on the marathon event stream for 5 minutes. We're reconnecting immediately in case the connection to marathon died silently so that we miss as few events as possible (see https://github.com/mesosphere/marathon-lb/blob/master/utils.py#L161-L170).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah !! This makes a lot more sense now. Probably rewrite the comment to state what you just did in explicit terms. "Already waited for 5 minutes so we need to attempt a reconnection immediately".
|
||
return apps | ||
|
||
except requests.exceptions.ConnectionError as e: | ||
logger.error("Connection error({0}): {1}".format( | ||
e.errno, e.strerror)) | ||
except: | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be just logging the exception as a string else won't be able to capture the reason. Same goes for other instances of this exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line logger.exception("Unexpected error!")
right after this one will actually print the exception as well as a stack trace if executed from within an exception context (as this is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it !!
@jdef @asridharan @justinrlee @GoelDeepak please check 9493c48 - I've fixed a bug I noticed while working on this and testing it. The |
marathon_lb.py
Outdated
# the last 300 seconds' | ||
# This happens when there is no activity on the marathon | ||
# event stream for the last 5 minutes. In this case we | ||
# should immediately in case the connection to marathon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/immediately/immediately reconnect
…e#523) * Fix reconnect logging and timing depending on error type * cleanup flake8 errors * add missing build-essential * start at most one thread and close connection when reconnecting
* Change README.md (d2iq-archive#487) Proposed change to README.md to warn users about use of zdd.py on a production environment. * Add container syslogd support and fix 'Waiting for Pids' race condition (d2iq-archive#505) * Add additional debug logs in reload * Adding syslog support * Make syslogd and retry reload configurable * Update documentation * Make infinite retries optional, run syslogd with runsvdir * Update base container to resolve CVE-2017-12424 (d2iq-archive#506) * Update base container to resolve CVE-2017-12424 Updated base container to debian:buster to resolve this attack vector. https://security-tracker.debian.org/tracker/CVE-2017-12424 * delete unnecessary swp file * Fix grammar in tini section of README.md (d2iq-archive#442) * exclude bad apps from generated config (d2iq-archive#481) exclude bad apps from generated config * Fix reconnect logging and timing depending on error type (d2iq-archive#523) * Fix reconnect logging and timing depending on error type * cleanup flake8 errors * add missing build-essential * start at most one thread and close connection when reconnecting * Filter events on server side, in order to decrease load on Marathon (d2iq-archive#541) * Filter events on server side, in order to decrease load on Marathon * Use lightweight Marathon events. * Update official Marathon-LB version to 1.11.3 * Fix error with bash run script, percentage symbol was not escaped
Hello guys I'm facing this error message in marathon logs. |
Addresses the problem of continuously throwing the following error:
The fix is to reconnect immediately when this error is encountered. The changes also include a slight fix to the logging - it was not accurately displaying the number of seconds to wait previously.
After the fix, the log output for the above case is: