This repository was archived by the owner on Dec 4, 2024. It is now read-only.

Fix reconnect logging and timing depending on error type #523

Merged: 9 commits merged into master from ack/connection-backoff on Dec 8, 2017

Conversation

Contributor

@drewkerrigan drewkerrigan commented Dec 6, 2017

Addresses the problem of continuously throwing the following error:

2017-11-03 05:31:36,031 marathon_lb: Caught exception
Traceback (most recent call last):
File "/marathon-lb/marathon_lb.py", line 1799, in <module>
process_sse_events(marathon, processor)
File "/marathon-lb/marathon_lb.py", line 1700, in process_sse_events
for event in events:
File "/marathon-lb/marathon_lb.py", line 239, in get_event_stream
for line in resp.iter_lines():
File "/marathon-lb/utils.py", line 241, in _split_lines_from_chunks
for chunk in chunks:
File "/marathon-lb/utils.py", line 226, in _iter_chunks
self._check_curl_errors()
File "/marathon-lb/utils.py", line 230, in _check_curl_errors
raise pycurl.error(*f[1:])
pycurl.error: (28, 'Operation too slow. Less than 1 bytes/sec transferred the last 300 seconds')
2017-11-03 05:31:36,043 marathon_lb: Reconnecting in 4.5s...

The fix is to reconnect immediately when this error is encountered. The changes also include a slight fix to the logging - it was not accurately displaying the number of seconds to wait previously.

After the fix, the log output for the above case is:

2017-12-06 23:55:18,539 marathon_lb: Possible timeout detected: Operation too slow. Less than 1 bytes/sec transferred the last 300 seconds, reconnecting now...
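The backoff decision described above can be sketched roughly as follows. This is an illustrative sketch, not the committed marathon_lb code; the function and constant names are hypothetical:

```python
# Illustrative sketch of the reconnect-backoff decision; names are
# hypothetical, not the actual marathon_lb implementation.
CURL_LOW_SPEED_ERRNO = 28  # 'Operation too slow ...'

def next_wait_seconds(errno, current_wait, max_wait=60):
    """Seconds to wait before reconnecting to the event stream.

    curl errno 28 means the stream was already idle for the whole
    low-speed window, so reconnect immediately; any other error
    backs off as usual.
    """
    if errno == CURL_LOW_SPEED_ERRNO:
        return 0
    return min(current_wait * 1.5, max_wait)
```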

Contributor

@justinrlee justinrlee left a comment

Looks good to me!

@justinrlee
Contributor

Tested an invalid haproxy config (based on the other PR, #481), which looks like it behaves correctly. I don't have a test case for the timeout issue, but the logic all looks correct to me. If there's no feedback from @jdef or @GoelDeepak by Thursday evening, I'll merge (unless you wanna go ahead and merge now).

@justinrlee
Contributor

Also, once this is merged, I'll cut a release (1.11.2)

@drewkerrigan
Contributor Author

Also of note: I also fixed a bunch of flake8 problems introduced by the validation PR - not sure why they didn't fail for that. I think pyflakes got updated and now checks for additional things like "bare excepts."

Another fix was adding build-essential to the build deps because of this error: x86_64-linux-gnu-gcc: error: /usr/share/dpkg/no-pie-compile.specs: No such file or directory - That was preventing docker images from getting built.

marathon_lb.py Outdated
            logger.info(m.format(e_msg))
            currentWaitSeconds = 0
        else:
            raise
Contributor

Does this `raise` trigger a reconnect? I'm not familiar with Python exception handling.

Contributor Author

Great catch! This would actually exit the loop completely and not get caught by the following except. The intended behavior would be to log the exception and then reconnect with backoff as normal.
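The intended behavior described here could look roughly like this. A hypothetical sketch, not the committed code; the helper name and return convention are illustrative:

```python
import logging

logger = logging.getLogger("marathon_lb")

def handle_stream_exception(exc, is_timeout):
    """Hypothetical sketch: never re-raise out of the retry loop.

    Returns a backoff override: 0 for an immediate reconnect,
    None to keep the normal backoff schedule.
    """
    if is_timeout:
        logger.info("Possible timeout detected: %s, reconnecting now...", exc)
        return 0
    # Re-raising here would escape the retry loop entirely (the bug
    # caught above); logging instead lets the caller reconnect with
    # the usual backoff.
    logger.error("Unexpected error: %s, reconnecting with backoff", exc)
    return None
```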

marathon_lb.py Outdated
# 'Operation too slow. Less than 1 bytes/sec transferred
# the last 300 seconds'
# In this case we should immediately reconnect
# without a backoff.


Just trying to understand the reasoning for reconnecting immediately. How will it help the issue? Why not wait for the backoff wait time to kick in to allow the congestion to dissipate?

Contributor Author

@drewkerrigan drewkerrigan Dec 7, 2017

The error number 28 happens when there is no activity on the marathon event stream for 5 minutes. We're reconnecting immediately in case the connection to marathon died silently so that we miss as few events as possible (see https://github.com/mesosphere/marathon-lb/blob/master/utils.py#L161-L170).
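For reference, the low-speed condition curl enforces corresponds to something like the check below. The constants come from the error message itself; the function is purely illustrative:

```python
# Illustrative model of curl's low-speed abort (errno 28): fewer than
# LOW_SPEED_LIMIT bytes/sec sustained for LOW_SPEED_TIME seconds.
LOW_SPEED_LIMIT = 1    # bytes/sec
LOW_SPEED_TIME = 300   # seconds

def is_stalled(bytes_received, elapsed_seconds):
    """True when the stream has been effectively idle for the window."""
    if elapsed_seconds < LOW_SPEED_TIME:
        return False
    return bytes_received / elapsed_seconds < LOW_SPEED_LIMIT
```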


Ah !! This makes a lot more sense now. Probably rewrite the comment to state what you just did in explicit terms. "Already waited for 5 minutes so we need to attempt a reconnection immediately".


        return apps

    except requests.exceptions.ConnectionError as e:
        logger.error("Connection error({0}): {1}".format(
            e.errno, e.strerror))
-   except:
+   except Exception:


Should we just log the exception as a string? Otherwise we won't be able to capture the reason. Same goes for the other instances of this exception.

Contributor Author

The line logger.exception("Unexpected error!") right after this one will actually print the exception as well as a stack trace if executed from within an exception context (as this is).
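A minimal demonstration of that behavior (the logger name matches the log output above; the demo function itself is illustrative):

```python
import logging

logger = logging.getLogger("marathon_lb")

def demo(stream):
    # Minimal demonstration: logger.exception, called inside an
    # exception handler, logs the message at ERROR level and appends
    # the full traceback automatically.
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        raise ValueError("boom")
    except Exception:
        logger.exception("Unexpected error!")
    logger.removeHandler(handler)
```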


Got it !!

@drewkerrigan
Contributor Author

@jdef @asridharan @justinrlee @GoelDeepak please check 9493c48 - I've fixed a bug I noticed while working on this and testing it. The processor.start() was creating a new thread on every reconnect. Additionally the previous connection to marathon event stream was not getting closed.
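A minimal sketch of the single-thread guard (a hypothetical Processor, not the real marathon_lb class):

```python
import threading

class Processor:
    """Hypothetical sketch of the fix: start() is idempotent, so
    reconnects no longer spawn a new worker thread each time."""

    def __init__(self):
        self._thread = None
        self._stop = threading.Event()

    def start(self):
        if self._thread is not None and self._thread.is_alive():
            return  # worker already running; do not spawn another
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
```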

marathon_lb.py Outdated
# the last 300 seconds'
# This happens when there is no activity on the marathon
# event stream for the last 5 minutes. In this case we
# should immediately in case the connection to marathon

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/immediately/immediately reconnect

@drewkerrigan drewkerrigan merged commit f0c2433 into master Dec 8, 2017
manriquecms pushed a commit to manriquecms/marathon-lb that referenced this pull request Feb 7, 2018
…e#523)

* Fix reconnect logging and timing depending on error type

* cleanup flake8 errors

* add missing build-essential

* start at most one thread and close connection when reconnecting
manriquecms added a commit to Stratio/marathon-lb-sec that referenced this pull request Feb 8, 2018
* Change README.md (d2iq-archive#487)

Proposed change to README.md to warn users about use of zdd.py on a production environment.

* Add container syslogd support and fix 'Waiting for Pids' race condition (d2iq-archive#505)

* Add additional debug logs in reload
* Adding syslog support
* Make syslogd and retry reload configurable
* Update documentation
* Make infinite retries optional, run syslogd with runsvdir

* Update base container to resolve CVE-2017-12424 (d2iq-archive#506)

* Update base container to resolve CVE-2017-12424

Updated base container to debian:buster to resolve this attack vector.
https://security-tracker.debian.org/tracker/CVE-2017-12424

* delete unnecessary swp file

* Fix grammar in tini section of README.md (d2iq-archive#442)

* exclude bad apps from generated config (d2iq-archive#481)

exclude bad apps from generated config

* Fix reconnect logging and timing depending on error type (d2iq-archive#523)

* Fix reconnect logging and timing depending on error type

* cleanup flake8 errors

* add missing build-essential

* start at most one thread and close connection when reconnecting

* Filter events on server side, in order to decrease load on Marathon (d2iq-archive#541)

* Filter events on server side, in order to decrease load on Marathon
* Use lightweight Marathon events.

* Update official Marathon-LB version to 1.11.3

* Fix error with bash run script, percentage symbol was not escaped
@justinrlee justinrlee deleted the ack/connection-backoff branch April 27, 2018 13:58
@amostipan-softheme

Hello guys, I'm facing this error message in the marathon logs.
It happens every 5 minutes, and while it does my services do not respond.
What is going wrong? My application sees failed requests every 5 minutes; could you please suggest something here?
@drewkerrigan @justinrlee @GoelDeepak @jdef @asridharan
