This repository was archived by the owner on Dec 4, 2024. It is now read-only.

Fix reconnect logging and timing depending on error type #523

Merged: 9 commits merged into master from ack/connection-backoff on Dec 8, 2017

Conversation

Contributor

@drewkerrigan drewkerrigan commented Dec 6, 2017

Addresses the problem of continuously throwing the following error:

2017-11-03 05:31:36,031 marathon_lb: Caught exception
Traceback (most recent call last):
File "/marathon-lb/marathon_lb.py", line 1799, in <module>
process_sse_events(marathon, processor)
File "/marathon-lb/marathon_lb.py", line 1700, in process_sse_events
for event in events:
File "/marathon-lb/marathon_lb.py", line 239, in get_event_stream
for line in resp.iter_lines():
File "/marathon-lb/utils.py", line 241, in _split_lines_from_chunks
for chunk in chunks:
File "/marathon-lb/utils.py", line 226, in _iter_chunks
self._check_curl_errors()
File "/marathon-lb/utils.py", line 230, in _check_curl_errors
raise pycurl.error(*f[1:])
pycurl.error: (28, 'Operation too slow. Less than 1 bytes/sec transferred the last 300 seconds')
2017-11-03 05:31:36,043 marathon_lb: Reconnecting in 4.5s...

The fix is to reconnect immediately when this error is encountered. The changes also include a slight fix to the logging - it was not accurately displaying the number of seconds to wait previously.

After the fix, the log output for the above case is:

2017-12-06 23:55:18,539 marathon_lb: Possible timeout detected: Operation too slow. Less than 1 bytes/sec transferred the last 300 seconds, reconnecting now...
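The backoff decision described above can be sketched roughly as follows. This is an illustrative sketch, not the committed marathon_lb code; the function and constant names are hypothetical:

```python
# Illustrative sketch of the reconnect-backoff decision; names are
# hypothetical, not the actual marathon_lb implementation.
CURL_LOW_SPEED_ERRNO = 28  # 'Operation too slow ...'

def next_wait_seconds(errno, current_wait, max_wait=60):
    """Seconds to wait before reconnecting to the event stream.

    curl errno 28 means the stream was already idle for the whole
    low-speed window, so reconnect immediately; any other error
    backs off as usual.
    """
    if errno == CURL_LOW_SPEED_ERRNO:
        return 0
    return min(current_wait * 1.5, max_wait)
```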

Contributor

@justinrlee justinrlee left a comment

Looks good to me!

@justinrlee
Contributor

Tested an invalid haproxy config (based on the other PR, #481), which looks like it behaves correctly. I don't have a test case for the timeout issue, but the logic all looks correct to me. If there's no feedback from @jdef or @GoelDeepak by Thursday evening, I'll merge (unless you wanna go ahead and merge now).

@justinrlee
Contributor

Also, once this is merged, I'll cut a release (1.11.2)

@drewkerrigan
Contributor Author

Also of note: I also fixed a bunch of flake8 problems introduced by the validation PR - not sure why they didn't fail for that. I think pyflakes got updated and now checks for additional things like "bare excepts."

Another fix was adding build-essential to the build deps because of this error: x86_64-linux-gnu-gcc: error: /usr/share/dpkg/no-pie-compile.specs: No such file or directory - That was preventing docker images from getting built.

marathon_lb.py Outdated
            logger.info(m.format(e_msg))
            currentWaitSeconds = 0
        else:
            raise
Contributor

Does this `raise` trigger a reconnect? I'm not familiar with Python exception handling.

Contributor Author

Great catch! This would actually exit the loop completely and not get caught by the following except. The intended behavior would be to log the exception and then reconnect with backoff as normal.
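The intended behavior described here could look roughly like this. A hypothetical sketch, not the committed code; the helper name and return convention are illustrative:

```python
import logging

logger = logging.getLogger("marathon_lb")

def handle_stream_exception(exc, is_timeout):
    """Hypothetical sketch: never re-raise out of the retry loop.

    Returns a backoff override: 0 for an immediate reconnect,
    None to keep the normal backoff schedule.
    """
    if is_timeout:
        logger.info("Possible timeout detected: %s, reconnecting now...", exc)
        return 0
    # Re-raising here would escape the retry loop entirely (the bug
    # caught above); logging instead lets the caller reconnect with
    # the usual backoff.
    logger.error("Unexpected error: %s, reconnecting with backoff", exc)
    return None
```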

marathon_lb.py Outdated
# 'Operation too slow. Less than 1 bytes/sec transferred
# the last 300 seconds'
# In this case we should immediately reconnect
# without a backoff.


Just trying to understand the reasoning for reconnecting immediately. How will it help the issue? Why not wait for the backoff wait time to kick in to allow the congestion to dissipate?

Contributor Author

@drewkerrigan drewkerrigan Dec 7, 2017

The error number 28 happens when there is no activity on the marathon event stream for 5 minutes. We're reconnecting immediately in case the connection to marathon died silently so that we miss as few events as possible (see https://github.com/mesosphere/marathon-lb/blob/master/utils.py#L161-L170).
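For reference, the low-speed condition curl enforces corresponds to something like the check below. The constants come from the error message itself; the function is purely illustrative:

```python
# Illustrative model of curl's low-speed abort (errno 28): fewer than
# LOW_SPEED_LIMIT bytes/sec sustained for LOW_SPEED_TIME seconds.
LOW_SPEED_LIMIT = 1    # bytes/sec
LOW_SPEED_TIME = 300   # seconds

def is_stalled(bytes_received, elapsed_seconds):
    """True when the stream has been effectively idle for the window."""
    if elapsed_seconds < LOW_SPEED_TIME:
        return False
    return bytes_received / elapsed_seconds < LOW_SPEED_LIMIT
```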


Ah !! This makes a lot more sense now. Probably rewrite the comment to state what you just did in explicit terms. "Already waited for 5 minutes so we need to attempt a reconnection immediately".


        return apps

    except requests.exceptions.ConnectionError as e:
        logger.error("Connection error({0}): {1}".format(
            e.errno, e.strerror))
-   except:
+   except Exception:


Should we just log the exception as a string? Otherwise we won't be able to capture the reason. Same goes for the other instances of this exception.

Contributor Author

The line logger.exception("Unexpected error!") right after this one will actually print the exception as well as a stack trace if executed from within an exception context (as this is).
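A minimal demonstration of that behavior (the logger name matches the log output above; the demo function itself is illustrative):

```python
import logging

logger = logging.getLogger("marathon_lb")

def demo(stream):
    # Minimal demonstration: logger.exception, called inside an
    # exception handler, logs the message at ERROR level and appends
    # the full traceback automatically.
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        raise ValueError("boom")
    except Exception:
        logger.exception("Unexpected error!")
    logger.removeHandler(handler)
```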


Got it !!

@drewkerrigan
Contributor Author

@jdef @asridharan @justinrlee @GoelDeepak please check 9493c48 - I've fixed a bug I noticed while working on this and testing it. The processor.start() was creating a new thread on every reconnect. Additionally the previous connection to marathon event stream was not getting closed.
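A minimal sketch of the single-thread guard (a hypothetical Processor, not the real marathon_lb class):

```python
import threading

class Processor:
    """Hypothetical sketch of the fix: start() is idempotent, so
    reconnects no longer spawn a new worker thread each time."""

    def __init__(self):
        self._thread = None
        self._stop = threading.Event()

    def start(self):
        if self._thread is not None and self._thread.is_alive():
            return  # worker already running; do not spawn another
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
```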

marathon_lb.py Outdated
# the last 300 seconds'
# This happens when there is no activity on the marathon
# event stream for the last 5 minutes. In this case we
# should immediately in case the connection to marathon

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/immediately/immediately reconnect

@drewkerrigan drewkerrigan merged commit f0c2433 into master Dec 8, 2017
manriquecms pushed a commit to manriquecms/marathon-lb that referenced this pull request Feb 7, 2018
…e#523)

* Fix reconnect logging and timing depending on error type

* cleanup flake8 errors

* add missing build-essential

* start at most one thread and close connection when reconnecting
manriquecms added a commit to Stratio/marathon-lb-sec that referenced this pull request Feb 8, 2018
* Change README.md (d2iq-archive#487)

Proposed change to README.md to warn users about use of zdd.py on a production environment.

* Add container syslogd support and fix 'Waiting for Pids' race condition (d2iq-archive#505)

* Add additional debug logs in reload
* Adding syslog support
* Make syslogd and retry reload configurable
* Update documentation
* Make infinite retries optional, run syslogd with runsvdir

* Update base container to resolve CVE-2017-12424 (d2iq-archive#506)

* Update base container to resolve CVE-2017-12424

Updated base container to debian:buster to resolve this attack vector.
https://security-tracker.debian.org/tracker/CVE-2017-12424

* delete unnecessary swp file

* Fix grammar in tini section of README.md (d2iq-archive#442)

* exclude bad apps from generated config (d2iq-archive#481)

exclude bad apps from generated config

* Fix reconnect logging and timing depending on error type (d2iq-archive#523)

* Fix reconnect logging and timing depending on error type

* cleanup flake8 errors

* add missing build-essential

* start at most one thread and close connection when reconnecting

* Filter events on server side, in order to decrease load on Marathon (d2iq-archive#541)

* Filter events on server side, in order to decrease load on Marathon
* Use lightweight Marathon events.

* Update official Marathon-LB version to 1.11.3

* Fix error with bash run script, percentage symbol was not escaped
@justinrlee justinrlee deleted the ack/connection-backoff branch April 27, 2018 13:58
@amostipan-softheme

Hello guys, I'm facing this error message in the marathon logs.
It happens every 5 minutes, and while it does my services do not respond.
What is going wrong? My application sees failed requests every 5 minutes; could you please suggest something here?
@drewkerrigan @justinrlee @GoelDeepak @jdef @asridharan
