StripeClient: CLOSE_WAIT possibly causing servers crash after a few hours #850

fabiolnm · 2019-09-19T14:50:40Z

Software Versions

Ruby: 2.6.4
stripe-ruby gem: 5.1.1

Original problems

Too many open files @ rb_sysopen - /proc/version
failed to get urandom

Investigation

sudo lsof is showing hundreds of lines containing:

server_name>api-34-202-153-183.stripe.com:https (CLOSE_WAIT)

sudo lsof | grep CLOSE_WAIT | grep stripe | wc -l
10468

Reproducing CLOSE_WAIT in localhost

rails c

Stripe::Charge.retrieve charge_id, stripe_account: connected_account_id

Wait ~1 minute and run

lsof | grep CLOSE_WAIT

ruby       6216 fabio   17u     IPv4 0x4b0d201b3705f30f        0t0        TCP 192.168.0.12:51558->api-34-200-27-109.stripe.com:https (CLOSE_WAIT)

The process PID 6216 is the rails console:

fabio$ ps -ef | grep ruby
  501  6216  6782   0 10:58am ttys000    0:08.82 /Users/fabio/.rbenv/versions/2.6.4/bin/ruby bin/rails c

lsof keeps listing it until rails console is closed.

Related Google searches

The text was updated successfully, but these errors were encountered:

rebelvc · 2019-09-19T17:38:58Z

To add more context:
The crash is happening on our Sidekiq servers. We use Sidekiq to

Retrieve payment processing fees
Update meta tags of stripe payments.
Sometimes we process a large amount of these background calls when we run recurring donations.

ob-stripe · 2019-09-19T17:42:06Z

Thanks for the report. We were able to reproduce the issue and are looking into the best way to fix this. We'll post an update here as soon as possible.

brandur-stripe · 2019-09-19T17:57:24Z

Sorry about the trouble here @fabiolnm @rebelvc. Could you comment on what your concurrency set up looks like? Are you using multiple processes and/or multiple threads and/or multiple fibers?

We've got likely ideas for where the problem may be, but are trying to narrow it down.

fabiolnm · 2019-09-19T18:03:37Z

We usually have one Sidekiq instance with SIDEKIQ_CONCURRENCY set to 20 (threads).

Since yesterday, in order to mitigate the problem, we have 2 Sidekiq instances (different servers) with SIDEKIQ_CONCURRENCY set to 10, each.

brandur-stripe · 2019-09-19T18:46:24Z

Thanks! Just wanted to let you know that I'm working on a fix for this.

I think my original assumption in the design was correct in the way that a connection left in CLOSE_WAIT is reusable — if I start 10 threads, have them start making requests and waiting more than 60 seconds between them (so that the connection enters CLOSE_WAIT), each subsequent request seems to be correctly reusing the connection that was originally created for the thread. The number of connections is stable no matter how long I run the program.

However, one pattern that I realize now that I'm not handling is if threads are used ephemerally. i.e. If a new thread is spun up for every task, or even if threads are occasionally rotated through as old ones die, connections on no-longer-used threads will not be reclaimed.

I think the best thing to do would is to ensure that there's a cleanup mechanism active in the code that GCs connection managers that haven't seen use in some time. Most of the infrastructure for this change is already in place, so it should be relatively quick to write.

Introduces a system for garbage collecting connection managers in an attempt to solve #850. Previously, the number of connection managers (and by extension the number of connections that they were holding) would stay stable if a program used a stable number of threads. However, if threads were used disposably, the number of active connection managers out there could continue to grow unchecked, and each of those could be holding one or more dead connections which are no longer open, but still holding a file descriptor waiting to be unlinked in disposed of by Ruby's GC. This PR introduces a connection manager garbage collector that runs periodically whenever a new connection manager is created. Connection managers get a timestamp to indicate when they were last used, and the GC runs through each one and prunes any that haven't seen use within a certain threshold (currently, 300 seconds). This should have the effect of removing connection managers as they're not needed anymore, and thus resolving the socket leakage seen in #850. I had to make a couple implementation tweaks to get this working correctly. Namely: * The `StripeClient` class now tracks thread contexts instead of connection managers. This is so that when we're disposing of a connection manager, we can set `default_connection_manager` on its parent thread context to `nil` so that it's not still tracking a connection manager that we're trying to get rid of. * `StripeClient` instances can still be instantiated as before, but no longer internalize a reference to their own connection manager, instead falling back to the one in the current thread context. The rationale is that when trying to dispose of a connection manager, we'd also have to dispose of its reference in any outstanding `StripeClient` instances that might still be tracking it, and that starts to get a little unwieldy. I've left `#connection_manager` in place for backwards compatibility, but marked it as deprecated.

brandur-stripe · 2019-09-19T23:40:09Z

Alright, I believe I have a working patch in #851.

Introduces a system for garbage collecting connection managers in an attempt to solve #850. Previously, the number of connection managers (and by extension the number of connections that they were holding) would stay stable if a program used a stable number of threads. However, if threads were used disposably, the number of active connection managers out there could continue to grow unchecked, and each of those could be holding one or more dead connections which are no longer open, but still holding a file descriptor waiting to be unlinked in disposed of by Ruby's GC. This PR introduces a connection manager garbage collector that runs periodically whenever a new connection manager is created. Connection managers get a timestamp to indicate when they were last used, and the GC runs through each one and prunes any that haven't seen use within a certain threshold (currently, 300 seconds). This should have the effect of removing connection managers as they're not needed anymore, and thus resolving the socket leakage seen in #850. I had to make a couple implementation tweaks to get this working correctly. Namely: * The `StripeClient` class now tracks thread contexts instead of connection managers. This is so that when we're disposing of a connection manager, we can set `default_connection_manager` on its parent thread context to `nil` so that it's not still tracking a connection manager that we're trying to get rid of. * `StripeClient` instances can still be instantiated as before, but no longer internalize a reference to their own connection manager, instead falling back to the one in the current thread context. The rationale is that when trying to dispose of a connection manager, we'd also have to dispose of its reference in any outstanding `StripeClient` instances that might still be tracking it, and that starts to get a little unwieldy. I've left `#connection_manager` in place for backwards compatibility, but marked it as deprecated.

Introduces a system for garbage collecting connection managers in an attempt to solve #850. Previously, the number of connection managers (and by extension the number of connections that they were holding) would stay stable if a program used a stable number of threads. However, if threads were used disposably, the number of active connection managers out there could continue to grow unchecked, and each of those could be holding one or more dead connections which are no longer open, but still holding a file descriptor waiting to be unlinked in disposed of by Ruby's GC. This PR introduces a connection manager garbage collector that runs periodically whenever a new connection manager is created. Connection managers get a timestamp to indicate when they were last used, and the GC runs through each one and prunes any that haven't seen use within a certain threshold (currently, 120 seconds). This should have the effect of removing connection managers as they're not needed anymore, and thus resolving the socket leakage seen in #850. I had to make a couple implementation tweaks to get this working correctly. Namely: * The `StripeClient` class now tracks thread contexts instead of connection managers. This is so that when we're disposing of a connection manager, we can set `default_connection_manager` on its parent thread context to `nil` so that it's not still tracking a connection manager that we're trying to get rid of. * `StripeClient` instances can still be instantiated as before, but no longer internalize a reference to their own connection manager, instead falling back to the one in the current thread context. The rationale is that when trying to dispose of a connection manager, we'd also have to dispose of its reference in any outstanding `StripeClient` instances that might still be tracking it, and that starts to get a little unwieldy. I've left `#connection_manager` in place for backwards compatibility, but marked it as deprecated.

brandur-stripe · 2019-09-20T06:48:58Z

@fabiolnm @rebelvc I just released what I think is a fix in 5.2.0.

I wrote a program that spins up and then churns out many threads over an extended period of time, and was able to verify that the number of connection managers (and by extension, connections) being managed by StripeClient stayed stable.

Based on a review of the code, I think this is likely to address the problem you're seeing, but I'm not absolutely sure, so we could use your help verifying it. Would you mind updating to the new version and let us know what you're seeing?

rebelvc · 2019-09-20T18:47:58Z

@brandur-stripe thank you so much for this fix. We have deployed it on production. Before the fix, we have increased the number of sidekiq server so each one has a less chance of overloading.

Now i ran sudo lsof | grep CLOSE_WAIT | grep stripe | wc -l
the result is 203 instead of in the thousands previously.

brandur-stripe · 2019-09-20T20:16:22Z

@rebelvc Excellent, thanks for checking! That number you're seeing should also stay relatively stable over time (maybe not exactly stable, but pretty level), so let me know if you're seeing something else.

One other thing I'll note for clarity: just seeing CLOSE_WAIT isn't necessarily a bad thing — that means we have a closed connection because it's been idle for too long, but it can still be potentially reused if the owning thread wakes up and issues a new request. Even if each thread is making requests to api.stripe.com fairly actively, it may still be holding a connection to files.stripe.com or connect.stripe.com that's used much less frequently. So some of those 203 CLOSE_WAIT you're seeing may still be on threads that can still do work over those connections if the need arises.

Going to close this out for now, but thanks for reporting and the quick feedback.

brandur mentioned this issue Sep 19, 2019

Introduce system for garbage collecting connection managers #851

Merged

brandur-stripe closed this as completed Sep 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StripeClient: CLOSE_WAIT possibly causing servers crash after a few hours #850

StripeClient: CLOSE_WAIT possibly causing servers crash after a few hours #850

fabiolnm commented Sep 19, 2019

rebelvc commented Sep 19, 2019

ob-stripe commented Sep 19, 2019

brandur-stripe commented Sep 19, 2019

fabiolnm commented Sep 19, 2019 •

edited

Loading

brandur-stripe commented Sep 19, 2019

brandur-stripe commented Sep 19, 2019

brandur-stripe commented Sep 20, 2019

rebelvc commented Sep 20, 2019

brandur-stripe commented Sep 20, 2019

StripeClient: CLOSE_WAIT possibly causing servers crash after a few hours #850

StripeClient: CLOSE_WAIT possibly causing servers crash after a few hours #850

Comments

fabiolnm commented Sep 19, 2019

Software Versions

Original problems

Investigation

Reproducing CLOSE_WAIT in localhost

Related Google searches

rebelvc commented Sep 19, 2019

ob-stripe commented Sep 19, 2019

brandur-stripe commented Sep 19, 2019

fabiolnm commented Sep 19, 2019 • edited Loading

brandur-stripe commented Sep 19, 2019

brandur-stripe commented Sep 19, 2019

brandur-stripe commented Sep 20, 2019

rebelvc commented Sep 20, 2019

brandur-stripe commented Sep 20, 2019

fabiolnm commented Sep 19, 2019 •

edited

Loading