StripeClient: CLOSE_WAIT possibly causing servers crash after a few hours #850
Comments
To add more context:
Thanks for the report. We were able to reproduce the issue and are looking into the best way to fix this. We'll post an update here as soon as possible.
We usually have one Sidekiq instance with SIDEKIQ_CONCURRENCY set to 20 (threads). Since yesterday, to mitigate the problem, we have been running 2 Sidekiq instances (on different servers) with SIDEKIQ_CONCURRENCY set to 10 each.
Thanks! Just wanted to let you know that I'm working on a fix for this.

I think my original assumption in the design was correct in the way that a connection left in

However, one pattern that I now realize I'm not handling is threads being used ephemerally: if a new thread is spun up for every task, or even if threads are occasionally rotated through as old ones die, connections on no-longer-used threads will not be reclaimed.

I think the best thing to do would be to ensure that there's a cleanup mechanism active in the code that GCs connection managers that haven't seen use in some time. Most of the infrastructure for this change is already in place, so it should be relatively quick to write.
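To make the ephemeral-thread pattern described above concrete, here is a minimal sketch of how a per-thread manager registry behaves under the two usage styles. It does not use stripe-ruby at all: `FakeConnectionManager`, `ManagerRegistry`, and both `run_*` helpers are hypothetical names invented for illustration, and the registry only approximates "one connection manager per thread, tracked globally".

```ruby
# Hypothetical stand-in for a per-thread connection manager (not stripe-ruby's class).
class FakeConnectionManager; end

# Hypothetical process-wide registry that remembers every manager it hands out.
class ManagerRegistry
  def initialize
    @mutex = Mutex.new
    @managers = []
  end

  # Returns the calling thread's manager, creating and registering it on first use.
  def manager_for_current_thread
    Thread.current[:fake_connection_manager] ||= begin
      manager = FakeConnectionManager.new
      @mutex.synchronize { @managers << manager }
      manager
    end
  end

  def size
    @mutex.synchronize { @managers.size }
  end
end

# One short-lived thread per task: every task registers a brand-new manager.
def run_ephemeral_tasks(registry, task_count)
  task_count.times do
    Thread.new { registry.manager_for_current_thread }.join
  end
  registry.size
end

# A fixed pool of long-lived workers: at most one manager per worker.
def run_pooled_tasks(registry, task_count, pool_size)
  queue = Queue.new
  task_count.times { |i| queue << i }
  pool_size.times { queue << :stop }
  workers = Array.new(pool_size) do
    Thread.new do
      registry.manager_for_current_thread until queue.pop == :stop
    end
  end
  workers.each(&:join)
  registry.size
end
```

With a stable pool the registry never grows past the pool size, while the thread-per-task style adds one entry per task and nothing ever removes them, which is the growth the cleanup mechanism is meant to bound.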
Introduces a system for garbage collecting connection managers in an attempt to solve #850.

Previously, the number of connection managers (and by extension the number of connections that they were holding) would stay stable if a program used a stable number of threads. However, if threads were used disposably, the number of active connection managers could continue to grow unchecked, and each of those could be holding one or more dead connections which are no longer open, but still hold a file descriptor waiting to be unlinked and disposed of by Ruby's GC.

This PR introduces a connection manager garbage collector that runs periodically whenever a new connection manager is created. Connection managers get a timestamp to indicate when they were last used, and the GC runs through each one and prunes any that haven't seen use within a certain threshold (currently, 300 seconds). This should have the effect of removing connection managers once they're no longer needed, and thus resolving the socket leakage seen in #850.

I had to make a couple of implementation tweaks to get this working correctly. Namely:

* The `StripeClient` class now tracks thread contexts instead of connection managers. This is so that when we're disposing of a connection manager, we can set `default_connection_manager` on its parent thread context to `nil` so that it's not still tracking a connection manager that we're trying to get rid of.
* `StripeClient` instances can still be instantiated as before, but no longer internalize a reference to their own connection manager, instead falling back to the one in the current thread context. The rationale is that when trying to dispose of a connection manager, we'd also have to dispose of its reference in any outstanding `StripeClient` instances that might still be tracking it, and that starts to get a little unwieldy. I've left `#connection_manager` in place for backwards compatibility, but marked it as deprecated.
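The pruning scheme described above could be sketched roughly as follows. This is not the code from the PR: `SketchConnectionManager`, `SketchThreadContext`, and `SketchRegistry` are hypothetical names, the idle threshold is passed in rather than hard-coded, and timestamps are passed explicitly as `now:` so the behaviour is easy to follow (real code would more likely read a monotonic clock). Only `default_connection_manager` is a name taken from the description.

```ruby
# Hypothetical manager that records when it was last used.
class SketchConnectionManager
  attr_reader :last_used

  def initialize(now:)
    @last_used = now
    @closed = false
  end

  # Marks the manager as used at the given time.
  def touch(now:)
    @last_used = now
  end

  # Stands in for closing the underlying sockets.
  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

# Hypothetical per-thread context holding that thread's default manager.
class SketchThreadContext
  attr_accessor :default_connection_manager
end

# Hypothetical registry that tracks thread contexts and prunes idle managers.
class SketchRegistry
  def initialize(idle_threshold:)
    @idle_threshold = idle_threshold # seconds of inactivity before pruning
    @mutex = Mutex.new
    @contexts = []
  end

  # Creating a manager is the trigger for a garbage collection pass.
  def create_manager(context, now:)
    @mutex.synchronize do
      prune(now)
      context.default_connection_manager = SketchConnectionManager.new(now: now)
      @contexts << context unless @contexts.include?(context)
      context.default_connection_manager
    end
  end

  def active_count
    @mutex.synchronize { @contexts.size }
  end

  private

  # Closes and forgets every manager idle for longer than the threshold.
  def prune(now)
    @contexts.reject! do |context|
      manager = context.default_connection_manager
      stale = manager.nil? || now - manager.last_used > @idle_threshold
      if stale && manager
        manager.close
        context.default_connection_manager = nil
      end
      stale
    end
  end
end
```

Tracking contexts rather than managers is what lets the prune step clear `default_connection_manager` on the owning context, so a thread that later comes back simply gets a fresh manager instead of a closed one.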
Alright, I believe I have a working patch in #851.
Introduces a system for garbage collecting connection managers in an attempt to solve #850.

Previously, the number of connection managers (and by extension the number of connections that they were holding) would stay stable if a program used a stable number of threads. However, if threads were used disposably, the number of active connection managers could continue to grow unchecked, and each of those could be holding one or more dead connections which are no longer open, but still hold a file descriptor waiting to be unlinked and disposed of by Ruby's GC.

This PR introduces a connection manager garbage collector that runs periodically whenever a new connection manager is created. Connection managers get a timestamp to indicate when they were last used, and the GC runs through each one and prunes any that haven't seen use within a certain threshold (currently, 120 seconds). This should have the effect of removing connection managers once they're no longer needed, and thus resolving the socket leakage seen in #850.

I had to make a couple of implementation tweaks to get this working correctly. Namely:

* The `StripeClient` class now tracks thread contexts instead of connection managers. This is so that when we're disposing of a connection manager, we can set `default_connection_manager` on its parent thread context to `nil` so that it's not still tracking a connection manager that we're trying to get rid of.
* `StripeClient` instances can still be instantiated as before, but no longer internalize a reference to their own connection manager, instead falling back to the one in the current thread context. The rationale is that when trying to dispose of a connection manager, we'd also have to dispose of its reference in any outstanding `StripeClient` instances that might still be tracking it, and that starts to get a little unwieldy. I've left `#connection_manager` in place for backwards compatibility, but marked it as deprecated.
@fabiolnm @rebelvc I just released what I think is a fix in 5.2.0.

I wrote a program that spins up and then churns out many threads over an extended period of time, and was able to verify that the number of connection managers (and by extension, connections) being managed by

Based on a review of the code, I think this is likely to address the problem you're seeing, but I'm not absolutely sure, so we could use your help verifying it. Would you mind updating to the new version and letting us know what you're seeing?
@brandur-stripe thank you so much for this fix. We have deployed it to production. Before the fix, we had increased the number of Sidekiq servers so that each one had less chance of overloading. Now I ran
@rebelvc Excellent, thanks for checking! That number you're seeing should also stay relatively stable over time (maybe not exactly stable, but pretty level), so let me know if you're seeing something else.

One other thing I'll note for clarity: just seeing

Going to close this out for now, but thanks for reporting and the quick feedback.
Software Versions
Original problems
Investigation
`sudo lsof` is showing hundreds of lines containing:
Reproducing CLOSE_WAIT in localhost
Wait ~1 minute and run
The process PID 6216 is the rails console:
lsof keeps listing it until rails console is closed.
Related Google searches