fix: bitswap lock contention under high load #817
Conversation
Codecov Report
Attention: Patch coverage is
@@ Coverage Diff @@
## main #817 +/- ##
==========================================
- Coverage 60.49% 60.48% -0.01%
==========================================
Files 244 244
Lines 31079 31100 +21
==========================================
+ Hits 18800 18810 +10
- Misses 10603 10615 +12
+ Partials 1676 1675 -1
For posterity, the staging box looks really promising: during the window when this was deployed to box 02, it was in significantly better shape than kubo 0.32.1 (box 01):
HTTP success rate is higher too:
EOD for me, but I'll do more tests tomorrow morning and see if any questions arise. Some quick ones inline.
LGTM, this is such an improvement that we should ship it as a patch release next week.
For posterity: based on our (Shipyard) staging tests, the impact on high-load providers is significant.
Below is a sample from an HTTP gateway processing ~80 requests per second (mirrored organic cache-miss traffic from ipfs.io). "01" is the latest Kubo (0.33.0) without this fix, and "02" is with this fix (0.33.1):
Summary
Fix runaway goroutine creation under high load. Under high load, goroutines are created faster than they can complete, and the more goroutines exist, the slower they complete. This creates a positive feedback cycle that ends in OOM. The fix dynamically adjusts message send scheduling to avoid the runaway condition.
Description of Lock Contention under High Load
The peermanager acquires the peermanager mutex, does peermanager stuff, and then acquires the messagequeue mutex for each peer to put wants/cancels on that peer's message queue. Nothing is blocked indefinitely, but all session goroutines wait on the peermanager mutex.
The messagequeue event loop for each peer is always running in a separate goroutine, waking up every time new data is added to the message queue. The messagequeue acquires the messagequeue mutex to check the amount of pending work and send a message if there is enough work.
The frequent lock/unlock of each messagequeue mutex delays each session goroutine from adding items to messagequeues, as they wait to acquire each peer's messagequeue mutex to enqueue a message. These delays cause the peermanager mutex to be held longer by each goroutine. When there are a sufficient number of peers and want requests, goroutines end up waiting on the peermanager mutex for a longer time, on average, than it takes for an additional request to arrive and start another goroutine. This leads to a positive feedback loop where the number of goroutines increases until their number alone is sufficient to cause OOM.
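A minimal sketch of the locking pattern described above, with illustrative type and field names rather than the actual boxo/bitswap code: every session goroutine holds the peermanager mutex while also taking each peer's messagequeue mutex, so contention on the per-peer mutexes stretches out how long the peermanager mutex is held.

```go
package sketch

import "sync"

// messageQueue and peerManager are simplified stand-ins for the real types.
type messageQueue struct {
	mu      sync.Mutex
	pending []string      // wants/cancels waiting to be sent
	wake    chan struct{} // signals the queue's event loop
}

type peerManager struct {
	mu     sync.Mutex
	queues map[string]*messageQueue // keyed by peer ID
}

// Every session goroutine funnels through here: it holds pm.mu while also
// taking each peer's queue mutex. If mq.mu is frequently locked by the
// queue's event loop, pm.mu is held longer and waiters pile up.
func (pm *peerManager) sendWants(peers []string, wants []string) {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	for _, p := range peers {
		mq := pm.queues[p]
		mq.mu.Lock()
		mq.pending = append(mq.pending, wants...)
		mq.mu.Unlock()
		select {
		case mq.wake <- struct{}{}: // wake the event loop on every addition
		default:
		}
	}
}
```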
How this PR Fixes this
This PR avoids waking up the messagequeue event loop on every item added to the message queue, thus avoiding the high-frequency messagequeue mutex lock/unlock. Instead, the event loop wakes up after a delay, sends the accumulated work, then goes back to sleep for another delay. During the delay, wants and cancels are accumulated. This allows the session goroutines to add items to message queues without contending with the messagequeue event loop for the messagequeue mutex.
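A hedged sketch of the delay-based event loop described above, reusing the simplified messageQueue type from the earlier sketch (again illustrative, not the actual boxo implementation): instead of waking on every enqueued item, the loop sleeps for sendDelay, then drains whatever has accumulated into a single message.

```go
import "time"

// run sleeps for sendDelay between sends, so session goroutines can enqueue
// work without racing the event loop for mq.mu on every single item.
func (mq *messageQueue) run(done <-chan struct{}, sendDelay time.Duration) {
	timer := time.NewTimer(sendDelay)
	defer timer.Stop()
	for {
		select {
		case <-done:
			return
		case <-timer.C:
			mq.mu.Lock()
			batch := mq.pending
			mq.pending = nil
			mq.mu.Unlock()
			if len(batch) > 0 {
				mq.send(batch) // send accumulated wants/cancels in one message
			}
			timer.Reset(sendDelay)
		}
	}
}

// send is a placeholder for transmitting the batched wants/cancels to the peer.
func (mq *messageQueue) send(batch []string) { /* network send elided */ }
```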
The delay dynamically adjusts, between 20ms and 1 second, based on the number of peers. The delay per peer is configurable, with a default of 1/8 millisecond (125us). A small sketch of that calculation follows.
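A sketch of the dynamic delay described above, roughly 125µs per peer clamped to [20ms, 1s]; the constant and function names here are illustrative, not taken from the PR.

```go
import "time"

const (
	delayPerPeer = 125 * time.Microsecond // configurable; default 1/8 ms
	minSendDelay = 20 * time.Millisecond
	maxSendDelay = time.Second
)

// sendDelayFor scales the message-send delay with the number of peers,
// clamped to the [minSendDelay, maxSendDelay] range.
func sendDelayFor(numPeers int) time.Duration {
	d := time.Duration(numPeers) * delayPerPeer
	if d < minSendDelay {
		return minSendDelay
	}
	if d > maxSendDelay {
		return maxSendDelay
	}
	return d
}
```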