Linux: Cleanup taskq threads spawn/exit #15873
This changes taskq_thread_should_stop() to limit maximum exit rate for idle threads to one per 5 seconds. I believe the previous one was broken, not allowing any thread exits for tasks arriving more than one at a time and so completing while others are running.

Also while there:
- Remove taskq_thread_spawn() calls on task allocation errors.
- Remove extra taskq_thread_should_stop() call.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
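For readers who want to see the idea of "at most one idle-thread exit per interval" in isolation, here is a minimal userspace sketch. It is not the SPL code: `struct pool`, `pool_idle_thread_may_exit()` and `EXIT_INTERVAL_MS` are hypothetical names made up for illustration, time is passed in as a plain millisecond counter instead of jiffies, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: minimum spacing between two thread exits. */
#define EXIT_INTERVAL_MS 5000

/* Hypothetical pool state; not the real taskq_t layout. */
struct pool {
	int nthreads;          /* threads currently alive */
	int minthreads;        /* never shrink below this count */
	uint64_t last_exit_ms; /* time of the most recent approved exit */
};

/*
 * Called by a worker that found no work. Returns true if this thread
 * may exit now, and records the exit so that the next one is delayed
 * by at least EXIT_INTERVAL_MS.
 */
static bool
pool_idle_thread_may_exit(struct pool *p, uint64_t now_ms)
{
	assert(p->nthreads >= p->minthreads);

	if (p->nthreads <= p->minthreads)
		return (false); /* already at the floor */
	if (now_ms - p->last_exit_ms < EXIT_INTERVAL_MS)
		return (false); /* another thread exited too recently */

	p->last_exit_ms = now_ms;
	p->nthreads--;
	return (true);
}
```

With a design like this, the timestamp only moves when a thread actually exits, so a busy pool with surplus threads still shrinks steadily, one thread per interval, instead of depending on the whole pool being idle at the moment of the check.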
I'm curious if this was causing a specific bottleneck that you saw, or is just you reviewing things for cleanup? Was the 5s time picked based on some data about it changing the outcome? At least when I originally added it, the threshold was just picked based on experimental data of what number produced very little churn on my system, since I originally looked at this after noticing what fraction of my system's idle time was spent just creating and destroying threads transiently.
@rincebrain We are now trying to understand a 10-20% performance degradation of ZFS 2.2 vs 2.1 that we see in certain heavy Samba-to-RAIDZ write workloads. Unfortunately, profiling does not show anything meaningful, only that CPU time is distributed slightly differently between code parts, which makes me wonder if it can be caused by some changes in scheduling on top of SMT or cache topology. Looking through the large list of changes I spotted your commit, which made me look deeper. On FreeBSD some time ago I patched the kernel taskq implementation to not do fair round-robin among all the taskq threads, using only the minimally required count (LIFO vs FIFO), since in the general case all taskq threads should be identical, so extra threads do not affect scheduling. Unfortunately I haven't found an easy way to do it on Linux, so I was looking at reducing the number of threads.

Unfortunately I don't have performance numbers yet to prove or disprove my guess; our perf team is busy. So I am just going from my understanding that the current logic after your 35a6247 commit may not shrink taskqs on certain workloads, which may complicate the scheduler's job. The change from 10s to 5s is more of a feeling. Since we may have dozens if not hundreds of threads, at one thread exit per 10s with the logic I propose a taskq may take a long time to shrink. At the same time, values below 5s may fluctuate too much within one TXG commit interval.
We could also try adding more tracking of how many times we pass through this code and do/don't turnover per taskq to understand how much we're churning in different thread types, and go from there. (Not saying this should or shouldn't be done based on that, just speculating aloud how I'd try to isolate whether that happening in some or all threads was related, for future more fine-grained options...) Does turning off this change affect the workloads observed? Because setting the tunable to "0" should make my change moot, for testing purposes, I believe...
Yes, that is exactly what I have asked our perf team. I was promised some results this week. But even if that is not the cause of the slowdown we see, I still think your patch may not work as expected, and I propose to discuss mine.
Sure, I don't think anything about this is unreasonable, and I'll mark it as reviewed, I just wanted to understand whether you had tried turning off that change. I don't particularly think the old patch should be able to cause many problems with teardown, since it should just come around and try again. My only question is, have you tested this on FBSD 13, since the buildbot seems to be broken on FBSD atm? (I realize this change shouldn't affect anything on FBSD, I swear, I just am really cautious about assuming FBSD isn't broken given previous "oops we broke it" experiences.)
As I said, I believe that if you enqueue two or more tasks each time, then on each first completion it will reset tq->lastshouldstop = 0 and never allow other threads to exit after that.
I cannot imagine how it could affect FreeBSD. That code is not shared.
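The reset described in this comment can be illustrated with a toy model. This is only one reading of the failure mode, not the SPL code: `struct model`, `model_should_stop()`, `model_count_exits()` and `TIMEOUT_MS` are hypothetical, the only field borrowed from the discussion is the name `lastshouldstop`, and the arrival pattern (a batch of tasks every fixed period, completing one after another) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical idle timeout before a thread is allowed to exit. */
#define TIMEOUT_MS 10000

struct model {
	uint64_t lastshouldstop; /* 0 means "timer not armed" */
};

/*
 * Toy version of the check: the timer is armed the first time the
 * queue looks idle and is reset whenever other tasks are still active.
 */
static bool
model_should_stop(struct model *m, uint64_t now_ms, int others_active)
{
	assert(others_active >= 0);

	if (others_active > 0) {
		m->lastshouldstop = 0; /* the reset in question */
		return (false);
	}
	if (m->lastshouldstop == 0) {
		m->lastshouldstop = now_ms; /* arm the timer */
		return (false);
	}
	return (now_ms - m->lastshouldstop > TIMEOUT_MS);
}

/*
 * Count approved exits when `batch` tasks arrive every period_ms.
 * Each completing task sees the rest of its batch still running.
 */
static int
model_count_exits(int batch, uint64_t period_ms, uint64_t total_ms)
{
	struct model m = { 0 };
	int exits = 0;

	for (uint64_t t = period_ms; t <= total_ms; t += period_ms) {
		for (int remaining = batch - 1; remaining >= 0; remaining--) {
			if (model_should_stop(&m, t, remaining))
				exits++;
		}
	}
	return (exits);
}
```

In this model, single tasks arriving once per second eventually let the timeout elapse, while batches of two re-arm the timer on every batch, so no exit is ever approved no matter how many surplus threads exist.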
Ah, I misunderstood your description, I didn't understand from previous exchanges that you meant it triggered the =0 reset always happening. Thank you for clarifying.
This looks good
@amotin is this relevant for 2.2?
It's applicable, but given it's not critical and how fresh it is, I'd like to let it soak for a bit before backporting.
@rincebrain I've just got confirmation from our performance team that spl_taskq_thread_timeout_ms=0 fixes the performance issue. Now they should try this patch instead.
Well, that's a useful data point. So if people think this is causing them problems, in some cases, they can use that, at least, to restore status quo ante until this or some other patch rolls out with better behaviors.
This changes taskq_thread_should_stop() to limit maximum exit rate for idle threads to one per 5 seconds. I believe the previous one was broken, not allowing any thread exits for tasks arriving more than one at a time and so completing while others are running.

Also while there:
- Remove taskq_thread_spawn() calls on task allocation errors.
- Remove extra taskq_thread_should_stop() call.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rich Ercolani <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#15873