
Write issue taskq shouldn't be dynamic #5236

Merged: 1 commit into openzfs:master on Oct 10, 2016

Conversation

dweeezil (Contributor) commented Oct 6, 2016

This is as much an upstream-compatibility fix as it is a performance gain.

The illumos taskq implementation doesn't allow a TASKQ_THREADS_CPU_PCT taskq to be dynamic, and in fact enforces as much with an ASSERT.

As for performance: if this taskq is dynamic, it can cause excessive contention on tq_lock as threads are created and destroyed, because the taskq can see bursts of many thousands of tasks in a short time, particularly under heavy high-concurrency zvol write workloads.
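
For context, a minimal sketch of the invariant in question; the flag values and the exact form of the check are illustrative, not copied from illumos:

    #include <assert.h>

    /* Flag values are illustrative; see illumos' <sys/taskq.h> for the real ones. */
    #define TASKQ_DYNAMIC          0x0002  /* worker threads created on demand */
    #define TASKQ_THREADS_CPU_PCT  0x0008  /* nthreads is a percentage of ncpus */

    /*
     * The illumos-style invariant this change aligns with: a cpu-proportioned
     * taskq keeps a fixed thread count, so it must never also be dynamic.
     */
    static void
    taskq_validate_flags(unsigned int flags)
    {
        assert(!((flags & TASKQ_THREADS_CPU_PCT) && (flags & TASKQ_DYNAMIC)));
    }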

dweeezil (Contributor, Author) commented Oct 6, 2016

I'd like to follow this up with a patch to fan out the write issue taskq on systems without 16K stacks. The zio dispatching in __zio_execute() can dramatically increase the contention on tq_lock, particularly with the same type of zvol workloads described in the commit message for this PR.
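
To illustrate the fan-out idea, a minimal sketch with illustrative names, not the actual patch (ZFS's real dispatch path in spa.c is similar in spirit): spreading dispatches across a small group of taskqs means each dispatch contends on a different tq_lock.

    #include <stdint.h>

    typedef struct taskq taskq_t;              /* opaque taskq handle */
    typedef struct taskq_ent taskq_ent_t;      /* preallocated dispatch entry */
    extern void taskq_dispatch_ent(taskq_t *, void (*)(void *), void *,
        unsigned int, taskq_ent_t *);
    extern uint64_t gethrtime(void);           /* monotonic nanoseconds */

    /* A fanned-out group of taskqs, so contention spreads over many tq_locks. */
    typedef struct {
        unsigned int   stqs_count;   /* number of taskqs in the group */
        taskq_t      **stqs_taskq;   /* the group members */
    } spa_taskqs_t;

    static void
    fanout_dispatch(spa_taskqs_t *tqs, void (*func)(void *), void *arg,
        unsigned int flags, taskq_ent_t *ent)
    {
        /* Pick a member pseudo-randomly; each pick touches a different lock. */
        taskq_t *tq = tqs->stqs_taskq[gethrtime() % tqs->stqs_count];
        taskq_dispatch_ent(tq, func, arg, flags, ent);
    }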

behlendorf (Contributor) commented
@dweeezil do you have any performance or profiling data you could share on how bad this is?

dweeezil (Contributor, Author) commented Oct 7, 2016

@behlendorf I'm trying to get some solid numbers with lockdep and will post them when available.

tl;dr back-story: my initial observations were based on watching perf top -g and seeing 50-70% CPU utilization in raw_spin_lock_irq as called from taskq_thread() and one other place. Also, from monitoring /proc/spl/taskq and seeing, in some cases, thousands and even hundreds of thousands of pending tasks, it's apparent there's a lot of pressure on this particular taskq. My gut feeling is that this one patch won't make a huge difference by itself; the big win will be the (not-yet-posted) patch to fan out the taskq on non-16K-stack systems. Watching the perf output with both patches on a non-16K-stack system showed the time in raw_spin_lock_irq due to tq_lock vanish, only to be replaced, unfortunately, by extreme contention on another kernel lock (I forget which at the moment, but it's the one involved in disabling set-uid on files being written). This was observed while trying to troubleshoot 3871 (intentionally not referenced) & friends. There have been anecdotal reports that disabling dynamic taskqs helps some people's workloads, and I have to wonder if this might be the cause (also not yet investigated).

mention-bot commented
@dweeezil, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @ahrens and @jas14 to be potential reviewers.

dweeezil (Contributor, Author) commented Oct 8, 2016

My suspicions were correct. Running an fio test:

[test]
    blocksize=8k
    scramble_buffers=1
    disk_util=0
    invalidate=0
    size=256m
    numjobs=32
    create_serialize=1
    direct=1
    filename=/dev/zvol/tank/v1
    offset=0
    offset_increment=10g
    group_reporting=1
    ioengine=libaio
    iodepth=10
    rw=write
    thread=1
    time_based=1
    runtime=1m
    fsync=0

which really exacerbates the issue, there were 917,861 contentions on tq_lock without this patch, dropping only to 875,989 with it. However, when I fanned out the write issue taskq a bit (simply made it 12x5), the contentions on tq_lock dropped to 32,763. I did not record the actual fio results, but they were not much different. Fixing this point of contention simply moves it elsewhere, unfortunately.

I still think this patch is good on its own. Among other things, it would allow us to use the same ASSERT in taskq_create that illumos has (disallowing cpu-proportioned taskqs from being dynamic).

The numbers above are for a stock 3.2 kernel, which doesn't have 16K stacks.

I'm not sure yet what a good strategy would be to fan out the taskq in a manner congruent with the concept of a cpu-proportioned taskq, which is supposed to keep the whole system from melting down. The situation is quite different between low core count systems and larger ones.
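
To make that tension concrete, a toy sizing calculation (made-up policy, not ZFS code): a cpu-proportioned taskq has a total thread budget of, say, 75% of ncpus, and any fixed fan-out must split that budget without either starving the individual taskqs or overshooting on small machines.

    #include <stdio.h>

    /* Sketch: split a pct-of-ncpus thread budget across tq_count taskqs. */
    static void
    size_fanout(int ncpus, int pct, int tq_count)
    {
        int budget = (ncpus * pct) / 100;   /* total thread budget */
        int per_tq = budget / tq_count;
        if (per_tq < 1)
            per_tq = 1;                     /* never leave a taskq threadless */
        printf("%d cpus: %d taskqs x %d threads (budget %d)\n",
            ncpus, tq_count, per_tq, budget);
    }

    int
    main(void)
    {
        size_fanout(4, 75, 5);   /* small box: budget 3, fan-out overshoots it */
        size_fanout(64, 75, 5);  /* big box: 5 x 9 threads fits the budget */
        return 0;
    }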

behlendorf (Contributor) commented
I'm OK with merging this small change as-is, even if it just moves the contention point, since it's pretty clearly the right thing to do. Just let me know how you want to proceed.

> I'm not sure what a good strategy would be yet to fan out the taskq in a manner congruent with the concept of a cpu-proportioned taskq which is supposed to keep the whole system from melting down. The situation is quite different between low core count systems and larger ones.

The first concern should probably be to verify that we can in fact fan this out safely. Are we going to run into problems by dispatching to different taskqs? I don't foresee any issues here, but it's something to keep in mind.

As for scaling this up, perhaps it could be done by creating a taskq per CPU, each with just a small number of threads. We could create a new ZTI_* type for this kind of workload, or change the existing ZTI_MODE_BATCH implementation, since this is its only existing consumer. A rough sketch of the idea follows.
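
A hedged sketch of what that might look like, loosely following the ZTI_* table convention in spa.c; the struct and macro shapes here are simplified illustrations, not the exact spa.c definitions, and the 12x5 value mirrors the experiment above:

    /*
     * Illustrative only: replace the batch (dynamic, cpu-proportioned) write
     * issue taskq with a fanned-out fixed group, e.g. ZTI_P(12, 5): five
     * taskqs of twelve threads each, matching the 12x5 experiment above.
     */
    #define ZTI_MODE_FIXED  0
    #define ZTI_MODE_BATCH  1

    typedef struct {
        int zti_mode;    /* how to size the taskq(s) */
        int zti_value;   /* threads per taskq */
        int zti_count;   /* number of taskqs to fan out across */
    } zio_taskq_info_t;

    #define ZTI_P(n, q)  { ZTI_MODE_FIXED, (n), (q) }  /* q taskqs, n threads */
    #define ZTI_BATCH    { ZTI_MODE_BATCH, 0, 1 }      /* cpu-proportioned */

    /* WRITE/ISSUE entry: before and after the hypothetical fan-out. */
    static const zio_taskq_info_t zio_write_issue_old = ZTI_BATCH;
    static const zio_taskq_info_t zio_write_issue_new = ZTI_P(12, 5);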

behlendorf added the "Type: Performance" label Oct 10, 2016
dweeezil (Contributor, Author) commented
@behlendorf I'd like to see this merged as-is as a reasonable and correct first step. A new ZTI type with a taskq per CPU certainly makes sense, but it's not "compatible" with the existing 75% scheme. Maybe it's not such a big deal now that we're actually running the write issue threads at a (ever so slightly) lower priority.

behlendorf merged commit d33931a into openzfs:master Oct 10, 2016
DeHackEd pushed commits referencing this pull request to DeHackEd/zfs on Oct 19 and Oct 29, 2016, and behlendorf pushed commits referencing it on Feb 2-3, 2017, each carrying the merged change:

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #5236