Timeout waiting for ZVOL device to be created #9380

Merged
merged 1 commit into openzfs:master on Oct 1, 2019

Conversation

prakashsurya
Member

We've seen cases where, after creating a ZVOL, the ZVOL device node in
"/dev" isn't generated after 20 seconds of waiting, which is the point
at which our application gives up on waiting and reports an error.

The workload when this occurs is to "refresh" 400+ ZVOLs roughly at the
same time, based on a policy set by the user. This refresh operation
will destroy the ZVOL, and re-create it based on a snapshot.

When this occurs, we see many hundreds of entries on the "z_zvol" taskq
(based on inspection of the /proc/spl/taskq-all file). Many of the
entries on the taskq end up in the "zvol_remove_minors_impl" function,
and I've measured the latency of that function:

Function = zvol_remove_minors_impl
msecs               : count     distribution
    0 -> 1          : 0        |                                        |
    2 -> 3          : 0        |                                        |
    4 -> 7          : 1        |                                        |
    8 -> 15         : 0        |                                        |
   16 -> 31         : 0        |                                        |
   32 -> 63         : 0        |                                        |
   64 -> 127        : 1        |                                        |
  128 -> 255        : 45       |****************************************|
  256 -> 511        : 5        |****                                    |

That data is from a 10-second sample, collected using the BCC
"funclatency" tool. As it shows, most calls took at least 128ms. Some
basic math (20s / ~0.128s per call ≈ 156) tells us that in any 20-second
interval we could process at most about 150 removals, which is far fewer
than the 400+ that the workload will generate.
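
For reference, a histogram like the one above can be collected for a
kernel function with the BCC "funclatency" tool; the exact path and
flags below are an assumption about a typical setup rather than a quote
from this report (-m prints a millisecond histogram, -d 10 traces for
10 seconds):

    sudo /usr/share/bcc/tools/funclatency -m -d 10 zvol_remove_minors_impl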

As a result of this, and since all ZVOL minor operations go through the
single-threaded "z_zvol" taskq, the latency of creating a single ZVOL
device can become unreasonably large due to other ZVOL activity on the
system. In our case, it's large enough to cause the application to
generate an error and fail the operation.

When profiling the "zvol_remove_minors_impl" function, I saw that most
of its time was spent off-CPU, blocked in "taskq_wait_outstanding". The
way this works is that "zvol_remove_minors_impl" dispatches calls to
"zvol_free" onto the "system_taskq", and then uses
"taskq_wait_outstanding" to wait for all of those dispatched calls to
complete before "zvol_remove_minors_impl" returns.

As far as I can tell, "zvol_remove_minors_impl" doesn't necessarily have
to wait for all calls to "zvol_free" to occur before it returns. Thus,
this change removes the call to "taskq_wait_outstanding", so that calls
to "zvol_free" don't affect the latency of "zvol_remove_minors_impl".

@prakashsurya
Member Author

cc: @sdimitro @jgallag88

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Sep 30, 2019
Contributor

@behlendorf behlendorf left a comment


doesn't necessarily have to wait for all calls to "zvol_free" to occur before it returns.

Since the entire removal process is asynchronous, deferring the free should be OK.

@prakashsurya
Member Author

The failure in the buildbot/Debian 9 x86_64 (BUILD) test appears to be environmental, and not due to this change request:

checking for zlib.h... no
configure: error: in `/var/lib/buildbot/slaves/zfs/Debian_9_x86_64__BUILD_/build/zfs':
configure: error: 
	*** zlib.h missing, zlib-devel package required

@codecov

codecov bot commented Oct 1, 2019

Codecov Report

Merging #9380 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9380      +/-   ##
==========================================
- Coverage   79.07%   79.06%   -0.01%     
==========================================
  Files         404      404              
  Lines      122539   122538       -1     
==========================================
- Hits        96892    96880      -12     
- Misses      25647    25658      +11
Flag       Coverage Δ
#kernel    79.77% <100%> (ø) ⬆️
#user      66.67% <ø> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3283f13...5260983.

@prakashsurya
Member Author

The buildbot/Ubuntu 18.04 x86_64 (TEST) build failed the test case trim/trim_config:

Tests with results other than PASS that are unexpected:
    FAIL trim/trim_config (expected PASS)

The test log shows this:

02:52:51.38 Failure /mnt/trim-vdev1 is 768 MB which is not -gt than 768 MB

I'm unfamiliar with this test, but I'd be surprised if my change is causing this. Please let me know if I should look into this failure further; otherwise, I'll assume it's a flaky test.

@prakashsurya
Member Author

@behlendorf anything more that I need to do for this?

@behlendorf
Contributor

@prakashsurya nope, looks good. I'll get it merged, thanks!

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 1, 2019
@behlendorf behlendorf merged commit 99573cc into openzfs:master Oct 1, 2019