OpenZFS 8585 - improve batching done in zil_commit() #6566
Conversation
@sempervictus this is a second attempt at porting what I previously had in #6337, and attempts to address the same performance issues as #6133. If you can, it'd be great if you could try this out in your test environment. I've fixed the prior issue with msync and the zfstest suite, and was able to successfully run this version through the test suite in a local VM. Additionally, I believe the build issues that you previously hit w.r.t. the new tracepoints have been fixed (tested and verified they worked in my VM). The OpenZFS PR contains links to data on the performance testing that I previously did for this change (on illumos); see the "Performance Testing" section of that PR's description: openzfs/openzfs#447 |
@prakashsurya: adding to the test queue, thanks for digging in so quickly. |
@behlendorf is the
|
@prakashsurya it hasn't been 100% reliable and we've opened #5510 to track the known failure. We see a failure maybe 1 in 50 runs on the 32-bit tester but haven't had a chance to completely root cause and fix the issue. Let me run it through again and see if it's reproducible; below is the link to the previous test failures so we don't lose track of it. http://build.zfsonlinux.org/builders/Ubuntu%2014.04%20i686%20%28TEST%29/builds/3556 |
A bunch of zloop later, still haven't blown it up. Going to push to some metal today, see how it fares on real world ops. |
This is promising - two personal production systems are now running with this PR, send/recv working, zvols being used for KVM instances. Contended writes are a lot better - a double-digit reduction in iowait with multiple VMs running FIO. Thank you @prakashsurya |
appreciate the feedback; thanks @sempervictus. |
@prakashsurya it looks like there's still a failure on the 32-bit testers which needs to be run down. I ran the tests a few times to make sure and it's reproducible. The test case
[edited] I was able to reproduce this issue by running the |
Thanks for the report. Briefly looking at the assertion, I don't know why that'd be occurring; I'll have to dig more into it and maybe set up an environment to reproduce it locally. It appears as though the write record passed into |
@prakashsurya: saw a similar error in a long and laggy send/recv. Not sure if it's the patch or a very slow recv host though - the hung task didn't crash hosts. The original 64b P4 in there is not loving it. Shows how far chips have come in a decade, I guess. |
While running some tests of this patch in a VM, I managed to trip the following assertion:
This was running on a VM with 8 SSDs, 32 vCPUs, 512GB of RAM, and 512 FIO threads performing synchronous writes; I'm not yet sure what the root cause for this is. |
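For context, a synchronous-write workload of the kind described above boils down, at the filesystem level, to many threads repeatedly writing and then forcing the data to stable storage (write + fsync, or O_DSYNC writes), each of which ends up in zil_commit(). The following is a minimal, hypothetical stand-in for one such worker; the path, block size, and iteration count are made up, and this is not the actual FIO job used in these tests.

```c
/*
 * Minimal stand-in for one synchronous-write worker thread: each
 * fsync() forces the ZIL to commit the preceding write.  All values
 * here (path, block size, iteration count) are illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const size_t bs = 8192;
	char *buf = malloc(bs);
	int fd = open("/tank/fs/syncfile", O_WRONLY | O_CREAT, 0644);

	if (fd == -1 || buf == NULL) {
		perror("setup");
		return (1);
	}
	memset(buf, 0xab, bs);

	for (int i = 0; i < 100000; i++) {
		if (pwrite(fd, buf, bs, (off_t)i * bs) != (ssize_t)bs ||
		    fsync(fd) != 0) {
			perror("sync write");
			return (1);
		}
	}

	close(fd);
	free(buf);
	return (0);
}
```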
I believe I saw that on a prior revision of this PR. Couldn't repro however.
|
OK, that's good to know. I can reproduce this every time I run this workload on this system, so I hope to root cause it sometime this week (it's a shame not having |
We've had this running for a few weeks, including a few production hosts backing cinder with PCIe SSDs for SLOG on spinning pools. Writes are definitely snappier and the stalls are greatly reduced under concurrent load. If there's no blocker outstanding, I think this should get merged for wider eval. |
Just to understand this a bit better: this doesn't apply with no dedicated ZIL device connected, right? [from sempervictus' comment, dedicated devices appear to be needed to benefit from the improvements] In any case those lower latencies and higher IOPS are impressive! Thanks a lot! |
@kernelOfTruth The "ZIL" cannot be removed, so I think you meant "SLOG" right? If so, yes, this should improve latency even with no SLOG connected. My performance tests done on illumos/delphixos were done without an SLOG, and showed improved latencies; see here for more details. @sempervictus There's two blockers which I haven't yet gotten around to fixing:
IMO, both of these should be investigated and fixed prior to landing. |
@prakashsurya yes, sorry, I meant SLOG device; everything I was reading on the web only ever called it a "ZIL device" [the following seems to be one article that uses the correct term: https://www.ixsystems.com/blog/o-slog-not-slog-best-configure-zfs-intent-log/ ] Thanks |
@prakashsurya: regarding the crash you experienced, I'm rather suspicious of your hypervisor. Type 2 HVs are known for doing "the wrong thing" when it comes to IO, specifically as it relates to synchronous semantics/commit order/other things sane people expect to work. VBox is most guilty of this, but any type 2 is suspect. Can you repro on ESXi? @kernelOfTruth: Sorry, it wasn't my intent to say it's required. We run plenty of SSD-only pools as well, but the majority of data serviced by our cinder-volume containers resides on mirrored spans with PCI-based SLOG mirrors and SAS SSD cache spans. That's pretty much the most abusive consistent consumer we have (clouds don't sleep, apparently), so I was referencing it as the high-water mark. |
@sempervictus no worries - now I know the internals a bit better :) Makes sense to employ the change where it matters the most - ZFS and the hardware must be beaten pretty heavily in that usage scenario - that'll (as shown) most likely also reveal any existing, dormant issues (if applicable) 👍 |
@sempervictus For my performance testing, and the panic I linked to, I was using VMware for the hypervisor. It was the same server/hypervisor that I used for the tests that I ran on illumos.
|
@prakashsurya if you get a chance, can you rebase this one more time on master? I'd like to see if the 32-bit fix from #6734, which was merged today, has any impact on the i686 test results. |
@behlendorf done. |
That's a bit better, no ASSERT. Those all look like known spurious test failures, which suggests the root cause for this issue may be aggravated by a low memory situation. I'll send the build through a few more times to see if we always get the same set of test case failures. |
To make sure I understand, you're saying the 32-bit failure was due to an unrelated bug (i.e. not in this patch) that's been fixed? (pending re-running the tests a few more times) |
Not exactly. The 32-bit bug, which was recently fixed, caused the ARC to miscalculate the amount of available free memory because it wasn't excluding free highmem pages. As a result the ARC wasn't shrinking when it needed to, which put low memory stress on the rest of the system. This definitely resulted in OOM events and possibly memory allocation failures. How that ended up causing the ASSERT isn't clear. However, since resolving that issue I've been unable to reproduce the ASSERT reported above in any of my 32-bit testing. The additional buildbot runs (4 so far) also only showed known spurious, unrelated ZTS failures. And in fact we did get one entirely clean test run. My suspicion is that the flaw isn't 32-bit specific but was simply easier to reproduce there due to this other problem. The ASSERT probably could happen on an x86_64 system too given the right circumstances. But since we don't actually know the root cause, that's all really just guesswork. @sempervictus @prakashsurya I'd be open to merging this for broader testing, assuming we're no longer able to reproduce either of the two above ASSERTs, and we can verify that merging this doesn't increase the frequency of those unrelated ZTS test failures. |
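To illustrate the kind of miscalculation being described (this is a hedged sketch, not the actual change from #6734): on a 32-bit kernel, free highmem pages cannot back ARC buffers, so counting them as "available" makes the ARC think it has more headroom than it really does. A simplified version of the corrected accounting might look like the following; arc_available_memory_sketch() is an invented name.

```c
/*
 * Hedged sketch of the idea behind the 32-bit fix discussed above:
 * exclude free highmem pages when estimating how much memory the ARC
 * can actually use.  This is for illustration only and is not the
 * code merged in #6734.
 */
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/highmem.h>

static int64_t
arc_available_memory_sketch(void)
{
	int64_t pages = nr_free_pages();

#ifdef CONFIG_HIGHMEM
	/* The ARC can't allocate from highmem, so don't count it as free. */
	pages -= nr_free_highpages();
#endif

	return (pages << PAGE_SHIFT);
}
```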
Does this PR change on-disk structures at all in the ZIL? Trying to import a pool which suffered a crash using master-only builds results in:
|
Good news: while my pool is dead as a doornail, the cause did not come from here. I rebuilt and retested with all the PRs/commits in my test stack atop master and seem to have localized this to the datto work around scrubs and the likely abandoned crypt fmt fix branch. I'll keep a long-running zloop on this build to try and work out any kinks, but as far as my issues go, I'm 99% sure this didn't cause them; it just happened to be in play when the fan was hit. |
@sempervictus no ZIL format changes in this PR or the scrub work. If you do have any suspicions about the current scrub patches please let me know now since those will likely be merged this week. |
@prakashsurya I was looking at the
In all of these rare cases it looks like the right thing to do is abort the TX and return NULL. What do you think about removing this ASSERT and updating the comment accordingly? Incidentally, this isn't a new issue introduced by this PR, but since we've been focusing on the ZIL we may just be more aware of it.
|
@sempervictus I believe @prakashsurya's fix in e93b150 could explain your corrupt ZIL record. |
Could that result in the bad dva issue?
|
@sempervictus I believe so under an unlikely set of circumstances. You'd need to write a ZIL record which included a dva which was damaged due to the memory it was stored in being free'd and reused as described in e93b150. Then you'd have to replay that damaged ZIL record on mount. You'd need to be really really unlucky, but I think it could happen. |
Now that I have a working implementation, I was able to run the same performance tests that I used on illumos, to test this PR on Ubuntu 16.04. See here for the results. @behlendorf we can discuss the results in detail after the holiday, but the SSD numbers are a little concerning. I've copied the graphs of the percentage change in latency between master and this PR below (these are also contained in the Jupyter notebooks which I linked to above): Using HDDs -- each
|
In the graphs, does the term "using" indicate disks used for SLOG or for data?
|
@prakashsurya: the pool which keeps getting "murdolated" in #6856 is a 5-member RAIDZ1 pool consisting of 250G Samsung EVO 850s on a SAS HBA. A 5-wide 2-way mirror span running its disks and cache from the same HBA (SLOGs are M.2s mirrored on PCI for that pool) is unaffected during these events, as is a 4-member RAID1 1T EVO 850 pool (but that's seeing no IO, and is on separate datapaths internally). It's getting killed, repeatedly, on a build consisting of nothing but master (at time of build) and this PR. I just refreshed the system with the same thing built a few hours ago with your updates along with those that went into master, destroyed the unimportable pool (again), rebuilt it, reseeded it with the VMs and containers it needed from their last known good snap, and lost the thing again in a matter of days. Something's rotten in the state of Denmark, and I'm apparently down-wind. |
@sempervictus For my tests, I was not using an SLOG. "Using" simply means the disks used by the zpool itself; i.e. I had pools with 1 to 8 HDDs, and pools with 1 to 8 SSDs (no redundancy, no SLOG). As far as your failures go: you're still seeing problems with the latest code of this PR, including my follow-up fixes? Are the issues seen on ZIL replay (i.e. initial mount), or after the FS/dataset is mounted? |
@prakashsurya: I've had to pull this PR out of the stack because it seems to be causing two distinct issues: |
@sempervictus were you able to confirm that removing this PR from your stack resolved the two issues you mentioned?
It depends exactly what's wrong with the DVA, but in general yes. |
I've been loosely following this issue, along with #6874, and it seems that the bad blkptrs are being written to the pool on systems which do have ECC memory. In that case, it might be useful to put some calls to |
@dweeezil good thought. For debugging it might be useful to additionally verify the blkptr in |
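As a rough illustration of the sort of check being suggested (the exact call sites are elided above, and the helper name below is invented for this sketch, not taken from the actual debugging patch), a debug-only DVA sanity test could look roughly like this:

```c
/*
 * Illustrative DVA/blkptr sanity check: before issuing or replaying a
 * ZIL write, confirm each DVA references a known top-level vdev and
 * lies within that vdev's allocatable space.  The function name is
 * invented for this sketch; the caller is assumed to hold the spa
 * config lock required by vdev_lookup_top().
 */
#include <sys/spa.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>

static boolean_t
zil_blkptr_sane(spa_t *spa, const blkptr_t *bp)
{
	for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
		const dva_t *dva = &bp->blk_dva[d];
		vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));

		if (vd == NULL)
			return (B_FALSE);

		if (DVA_GET_OFFSET(dva) + DVA_GET_ASIZE(dva) > vd->vdev_asize)
			return (B_FALSE);
	}

	return (B_TRUE);
}
```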
@behlendorf I rebased this onto the latest master, and squashed to a single commit. The only failure I see is the following, from the 32-bit tester:
Is this a known flaky test? This doesn't seem related to my changes at first glance. |
@prakashsurya thanks! That's a known spurious failure; I've resubmitted that build. |
Authored by: Prakash Surya <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Prakash Surya <[email protected]>

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

1. When there are outstanding ZIL blocks being written (i.e. there's
   already a "writer thread" in progress), any new calls to zil_commit()
   will block, waiting for the currently outstanding ZIL blocks to
   complete. The blocks written for each "writer thread" are coined a
   "batch", and there can only ever be a single "batch" being written at
   a time. When a batch is being written, any new ZIL transactions will
   have to wait for the next batch to be written, which won't occur
   until the current batch finishes.

   As a result, the underlying storage may not be used as efficiently as
   possible. While "new" threads enter zil_commit() and are blocked
   waiting for the next batch, it's possible that the underlying storage
   isn't fully utilized by the current batch of ZIL blocks. In that
   case, it'd be better to allow these new threads to generate (and
   issue) a new ZIL block, such that it could be serviced by the
   underlying storage concurrently with the other ZIL blocks that are
   being serviced.

2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
   to complete before zil_commit() returns. The size of any given batch
   is proportional to the number of ZIL transactions in the queue at the
   time the batch starts processing the queue, which doesn't occur until
   the previous batch completes. Thus, if there are a lot of
   transactions in the queue, the batch could be composed of many ZIL
   blocks, and each call to zil_commit() will have to wait for all of
   these writes to complete (even if the thread calling zil_commit()
   only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

3. If a given batch takes longer to complete than normal, this results
   in larger batch sizes, which then take longer to complete and further
   drive up the latency of zil_commit(). This can occur for a number of
   reasons, including (but not limited to): transient changes in the
   workload, and storage latency irregularities.

Solution
========

The solution attempted by this change has the following goals:

1. no on-disk changes; maintain the current on-disk format.
2. make the "batch size" equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk while other
   batches are already being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible for the same amount of ZIL
   transactions, without introducing significant latency to any
   individual ZIL transaction; i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the
following improvements:

1. new ZIL blocks can be generated and issued while outstanding ZIL
   blocks are already being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
   storage latency, rather than the incoming synchronous workload.

Porting Notes
=============

Due to the changes made in commit 119a394, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx structure
is kept around until the data associated with the itx is considered to
be safe on disk; this is so that the itx's callback can be called after
the data is committed to stable storage. Since OpenZFS doesn't have
this itx callback mechanism, it's able to destroy the itx structure
immediately after the itx is committed to an lwb (before the lwb is
written to disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

* A list of itxs was added to the lwb structure. This list contains all
  of the itxs that have been committed to the lwb, such that the
  callbacks for these itxs can be called from
  zil_lwb_flush_vdevs_done(), after the data for the itxs is committed
  to disk.

* A list of itxs was added on the stack of the
  zil_process_commit_list() function; the "nolwb_itxs" list. In some
  circumstances, an itx may not be committed to an lwb (e.g. if
  allocating the "next" ZIL block on disk fails), so this list is used
  to keep track of which itxs fall into this state, such that their
  callbacks can be called after the ZIL's writer pipeline is "stalled".

* The logic to actually call the itx's callback was moved into the
  zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
  were effectively performing the same logic (i.e. if callback is
  non-null, call the callback), it seemed like useful code cleanup to
  consolidate this logic into a single function.

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
  removed from "trace_zil.h" as well.

* Some of the zilog structure's fields were removed, which affected the
  tracepoint definitions of the structure.

* New tracepoints had to be added for the following 3 new probes:
  * zil__process__commit__itx
  * zil__process__normal__itx
  * zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: openzfs/openzfs@5d95a3a
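The porting note above about consolidating the callback logic can be made concrete with a short sketch. Assuming simplified allocation details (the real function also carries assertions, and the exact free path may differ), the consolidated destroy path described there amounts to:

```c
/*
 * Sketch of the consolidation described in the porting notes: every
 * former caller of zil_itx_destroy() open-coded "invoke the callback
 * if one is registered", so that logic now lives in the destroy path
 * itself and runs exactly once, as the itx is torn down.  Allocation
 * and free details are simplified for illustration.
 */
#include <sys/zil.h>
#include <sys/kmem.h>

void
zil_itx_destroy(itx_t *itx)
{
	/* Fire the commit callback, if one was registered. */
	if (itx->itx_callback != NULL)
		itx->itx_callback(itx->itx_callback_data);

	kmem_free(itx, offsetof(itx_t, itx_lr) + itx->itx_lr.lrc_reclen);
}
```

With this in place, zil_lwb_flush_vdevs_done() can simply walk the lwb's itx list and call zil_itx_destroy() on each entry once the lwb's data (and the vdev flushes) have completed, which is exactly when the callbacks are safe to fire.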
I've rerun @prakashsurya's performance tests on a different system and those results are more in line with what was observed on illumos. This change improves ZIL performance for the majority of workloads. The one exception, which can be further investigated after the merge, is a penalty for single-threaded applications issuing sync IO at the maximum rate. During all of my testing I did not observe any functional issues. This LGTM.
Codecov Report
@@            Coverage Diff             @@
##            master    #6566      +/-  ##
==========================================
- Coverage     75.2%    75.2%   -0.01%
==========================================
  Files          296      298       +2
  Lines        95200    95521     +321
==========================================
+ Hits         71599    71833     +234
- Misses       23601    23688      +87
Continue to review full report at Codecov.
|