-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix several ztest failures #8010
Conversation
b0fcf41
to
c25607e
Compare
cmd/ztest/ztest.c
Outdated
* Wait for the delay timer while checking occasionally | ||
* if we should stop. | ||
*/ | ||
if (gethrtime() - last_run < delay) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: personally I find this easier to grok, if (gethrtime() > last_run + delay) { }
4b35b26
to
3200d25
Compare
cmd/ztest/ztest.c
Outdated
@@ -523,10 +523,20 @@ _umem_logging_init(void) | |||
static void | |||
dump_debug_buffer(void) | |||
{ | |||
if (!ztest_dump_debug_buffer) | |||
ssize_t ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's declare this as int ret __attribute__((unused));
, then you can drop the confusing (void) ret;
below and the last phrase of the comment. The attribute is supported by both clang
and gcc
.
module/zfs/zfs_debug.c
Outdated
|
||
/* | ||
* We use write() in this function instead of printf() | ||
* so it is sae to call from a signal handler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/sae/safe
module/zfs/zfs_debug.c
Outdated
* GCC prints a warning if we ignore the return value | ||
* of write(), even if we cast the result to void. | ||
*/ | ||
(void) ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int ret __attribute__((unused));
ee90a59
to
157f98d
Compare
1f7f0bf
to
5c26583
Compare
147ec82
to
462cde0
Compare
Codecov Report
@@ Coverage Diff @@
## master #8010 +/- ##
==========================================
- Coverage 78.5% 78.42% -0.09%
==========================================
Files 377 377
Lines 114447 114488 +41
==========================================
- Hits 89850 89784 -66
- Misses 24597 24704 +107
Continue to review full report at Codecov.
|
d616c0a
to
802f844
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These all LGTM and I'd like to move forward with getting them merged. They don't address all the ztest
failures which have been observed but they do significantly improve the debugging and resolve several of the most common failures.
802f844 Fix waiting in ztest_device_removal()
462cde0 Fix ENXIO from spa_ld_verify_logs() in ztest
21b99f1 Fix ztest deadman panic with indirect vdev damage
ee92121 Fix issue with scanning dedup blocks as scan ends
8b58758 Fix lock inversion in txg_sync_thread()
93b50ed Fix dbgmsg printing in ztest and zdb
11f6d39 Fix ASSERT in zil_create() during ztest
6ce063b Fix random ztest_deadman_thread failures
As for this change we may find we still want to make it, or something similar, but additional investigation is still needed to get to the root cause.
d7dd8e9 ztest: convert active removal flag to lock
I'll give a review for this by tomorrow EOD the latest. One quick question though: |
@sdimitro none of the changes in this PR have been integrated yet, but several of them do fix issues which were introduced by recent features. In practice, I believe we've only encountered them with The goal is to get
I'm not sure what you mean by this. You can individually inspect the fixes by selecting each commit hash. They've been grouped in to this PR to make testing easier, I agree we don't want to squash them when merging. If it's helpful for review purposes I'm sure we can round up some examples of each of these failures. |
module/zfs/dsl_scan.c
Outdated
* completed. This check ensures we do not sio's during | ||
* this period. | ||
*/ | ||
if (scn->scn_done_txg != 0 && scn->scn_bytes_pending == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to check scn_bytes_pending?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to match the check in dsl_scan_sync()
. We don't necessarily need that checkdepending on what we want. Having it there basically just ensures that if the scan is running we still scan incoming dedup blocks, even if the scan if finishing. Note that the "finishing" part of scanning can still take several hours depending on the block sizes and RAM given to the dsl scan queues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which check in dsl_scan_sync()? And what exactly do you mean by "finishing"? Once scn_done_txg is set, could we still wait hours?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which check in dsl_scan_sync()?
else if (scn->scn_is_sorted && scn->scn_bytes_pending != 0) {
And what exactly do you mean by "finishing"? Once scn_done_txg is set, could we still wait hours?
Yes. scn_done_txg
is set when the last block of the scan is placed on one of the per-vdev scanning queues. The scan is not complete, however until all of the queued I/O has been issued. Back when I was testing this code, I saw this take about an hour on my test server, but I imagine it could be longer with other configurations. This check basically says "we won't add anything more to the queues after the normal scanning has completed adding its I/Os to the queues".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I wasn't that familiar with the changes made due to sorted scan here. Once scn_done_txg
is set, we've found all the blocks to scan (presumably including dedup blocks). Once that has happened, do we still need to do explicit scans in dsl_scan_ddt_entry()? I don't think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. ill make the change.
@@ -7384,7 +7384,7 @@ main(int argc, char **argv) | |||
* segments can be reconstructed in a reasonable amount of time | |||
* when reconstruction is known to be possible. | |||
*/ | |||
zfs_reconstruct_indirect_damage_fraction = 4; | |||
zfs_reconstruct_indirect_damage_fraction = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain this a bit more? why can't we do vdev_indirect_splits_damage() every time? Why is 1/100th of the time OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was from @behlendorf. We were seeing several cases where reconstruction was taking a very long time. In fact, with 16 segments we were seeing ztest deadman timeouts. We wanted to reduce this workload so ztest has time to test other things so we lowered the max damaged segments to 12 and decreased how often they would occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so this isn't an issue of correctness (too much damage causing it to be unrecoverable), just performance (too much damage causing us to spend too much time doing successful recovery).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly.
@@ -1632,7 +1632,7 @@ vdev_indirect_splits_damage(indirect_vsd_t *iv, zio_t *zio) | |||
} | |||
|
|||
iv->iv_attempts_max *= 2; | |||
if (iv->iv_attempts_max > (1ULL << 16)) { | |||
if (iv->iv_attempts_max >= (1ULL << 12)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this related to the ztest fix? Or is this just making it run faster by not damaging when it would cause more than 2^12 tries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above. with the existing code we can have 17 damaged segments which was causing the ztest deadman to trigger fairly often.
@@ -3529,13 +3528,9 @@ ztest_device_removal(ztest_ds_t *zd, uint64_t id) | |||
uint64_t guid; | |||
int error; | |||
|
|||
mutex_enter(&ztest_removal_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the old code was effectively a tryenter, do we want to use that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was also from @behlendorf. I think the idea here was to make sure that this code path actually gets to run. As he mentioned, we might want to change this since a lot of these code paths are actually valid to race against each other in the actual kernel module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may, in fact as I mentioned in my comment, think we should hold off entirely on the boolean -> lock conversion patch for the moment. It's intention was to prevent racing between fault injection and removal, but we don't actually have clear evidence yet that it's needed.
@@ -533,10 +533,16 @@ _umem_logging_init(void) | |||
static void | |||
dump_debug_buffer(void) | |||
{ | |||
if (!ztest_dump_debug_buffer) | |||
ssize_t ret __attribute__((unused)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can discard the return value with (void)
as we do elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that. GCC still complains. The only way to get it to shut up was to save the return value. 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the error from gcc? (void)
works in other places in this file (at least, for other functions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: ignoring return value of ‘ret’, declared with attribute warn_unused_result [-Wunused-result]
This is a discussion which goes way back https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66425. The short version is the behavior here isn't clearly defined by the standard and is compiler specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh. Any idea why all the other (void)
's in this file work fine, but it won't work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has something to do with the write()
call and its attributes specifically, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For several years now most of the standard glibc functions have had their prototypes updated to include the warn_unused_result function attribute, which forces the warning. Which is why we only see this in some places.
#define __wur __attribute_warn_unused_result__
extern ssize_t write (int __fd, const void *__buf, size_t __n) __wur;
edit: I should add that I'm not actually sure the other (void)
's in this file do anything, other than act as documentation. At least with gcc
.
@@ -866,7 +873,7 @@ process_options(int argc, char **argv) | |||
usage(B_FALSE); | |||
break; | |||
case 'G': | |||
ztest_dump_debug_buffer = B_TRUE; | |||
zo->zo_dump_dbgmsg = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change this from a boolean_t to an int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a minor thing. The variable in the struct is an int and I wanted to match zo->zo_mmp_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change the variable from a boolean_t to an int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable needed to be moved into zo
so that it could properly be shared between processes. This struct did not have any boolean fields, even for fields such as zo_mmp_test
which are clearly logically boolean. I assumed this might have something to do with ensuring alignment so I was just following the precedent.
@@ -212,11 +215,28 @@ __dprintf(const char *file, const char *func, int line, const char *fmt, ...) | |||
void | |||
zfs_dbgmsg_print(const char *tag) | |||
{ | |||
(void) printf("ZFS_DBGMSG(%s):\n", tag); | |||
ssize_t ret __attribute__((unused)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(void)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
lib/libzpool/kernel.c
Outdated
|
||
/* | ||
* Get rid of trailing newline. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? are we zfs_dbgmsg()-ing things that have newlines? Seems like we should fix that rather than changing the string after the fact (in some places but not others).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was taken directly from the zfs_debug.c code. I just wanted to ensure they behaved the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's fix zfs_debug.c too then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. I think I see why this was done. dprintf()
strings (by convention it seems) have a trailing newline while zfs_dbgmsg()
strings do not. After looking around, I don't see any examples of multi-line calls to either of these functions. Not sure what the best way to handle this is....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps move this to the if dprintf
section then, and change it to remove only a trailing newline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I can do that.
lib/libzpool/kernel.c
Outdated
*/ | ||
nl = strrchr(buf, '\n'); | ||
if (nl != NULL) | ||
*nl = '\0'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't just get rid of trailing newlines, it gets rid of the last newline, which may have more text after it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
Closing. All but 53cb0bf were integrated. |
The zloop test has been failing in buildbot for the last few weeks with various failures in ztest_deadman_thread(). This is due to the fact that this thread is not stopped when performing pool import / export tests as it should be. This patch simply corrects this. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch corrects an ASSERT in zil_create() that will only be true if the call to zio_alloc_zil() does not fail. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch resolves a problem where the -G option in both zdb and ztest would cause the code to call __dprintf() to print zfs_dbgmsg output. This function was not properly wired to add messages to the dbgmsg log as it is in userspace and so the messages were simply dropped. This patch also tries to add some degree of distinction to dprintf() (which now prints directly to stdout) and zfs_dbgmsg() (which adds messages to an internal list that can be dumped with zfs_dbgmsg_print()). In addition, this patch corrects an issue where ztest used a global variable to decide whether to dump the dbgmsg buffer on a crash. This did not work because ztest spins up more instances of itself using execv(), which did not copy the global variable to the new process. The option has been moved to the ztest_shared_opts_t which already exists for interprocess communication. This patch also changes zfs_dbgmsg_print() to use write() calls instead of printf() so that it will not fail when used in a signal handler. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes a lock inversion issue in txg_sync_thread() where the code would attempt hold the spa config lock as a reader while holding tx->tx_sync_lock. This races with spa_vdev_remove() which attempts to hold the tx->tx_sync_lock to assign a new tx (via spa_history_log_internal()) while holding the spa config lock as a writer. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes an issue discovered by ztest where dsl_scan_ddt_entry() could add I/Os to the dsl scan queues between when the scan had finished all required work and when the scan was marked as complete. This caused the scan to spin indefinitely without ending. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes an issue where ztest's deadman thread would trigger a panic because reconstructing artifically damaged blocks would take too long to reconstruct. This patch simply limits how often ztest inflicts split-block damage and how many segments it can damage when it does. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes a small issue where the zil_check_log_chain() code path would hit an EBUSY error. This would occur when 2 threads attempted to call metaslab_activate() at the same time. In this case, the "loser" would receive an error code which should have been ignored, but was instead floated to the caller. This ended up resulting in an ENXIO being returned from from spa_ld_verify_logs(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
spa->spa_vdev_removal is created in a sync task that is initiated via dsl_sync_task_nowait(). Since the task may not run before spa_vdev_remove() returns, we must wait at least 1 txg to ensure that the removal struct has been created. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch corrects 2 small bugs where scn->scn_phys_cached was not properly updated to match the primary copy when it needed to be. The first resulted in the pause state not being properly updated and the second resulted in the cached version being completely zeroed even if the primary was not. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
The zloop test has been failing in buildbot for the last few weeks with various failures in ztest_deadman_thread(). This is due to the fact that this thread is not stopped when performing pool import / export tests as it should be. This patch simply corrects this. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch corrects an ASSERT in zil_create() that will only be true if the call to zio_alloc_zil() does not fail. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch resolves a problem where the -G option in both zdb and ztest would cause the code to call __dprintf() to print zfs_dbgmsg output. This function was not properly wired to add messages to the dbgmsg log as it is in userspace and so the messages were simply dropped. This patch also tries to add some degree of distinction to dprintf() (which now prints directly to stdout) and zfs_dbgmsg() (which adds messages to an internal list that can be dumped with zfs_dbgmsg_print()). In addition, this patch corrects an issue where ztest used a global variable to decide whether to dump the dbgmsg buffer on a crash. This did not work because ztest spins up more instances of itself using execv(), which did not copy the global variable to the new process. The option has been moved to the ztest_shared_opts_t which already exists for interprocess communication. This patch also changes zfs_dbgmsg_print() to use write() calls instead of printf() so that it will not fail when used in a signal handler. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes a lock inversion issue in txg_sync_thread() where the code would attempt hold the spa config lock as a reader while holding tx->tx_sync_lock. This races with spa_vdev_remove() which attempts to hold the tx->tx_sync_lock to assign a new tx (via spa_history_log_internal()) while holding the spa config lock as a writer. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes an issue discovered by ztest where dsl_scan_ddt_entry() could add I/Os to the dsl scan queues between when the scan had finished all required work and when the scan was marked as complete. This caused the scan to spin indefinitely without ending. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes an issue where ztest's deadman thread would trigger a panic because reconstructing artifically damaged blocks would take too long to reconstruct. This patch simply limits how often ztest inflicts split-block damage and how many segments it can damage when it does. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes a small issue where the zil_check_log_chain() code path would hit an EBUSY error. This would occur when 2 threads attempted to call metaslab_activate() at the same time. In this case, the "loser" would receive an error code which should have been ignored, but was instead floated to the caller. This ended up resulting in an ENXIO being returned from from spa_ld_verify_logs(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
spa->spa_vdev_removal is created in a sync task that is initiated via dsl_sync_task_nowait(). Since the task may not run before spa_vdev_remove() returns, we must wait at least 1 txg to ensure that the removal struct has been created. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch corrects 2 small bugs where scn->scn_phys_cached was not properly updated to match the primary copy when it needed to be. The first resulted in the pause state not being properly updated and the second resulted in the cached version being completely zeroed even if the primary was not. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
The zloop test has been failing in buildbot for the last few weeks with various failures in ztest_deadman_thread(). This is due to the fact that this thread is not stopped when performing pool import / export tests as it should be. This patch simply corrects this. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch corrects an ASSERT in zil_create() that will only be true if the call to zio_alloc_zil() does not fail. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch resolves a problem where the -G option in both zdb and ztest would cause the code to call __dprintf() to print zfs_dbgmsg output. This function was not properly wired to add messages to the dbgmsg log as it is in userspace and so the messages were simply dropped. This patch also tries to add some degree of distinction to dprintf() (which now prints directly to stdout) and zfs_dbgmsg() (which adds messages to an internal list that can be dumped with zfs_dbgmsg_print()). In addition, this patch corrects an issue where ztest used a global variable to decide whether to dump the dbgmsg buffer on a crash. This did not work because ztest spins up more instances of itself using execv(), which did not copy the global variable to the new process. The option has been moved to the ztest_shared_opts_t which already exists for interprocess communication. This patch also changes zfs_dbgmsg_print() to use write() calls instead of printf() so that it will not fail when used in a signal handler. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes a lock inversion issue in txg_sync_thread() where the code would attempt hold the spa config lock as a reader while holding tx->tx_sync_lock. This races with spa_vdev_remove() which attempts to hold the tx->tx_sync_lock to assign a new tx (via spa_history_log_internal()) while holding the spa config lock as a writer. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes an issue discovered by ztest where dsl_scan_ddt_entry() could add I/Os to the dsl scan queues between when the scan had finished all required work and when the scan was marked as complete. This caused the scan to spin indefinitely without ending. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes an issue where ztest's deadman thread would trigger a panic because reconstructing artifically damaged blocks would take too long to reconstruct. This patch simply limits how often ztest inflicts split-block damage and how many segments it can damage when it does. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch fixes a small issue where the zil_check_log_chain() code path would hit an EBUSY error. This would occur when 2 threads attempted to call metaslab_activate() at the same time. In this case, the "loser" would receive an error code which should have been ignored, but was instead floated to the caller. This ended up resulting in an ENXIO being returned from from spa_ld_verify_logs(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
spa->spa_vdev_removal is created in a sync task that is initiated via dsl_sync_task_nowait(). Since the task may not run before spa_vdev_remove() returns, we must wait at least 1 txg to ensure that the removal struct has been created. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This patch corrects 2 small bugs where scn->scn_phys_cached was not properly updated to match the primary copy when it needed to be. The first resulted in the pause state not being properly updated and the second resulted in the cached version being completely zeroed even if the primary was not. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#8010
This PR contains several patches which fix several common failures that buildbot is currently hitting when running
zloop.sh
. This patch also fixes a few long-standing bugs that prevented various debug output from being printed during ztest execution.Types of changes
Checklist:
Signed-off-by
.