Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix several ztest failures #8010

Closed
wants to merge 10 commits into from
Closed

Conversation

tcaputi
Copy link
Contributor

@tcaputi tcaputi commented Oct 10, 2018

This PR contains several patches which fix several common failures that buildbot is currently hitting when running zloop.sh. This patch also fixes a few long-standing bugs that prevented various debug output from being printed during ztest execution.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@tcaputi tcaputi added the Status: Work in Progress Not yet ready for general review label Oct 10, 2018
@tcaputi tcaputi force-pushed the ztest_deadman_fix branch 2 times, most recently from b0fcf41 to c25607e Compare October 11, 2018 20:42
* Wait for the delay timer while checking occasionally
* if we should stop.
*/
if (gethrtime() - last_run < delay) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: personally I find this easier to grok, if (gethrtime() > last_run + delay) { }

cmd/ztest/ztest.c Show resolved Hide resolved
@tcaputi tcaputi force-pushed the ztest_deadman_fix branch 4 times, most recently from 4b35b26 to 3200d25 Compare October 16, 2018 09:44
@@ -523,10 +523,20 @@ _umem_logging_init(void)
static void
dump_debug_buffer(void)
{
if (!ztest_dump_debug_buffer)
ssize_t ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's declare this as int ret __attribute__((unused));, then you can drop the confusing (void) ret; below and the last phrase of the comment. The attribute is supported by both clang and gcc.

lib/libzpool/kernel.c Show resolved Hide resolved

/*
* We use write() in this function instead of printf()
* so it is sae to call from a signal handler.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/sae/safe

* GCC prints a warning if we ignore the return value
* of write(), even if we cast the result to void.
*/
(void) ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int ret __attribute__((unused));

@tcaputi tcaputi force-pushed the ztest_deadman_fix branch 3 times, most recently from ee90a59 to 157f98d Compare October 17, 2018 00:30
@tcaputi tcaputi force-pushed the ztest_deadman_fix branch 2 times, most recently from 147ec82 to 462cde0 Compare October 19, 2018 22:32
@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Oct 19, 2018
@behlendorf behlendorf requested review from ahrens and sdimitro October 19, 2018 22:47
@codecov
Copy link

codecov bot commented Oct 20, 2018

Codecov Report

Merging #8010 into master will decrease coverage by 0.08%.
The diff coverage is 77.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8010      +/-   ##
==========================================
- Coverage    78.5%   78.42%   -0.09%     
==========================================
  Files         377      377              
  Lines      114447   114488      +41     
==========================================
- Hits        89850    89784      -66     
- Misses      24597    24704     +107
Flag Coverage Δ
#kernel 78.72% <62.5%> (-0.07%) ⬇️
#user 67.18% <75.58%> (-0.29%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e871a8f...393b335. Read the comment docs.

@tcaputi tcaputi force-pushed the ztest_deadman_fix branch 3 times, most recently from d616c0a to 802f844 Compare October 22, 2018 18:55
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These all LGTM and I'd like to move forward with getting them merged. They don't address all the ztest failures which have been observed but they do significantly improve the debugging and resolve several of the most common failures.

802f844 Fix waiting in ztest_device_removal()
462cde0 Fix ENXIO from spa_ld_verify_logs() in ztest
21b99f1 Fix ztest deadman panic with indirect vdev damage
ee92121 Fix issue with scanning dedup blocks as scan ends
8b58758 Fix lock inversion in txg_sync_thread()
93b50ed Fix dbgmsg printing in ztest and zdb
11f6d39 Fix ASSERT in zil_create() during ztest
6ce063b Fix random ztest_deadman_thread failures

As for this change we may find we still want to make it, or something similar, but additional investigation is still needed to get to the root cause.

d7dd8e9 ztest: convert active removal flag to lock

@sdimitro
Copy link
Contributor

I'll give a review for this by tomorrow EOD the latest.

One quick question though:
From Brian's comment above it seems some of the code on this PR has made it to mainline. Can we update the PR so it is clear what problem we are trying to solve with this PR and also have the diffs showing only the new changes?

@behlendorf
Copy link
Contributor

@sdimitro none of the changes in this PR have been integrated yet, but several of them do fix issues which were introduced by recent features. In practice, I believe we've only encountered them with ztest.

The goal is to get ztest to the point where it can run indefinitely without encountering any unexpected failures. The first 8 commits I referenced above improve the debug-ability and address several of the most commonly observed failures, but by no means all of them.

also have the diffs showing only the new changes?

I'm not sure what you mean by this. You can individually inspect the fixes by selecting each commit hash. They've been grouped in to this PR to make testing easier, I agree we don't want to squash them when merging. If it's helpful for review purposes I'm sure we can round up some examples of each of these failures.

* completed. This check ensures we do not sio's during
* this period.
*/
if (scn->scn_done_txg != 0 && scn->scn_bytes_pending == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to check scn_bytes_pending?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to match the check in dsl_scan_sync(). We don't necessarily need that checkdepending on what we want. Having it there basically just ensures that if the scan is running we still scan incoming dedup blocks, even if the scan if finishing. Note that the "finishing" part of scanning can still take several hours depending on the block sizes and RAM given to the dsl scan queues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which check in dsl_scan_sync()? And what exactly do you mean by "finishing"? Once scn_done_txg is set, could we still wait hours?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which check in dsl_scan_sync()?

else if (scn->scn_is_sorted && scn->scn_bytes_pending != 0) {

And what exactly do you mean by "finishing"? Once scn_done_txg is set, could we still wait hours?

Yes. scn_done_txg is set when the last block of the scan is placed on one of the per-vdev scanning queues. The scan is not complete, however until all of the queued I/O has been issued. Back when I was testing this code, I saw this take about an hour on my test server, but I imagine it could be longer with other configurations. This check basically says "we won't add anything more to the queues after the normal scanning has completed adding its I/Os to the queues".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I wasn't that familiar with the changes made due to sorted scan here. Once scn_done_txg is set, we've found all the blocks to scan (presumably including dedup blocks). Once that has happened, do we still need to do explicit scans in dsl_scan_ddt_entry()? I don't think so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. ill make the change.

@@ -7384,7 +7384,7 @@ main(int argc, char **argv)
* segments can be reconstructed in a reasonable amount of time
* when reconstruction is known to be possible.
*/
zfs_reconstruct_indirect_damage_fraction = 4;
zfs_reconstruct_indirect_damage_fraction = 100;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain this a bit more? why can't we do vdev_indirect_splits_damage() every time? Why is 1/100th of the time OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was from @behlendorf. We were seeing several cases where reconstruction was taking a very long time. In fact, with 16 segments we were seeing ztest deadman timeouts. We wanted to reduce this workload so ztest has time to test other things so we lowered the max damaged segments to 12 and decreased how often they would occur.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so this isn't an issue of correctness (too much damage causing it to be unrecoverable), just performance (too much damage causing us to spend too much time doing successful recovery).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly.

@@ -1632,7 +1632,7 @@ vdev_indirect_splits_damage(indirect_vsd_t *iv, zio_t *zio)
}

iv->iv_attempts_max *= 2;
if (iv->iv_attempts_max > (1ULL << 16)) {
if (iv->iv_attempts_max >= (1ULL << 12)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this related to the ztest fix? Or is this just making it run faster by not damaging when it would cause more than 2^12 tries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above. with the existing code we can have 17 damaged segments which was causing the ztest deadman to trigger fairly often.

@@ -3529,13 +3528,9 @@ ztest_device_removal(ztest_ds_t *zd, uint64_t id)
uint64_t guid;
int error;

mutex_enter(&ztest_removal_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the old code was effectively a tryenter, do we want to use that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was also from @behlendorf. I think the idea here was to make sure that this code path actually gets to run. As he mentioned, we might want to change this since a lot of these code paths are actually valid to race against each other in the actual kernel module.

Copy link
Contributor

@behlendorf behlendorf Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may, in fact as I mentioned in my comment, think we should hold off entirely on the boolean -> lock conversion patch for the moment. It's intention was to prevent racing between fault injection and removal, but we don't actually have clear evidence yet that it's needed.

@@ -533,10 +533,16 @@ _umem_logging_init(void)
static void
dump_debug_buffer(void)
{
if (!ztest_dump_debug_buffer)
ssize_t ret __attribute__((unused));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can discard the return value with (void) as we do elsewhere

Copy link
Contributor Author

@tcaputi tcaputi Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that. GCC still complains. The only way to get it to shut up was to save the return value. 🤷‍♂️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the error from gcc? (void) works in other places in this file (at least, for other functions)

Copy link
Contributor

@behlendorf behlendorf Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: ignoring return value of ‘ret’, declared with attribute warn_unused_result [-Wunused-result]

This is a discussion which goes way back https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66425. The short version is the behavior here isn't clearly defined by the standard and is compiler specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. Any idea why all the other (void)'s in this file work fine, but it won't work here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has something to do with the write() call and its attributes specifically, I believe.

Copy link
Contributor

@behlendorf behlendorf Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For several years now most of the standard glibc functions have had their prototypes updated to include the warn_unused_result function attribute, which forces the warning. Which is why we only see this in some places.

#define __wur __attribute_warn_unused_result__

extern ssize_t write (int __fd, const void *__buf, size_t __n) __wur;

edit: I should add that I'm not actually sure the other (void)'s in this file do anything, other than act as documentation. At least with gcc.

@@ -866,7 +873,7 @@ process_options(int argc, char **argv)
usage(B_FALSE);
break;
case 'G':
ztest_dump_debug_buffer = B_TRUE;
zo->zo_dump_dbgmsg = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change this from a boolean_t to an int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a minor thing. The variable in the struct is an int and I wanted to match zo->zo_mmp_test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change the variable from a boolean_t to an int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable needed to be moved into zo so that it could properly be shared between processes. This struct did not have any boolean fields, even for fields such as zo_mmp_test which are clearly logically boolean. I assumed this might have something to do with ensuring alignment so I was just following the precedent.

@@ -212,11 +215,28 @@ __dprintf(const char *file, const char *func, int line, const char *fmt, ...)
void
zfs_dbgmsg_print(const char *tag)
{
(void) printf("ZFS_DBGMSG(%s):\n", tag);
ssize_t ret __attribute__((unused));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(void)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.


/*
* Get rid of trailing newline.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? are we zfs_dbgmsg()-ing things that have newlines? Seems like we should fix that rather than changing the string after the fact (in some places but not others).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was taken directly from the zfs_debug.c code. I just wanted to ensure they behaved the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix zfs_debug.c too then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I think I see why this was done. dprintf() strings (by convention it seems) have a trailing newline while zfs_dbgmsg() strings do not. After looking around, I don't see any examples of multi-line calls to either of these functions. Not sure what the best way to handle this is....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps move this to the if dprintf section then, and change it to remove only a trailing newline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I can do that.

*/
nl = strrchr(buf, '\n');
if (nl != NULL)
*nl = '\0';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't just get rid of trailing newlines, it gets rid of the last newline, which may have more text after it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

@behlendorf
Copy link
Contributor

Closing. All but 53cb0bf were integrated.

@tcaputi tcaputi mentioned this pull request Oct 25, 2018
12 tasks
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
The zloop test has been failing in buildbot for the last few weeks
with various failures in ztest_deadman_thread(). This is due to the
fact that this thread is not stopped when performing pool import /
export tests as it should be. This patch simply corrects this.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
This patch corrects an ASSERT in zil_create() that will only be
true if the call to zio_alloc_zil() does not fail.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
This patch fixes a lock inversion issue in txg_sync_thread() where
the code would attempt hold the spa config lock as a reader while
holding tx->tx_sync_lock. This races with spa_vdev_remove() which
attempts to hold the tx->tx_sync_lock to assign a new tx (via
spa_history_log_internal()) while holding the spa config lock as a
writer.

Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Co-authored-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
This patch fixes an issue where ztest's deadman thread would
trigger a panic because reconstructing artifically damaged
blocks would take too long to reconstruct. This patch simply
limits how often ztest inflicts split-block damage and how
many segments it can damage when it does.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
This patch fixes a small issue where the zil_check_log_chain()
code path would hit an EBUSY error. This would occur when
2 threads attempted to call metaslab_activate() at the same time.
In this case, the "loser" would receive an error code which should
have been ignored, but was instead floated to the caller. This
ended up resulting in an ENXIO being returned from from
spa_ld_verify_logs().

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
spa->spa_vdev_removal is created in a sync task that is initiated
via dsl_sync_task_nowait(). Since the task may not run before
spa_vdev_remove() returns, we must wait at least 1 txg to ensure
that the removal struct has been created.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
The zloop test has been failing in buildbot for the last few weeks
with various failures in ztest_deadman_thread(). This is due to the
fact that this thread is not stopped when performing pool import /
export tests as it should be. This patch simply corrects this.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
This patch corrects an ASSERT in zil_create() that will only be
true if the call to zio_alloc_zil() does not fail.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
This patch fixes a lock inversion issue in txg_sync_thread() where
the code would attempt hold the spa config lock as a reader while
holding tx->tx_sync_lock. This races with spa_vdev_remove() which
attempts to hold the tx->tx_sync_lock to assign a new tx (via
spa_history_log_internal()) while holding the spa config lock as a
writer.

Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Co-authored-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
This patch fixes an issue where ztest's deadman thread would
trigger a panic because reconstructing artifically damaged
blocks would take too long to reconstruct. This patch simply
limits how often ztest inflicts split-block damage and how
many segments it can damage when it does.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
This patch fixes a small issue where the zil_check_log_chain()
code path would hit an EBUSY error. This would occur when
2 threads attempted to call metaslab_activate() at the same time.
In this case, the "loser" would receive an error code which should
have been ignored, but was instead floated to the caller. This
ended up resulting in an ENXIO being returned from from
spa_ld_verify_logs().

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
spa->spa_vdev_removal is created in a sync task that is initiated
via dsl_sync_task_nowait(). Since the task may not run before
spa_vdev_remove() returns, we must wait at least 1 txg to ensure
that the removal struct has been created.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Nov 4, 2018
This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
The zloop test has been failing in buildbot for the last few weeks
with various failures in ztest_deadman_thread(). This is due to the
fact that this thread is not stopped when performing pool import /
export tests as it should be. This patch simply corrects this.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch corrects an ASSERT in zil_create() that will only be
true if the call to zio_alloc_zil() does not fail.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch fixes a lock inversion issue in txg_sync_thread() where
the code would attempt hold the spa config lock as a reader while
holding tx->tx_sync_lock. This races with spa_vdev_remove() which
attempts to hold the tx->tx_sync_lock to assign a new tx (via
spa_history_log_internal()) while holding the spa config lock as a
writer.

Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Co-authored-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch fixes an issue where ztest's deadman thread would
trigger a panic because reconstructing artifically damaged
blocks would take too long to reconstruct. This patch simply
limits how often ztest inflicts split-block damage and how
many segments it can damage when it does.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch fixes a small issue where the zil_check_log_chain()
code path would hit an EBUSY error. This would occur when
2 threads attempted to call metaslab_activate() at the same time.
In this case, the "loser" would receive an error code which should
have been ignored, but was instead floated to the caller. This
ended up resulting in an ENXIO being returned from from
spa_ld_verify_logs().

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
spa->spa_vdev_removal is created in a sync task that is initiated
via dsl_sync_task_nowait(). Since the task may not run before
spa_vdev_remove() returns, we must wait at least 1 txg to ensure
that the removal struct has been created.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes openzfs#8010
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants