Checksum errors may not be counted #11545

ahrens · 2021-01-29T04:55:08Z

System information

Type	Version/Name
Distribution Name
Distribution Version
Linux Kernel
Architecture
ZFS Version	after `4f07282`
SPL Version

Describe the problem you're observing

If a block is damaged after being repaired once, when it is repaired for the second time, the checksum error is not reported. This causes confusion (e.g. while testing) because there is no visibility into the checksum errors that are being detected (and potentially corrected).

This is a change in behavior caused by #10861. I understand the desire to limit the rate of event generation since we keep so few of them. However:

This justification doesn't apply to the checksum error counts (vs_checksum_errors)- it doesn't cost anything to count to a large number.
after a block is repaired (e.g. by zpool scrub) or errors are discarded (zpool clear), it would be reasonable to report the error again (even to generate another event).

I'd suggest that we make at least one (and perhaps all) of the following changes:

always count the checksum errors
reset the "recent" errors when a scrub completes, so that newly-discovered errors will be logged and counted
reset the "recent" errors when zpool clear is run

Describe how to reproduce the problem

zpool create ... raidz ...
silently damage one disk (dd of=/dev/dsk/...)
zpool scrub
Scrub reports that it repaired some space, and vdev reports some checksum errors:

  scan: scrub repaired 1.00M in 00:00:03 with 0 errors on Fri Jan 29 04:32:40 2021
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0     0
	  raidz1-0                   ONLINE       0     0     0
	    /var/tmp/expand_vdevs/1  ONLINE       0     0    28
	    /var/tmp/expand_vdevs/2  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/3  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/4  ONLINE       0     0     0

silently damage one disk AGAIN (dd of=/dev/dsk/...)
zpool scrub AGAIN
Scrub reports that it repaired some space, BUT vdev reports no checksum errors:

  scan: scrub repaired 1.00M in 00:00:02 with 0 errors on Fri Jan 29 04:33:01 2021
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0     0
	  raidz1-0                   ONLINE       0     0     0
	    /var/tmp/expand_vdevs/1  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/2  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/3  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/4  ONLINE       0     0     0

Include any warning/errors/backtraces from the system logs

@don-brady @behlendorf

The text was updated successfully, but these errors were encountered:

don-brady · 2021-01-29T05:11:53Z

~~The issue with (1) is that it only takes 10 checksum errors over 10 mins to trigger a vdev degrade. So ideally we can continue to limit counting duplicates to avoid degrading a vdev.~~ For (1) some of the ZTS tests might need to account for a discrepancy in event count vs error counts (TBD).

I like the idea of having operations in (2) and (3) clear out the duplicate tracking cache so that new errors can be detected and counted.

behlendorf · 2021-02-01T20:18:28Z

Both (2) and (3) make sense to me. As for (1) I agree with @don-brady that we'll need to make sure that reporting all duplicates won't make the FMA / ZED code trigger happy.

This is a little out of scope, but @ahrens use of the term "recent" errors reminded me of this. It's come up a couple of times that the documentation isn't clear that the READ, WRITE, and CKSUM errors are in no way persistent. That is they're lost after an export/import. This is the long standing behavior, but I can see how users could find it unexpected . It'd be nice to update the documentation along with this change to make it clear only "recent errors" are reported and the counters are not persistent.

Fix regression seen in issue #11545 where checksum errors where not being counted or showing up in a zpool event. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #11609

Fix regression seen in issue openzfs#11545 where checksum errors where not being counted or showing up in a zpool event. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes openzfs#11609

ahrens added Type: Defect Incorrect behavior (e.g. crash, hang) Status: Triage Needed New issue which needs to be triaged labels Jan 29, 2021

don-brady self-assigned this Jan 29, 2021

behlendorf added Status: Understood The root cause of the issue is known and removed Status: Triage Needed New issue which needs to be triaged labels Feb 1, 2021

ahrens mentioned this issue Feb 11, 2021

checksum error not always reported in "zpool status" #11589

Open

don-brady mentioned this issue Feb 16, 2021

Checksum errors may not be counted #11609

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Checksum errors may not be counted #11545

Checksum errors may not be counted #11545

ahrens commented Jan 29, 2021

don-brady commented Jan 29, 2021 •

edited

Loading

behlendorf commented Feb 1, 2021

Checksum errors may not be counted #11545

Checksum errors may not be counted #11545

Comments

ahrens commented Jan 29, 2021

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

don-brady commented Jan 29, 2021 • edited Loading

behlendorf commented Feb 1, 2021

don-brady commented Jan 29, 2021 •

edited

Loading