Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checksum errors may not be counted #11545

Open
ahrens opened this issue Jan 29, 2021 · 2 comments
Open

Checksum errors may not be counted #11545

ahrens opened this issue Jan 29, 2021 · 2 comments
Assignees
Labels
Status: Understood The root cause of the issue is known Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@ahrens
Copy link
Member

ahrens commented Jan 29, 2021

System information

Type Version/Name
Distribution Name
Distribution Version
Linux Kernel
Architecture
ZFS Version after 4f07282
SPL Version

Describe the problem you're observing

If a block is damaged after being repaired once, when it is repaired for the second time, the checksum error is not reported. This causes confusion (e.g. while testing) because there is no visibility into the checksum errors that are being detected (and potentially corrected).

This is a change in behavior caused by #10861. I understand the desire to limit the rate of event generation since we keep so few of them. However:

  1. This justification doesn't apply to the checksum error counts (vs_checksum_errors)- it doesn't cost anything to count to a large number.
  2. after a block is repaired (e.g. by zpool scrub) or errors are discarded (zpool clear), it would be reasonable to report the error again (even to generate another event).

I'd suggest that we make at least one (and perhaps all) of the following changes:

  1. always count the checksum errors
  2. reset the "recent" errors when a scrub completes, so that newly-discovered errors will be logged and counted
  3. reset the "recent" errors when zpool clear is run

Describe how to reproduce the problem

zpool create ... raidz ...
silently damage one disk (dd of=/dev/dsk/...)
zpool scrub
Scrub reports that it repaired some space, and vdev reports some checksum errors:

  scan: scrub repaired 1.00M in 00:00:03 with 0 errors on Fri Jan 29 04:32:40 2021
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0     0
	  raidz1-0                   ONLINE       0     0     0
	    /var/tmp/expand_vdevs/1  ONLINE       0     0    28
	    /var/tmp/expand_vdevs/2  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/3  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/4  ONLINE       0     0     0

silently damage one disk AGAIN (dd of=/dev/dsk/...)
zpool scrub AGAIN
Scrub reports that it repaired some space, BUT vdev reports no checksum errors:

  scan: scrub repaired 1.00M in 00:00:02 with 0 errors on Fri Jan 29 04:33:01 2021
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0     0
	  raidz1-0                   ONLINE       0     0     0
	    /var/tmp/expand_vdevs/1  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/2  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/3  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/4  ONLINE       0     0     0

Include any warning/errors/backtraces from the system logs

@don-brady @behlendorf

@ahrens ahrens added Type: Defect Incorrect behavior (e.g. crash, hang) Status: Triage Needed New issue which needs to be triaged labels Jan 29, 2021
@don-brady
Copy link
Contributor

don-brady commented Jan 29, 2021

The issue with (1) is that it only takes 10 checksum errors over 10 mins to trigger a vdev degrade. So ideally we can continue to limit counting duplicates to avoid degrading a vdev. For (1) some of the ZTS tests might need to account for a discrepancy in event count vs error counts (TBD).

I like the idea of having operations in (2) and (3) clear out the duplicate tracking cache so that new errors can be detected and counted.

@don-brady don-brady self-assigned this Jan 29, 2021
@behlendorf behlendorf added Status: Understood The root cause of the issue is known and removed Status: Triage Needed New issue which needs to be triaged labels Feb 1, 2021
@behlendorf
Copy link
Contributor

Both (2) and (3) make sense to me. As for (1) I agree with @don-brady that we'll need to make sure that reporting all duplicates won't make the FMA / ZED code trigger happy.

This is a little out of scope, but @ahrens use of the term "recent" errors reminded me of this. It's come up a couple of times that the documentation isn't clear that the READ, WRITE, and CKSUM errors are in no way persistent. That is they're lost after an export/import. This is the long standing behavior, but I can see how users could find it unexpected . It'd be nice to update the documentation along with this change to make it clear only "recent errors" are reported and the counters are not persistent.

behlendorf pushed a commit that referenced this issue Feb 20, 2021
Fix regression seen in issue #11545 where checksum errors 
where not being counted or showing up in a zpool event.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #11609
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
Fix regression seen in issue openzfs#11545 where checksum errors 
where not being counted or showing up in a zpool event.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes openzfs#11609
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021
Fix regression seen in issue openzfs#11545 where checksum errors 
where not being counted or showing up in a zpool event.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes openzfs#11609
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Understood The root cause of the issue is known Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

3 participants