
Remove the bio_empty_barrier() check #1318

Closed
wants to merge 1 commit

Conversation

dechamps
Contributor

NOTE: this branch needs 962020a in order to build correctly. See #1317. In addition, I did not check whether the resurrected flush code actually works correctly.

To determine whether the kernel is capable of handling empty barrier BIOs, we check for the presence of the bio_empty_barrier() macro, which was introduced in 2.6.24. If this macro is defined, then we can flush disk vdevs; if it isn't, then flushing is disabled.

Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37, even though the kernel is still capable of handling empty barrier BIOs.

As a result, flushing is effectively disabled on kernels >= 2.6.37, meaning that starting from this kernel version, zfs doesn't use barriers to guarantee on-disk data consistency. In other words, it behaves as if zfs_nocacheflush were set. This is quite bad and can lead to data corruption on power failure.

This patch fixes the issue by removing the configure check for bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.
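
For illustration, here is a minimal, self-contained sketch of the failure mode (the function name and output are made up; this is not the actual vdev_disk.c code): the flush path is only compiled in when configure puts HAVE_BIO_EMPTY_BARRIER into zfs_config.h, so once the probe stops matching newer kernels the flush quietly degrades to a no-op.

```c
/*
 * Hypothetical sketch of the configure-guard pattern.  vdev_disk_flush()
 * stands in for the real empty-barrier/flush submission code in ZoL.
 */
#include <errno.h>
#include <stdio.h>

static int vdev_disk_flush(void)
{
#ifdef HAVE_BIO_EMPTY_BARRIER
	/* The real code would build and submit an empty barrier/flush BIO here. */
	puts("flush issued");
	return 0;
#else
	/* Probe failed (as on >= 2.6.37): behaves as if zfs_nocacheflush were set. */
	return -ENOTSUP;
#endif
}

int main(void)
{
	if (vdev_disk_flush() != 0)
		fprintf(stderr, "cache flush silently disabled\n");
	return 0;
}
```

Compile it with -DHAVE_BIO_EMPTY_BARRIER to see the working path; without the define, the flush silently becomes unsupported, which is exactly what happened on kernels >= 2.6.37.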

Thanks to Richard Kojedzinszky for catching this nasty bug. Note that this could also explain why Phoronix was so skeptical in their benchmarks:

However, by the margins that ZFS was faster, one would have to wonder whether it is properly synchronizing to the disk.

@dechamps
Contributor Author

@behlendorf: I feel very bad knowing that some kernel change that went unnoticed just happened to disable critical corruption protection code in ZoL. When a new kernel is released, we should definitely do a diff check on zfs_config.h between the previous kernel version and the new one. This would prevent this kind of (quite serious) bug from going unnoticed.

@behlendorf
Contributor

@dechamps Bad, bad, bad. Thank you for jumping on this right away. I should have noticed the bio_empty_barrier() check was wrong when I was reviewing another 2.6.37 change in this exact same area, 96801d2. We'll have to be more careful when adding support for new kernels. Your idea about doing a diff check against zfs_config.h is a good one. There's probably a case to be made that we should shed all the legacy pre-2.6.26 compatibility code too.

Both patches look good to me. I'm running them through the automated testing now then I'll do some manual testing to make sure it's working as expected before I merge it.

@dechamps
Contributor Author

Both patches look good to me. I'm running them through the automated testing now then I'll do some manual testing to make sure it's working as expected before I merge it.

Yep. As I said, I didn't have time to test the patch (I just made sure it builds successfully). I'm not entirely sure, but it seems Richard is having issues with msync() when this patch is in place. This resembles #907.

@behlendorf
Contributor

The automated testing went well and using blktrace I was able to manually verify the cache flushes are being issued correctly. See the following output from a txg sync where the labels are being written. You can clearly see a sync being issued before the labels are written and then one immediately after.

This fix shouldn't have a huge impact on msync() times, certainly nothing that could be measured in seconds. It sounds like there's something else to investigate there.

  ...
>>>  8,32   0      569   100.876623318  5127  Q FWFS [z_ioctl_iss/0]
  8,32   0      570   100.876624401  5127  G FWFS [z_ioctl_iss/0]
  8,32   0      571   100.876625724  5127  P   N [z_ioctl_iss/0]
  8,32   0      572   100.876626397  5127  I FWFS [z_ioctl_iss/0]
  8,32   0      573   100.917887425  5076  A   W 2488 + 8 <- (8,33) 440
  8,32   0      574   100.917888250  5076  Q   W 2488 + 8 [z_null_iss/0]
  8,32   0      575   100.917889624  5076  G   W 2488 + 8 [z_null_iss/0]
  8,32   0      576   100.917890964  5076  P   N [z_null_iss/0]
  8,32   0      577   100.917891599  5076  I   W 2488 + 8 [z_null_iss/0]
  8,32   0      578   100.917944378  5076  A   W 3000 + 8 <- (8,33) 952
  8,32   0      579   100.917944711  5076  Q   W 3000 + 8 [z_null_iss/0]
  8,32   0      580   100.917945551  5076  G   W 3000 + 8 [z_null_iss/0]
  8,32   0      581   100.917946237  5076  I   W 3000 + 8 [z_null_iss/0]
  8,32   0      582   100.917996901  5076  A   W 5860515256 + 8 <- (8,33) 5860513208
  8,32   0      583   100.917997174  5076  Q   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      584   100.917997830  5076  G   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      585   100.917998394  5076  I   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      586   100.918074369  5076  A   W 5860515768 + 8 <- (8,33) 5860513720
  8,32   0      587   100.918074673  5076  Q   W 5860515768 + 8 [z_null_iss/0]
  8,32   0      588   100.918075391  5076  G   W 5860515768 + 8 [z_null_iss/0]
  8,32   0      589   100.918076181  5076  I   W 5860515768 + 8 [z_null_iss/0]
  8,32   0      590   100.918077145  5076  D   W 2488 + 8 [z_null_iss/0]
  8,32   0      591   100.918084689  5076  D   W 3000 + 8 [z_null_iss/0]
  8,32   0      592   100.918087262  5076  D   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      593   100.918089783  5076  D   W 5860515768 + 8 [z_null_iss/0]
>>>  8,32   2     1105   100.912735821     0  C WFS 0 [0] 
  8,32   2     1106   100.918551143     0  C   W 2488 + 8 [0]
  8,32   2     1107   100.918698724     0  C   W 3000 + 8 [0]
  8,32   2     1108   100.918964391     0  C   W 5860515256 + 8 [0]
  8,32   2     1109   100.918970391     0  C   W 5860515768 + 8 [0]
>>>  8,32   0      594   100.923494176  5127  Q FWFS [z_ioctl_iss/0]
  8,32   0      595   100.923495281  5127  G FWFS [z_ioctl_iss/0]
  8,32   0      596   100.923496531  5127  P   N [z_ioctl_iss/0]
  8,32   0      597   100.923497185  5127  I FWFS [z_ioctl_iss/0]
>>>  8,32   2     1110   100.961024645     0  C WFS 0 [0]

@ryao
Contributor

ryao commented Feb 25, 2013

@dechamps I doubt that missing write barriers caused phoronix's criticisms. As far as I can tell, those criticisms involve benchmarks that measure read performance. With that said, thanks for catching this. This should be fixed ASAP.

That said, the uberblock history was devised to enable ZFS to recover on hardware that does not properly honor barriers, so kernels where barriers are currently broken still have that protection. I have done numerous unclean reboots without problems using ashift=12 with kernels >2.6.37, so that feature appears to be working properly. However, this might explain why corruption has been observed at ashift values greater than 13. At ashift=14, we are limited to an uberblock history of 8, which is probably not big enough to recover when barriers are broken.
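
For reference, a back-of-the-envelope calculation, assuming a 128 KiB uberblock ring per label and one slot of max(1 << ashift, 1 KiB) per uberblock (my reading of the on-disk format), shows how the history shrinks as ashift grows:

```c
/* Assumed layout: 128 KiB uberblock ring per label, one slot of
 * max(1 << ashift, 1 KiB) per uberblock. */
#include <stdio.h>

int main(void)
{
	const unsigned ring = 128 * 1024;
	for (unsigned ashift = 9; ashift <= 14; ashift++) {
		unsigned slot = 1u << (ashift > 10 ? ashift : 10);
		printf("ashift=%2u -> %2u uberblocks of history\n", ashift, ring / slot);
	}
	return 0;	/* ashift=12 gives 32 slots, ashift=14 gives only 8 */
}
```

Under these assumptions, ashift=12 still leaves 32 uberblocks to roll back through, while ashift=14 leaves only the 8 mentioned above.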

@dechamps
Contributor Author

Sure, ZFS is able to recover from broken flushes with regard to file system consistency. It still breaks application data consistency, however, as recently synced data won't find its way to the on-disk ZIL in the event of a power failure. In other words, fsync() is broken, as Richard's test shows. This is a big issue for transactional systems (e.g. databases) that rely on durable commits.
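
To make the failure mode concrete, here is a minimal sketch of the commit pattern such applications rely on (the file name and record are made up):

```c
/* A commit path that trusts fsync(): once fsync() returns, the application
 * acknowledges the transaction.  If the storage stack never issues a disk
 * cache flush, the record may still sit in the drive's volatile write cache
 * and be lost on power failure, even though fsync() reported success. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) { perror("open"); return 1; }

	const char record[] = "COMMIT txn 42\n";
	if (write(fd, record, sizeof(record) - 1) != (ssize_t)(sizeof(record) - 1)) {
		perror("write");
		return 1;
	}
	if (fsync(fd) != 0) {	/* the durability guarantee being relied on */
		perror("fsync");
		return 1;
	}
	puts("transaction acknowledged as durable");
	close(fd);
	return 0;
}
```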

@behlendorf
Contributor

Merged.

d9b0ebb Remove the bio_empty_barrier() check.
d75af3c Use -Werror for all kernel configure tests.

@andys

andys commented Feb 26, 2013

Does this affect all users? Is there any workaround other than recompiling?

@dechamps
Contributor Author

Does this affect all users?

All users running a kernel >= 2.6.37 and using the write cache on disk vdevs (which is usually the case).

Is there any workaround other than recompiling?

You could disable the write cache on your disk vdevs.

unya pushed a commit to unya/zfs that referenced this pull request Dec 13, 2013