
Remove the bio_empty_barrier() check #1318

Closed
wants to merge 1 commit

Conversation

dechamps
Contributor

NOTE: this branch needs 962020a in order to build correctly. See #1317. In addition, I did not check whether the resurrected flush code actually works correctly.

To determine whether the kernel is capable of handling empty barrier BIOs, we check for the presence of the bio_empty_barrier() macro, which was introduced in 2.6.24. If this macro is defined, then we can flush disk vdevs; if it isn't, then flushing is disabled.

Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37, even though the kernel is still capable of handling empty barrier BIOs.

As a result, flushing is effectively disabled on kernels >= 2.6.37, meaning that starting from this kernel version, zfs doesn't use barriers to guarantee on-disk data consistency. In other words, it behaves as if zfs_nocacheflush were set. This is quite bad and can lead to data corruption on power failure.

This patch fixes the issue by removing the configure check for bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.
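
For illustration, here is a minimal, self-contained sketch of the failure mode (the function name and output are made up; this is not the actual vdev_disk.c code): the flush path is only compiled in when configure puts HAVE_BIO_EMPTY_BARRIER into zfs_config.h, so once the probe stops matching newer kernels the flush quietly degrades to a no-op.

```c
/*
 * Hypothetical sketch of the configure-guard pattern.  vdev_disk_flush()
 * stands in for the real empty-barrier/flush submission code in ZoL.
 */
#include <errno.h>
#include <stdio.h>

static int vdev_disk_flush(void)
{
#ifdef HAVE_BIO_EMPTY_BARRIER
	/* The real code would build and submit an empty barrier/flush BIO here. */
	puts("flush issued");
	return 0;
#else
	/* Probe failed (as on >= 2.6.37): behaves as if zfs_nocacheflush were set. */
	return -ENOTSUP;
#endif
}

int main(void)
{
	if (vdev_disk_flush() != 0)
		fprintf(stderr, "cache flush silently disabled\n");
	return 0;
}
```

Compile it with -DHAVE_BIO_EMPTY_BARRIER to see the working path; without the define, the flush silently becomes unsupported, which is exactly what happened on kernels >= 2.6.37.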

Thanks to Richard Kojedzinszky for catching this nasty bug. Note that this could also explain why Phoronix was so skeptical in their benchmarks:

However, by the margins that ZFS was faster, one would have to wonder whether it is properly synchronizing to the disk.

@dechamps
Contributor Author

@behlendorf: I feel very bad knowing that some kernel change that went unnoticed just happened to disable critical corruption protection code in ZoL. When a new kernel is released, we should definitely do a diff check on zfs_config.h between the previous kernel version and the new one. This would prevent this kind of (quite serious) bug from going unnoticed.

@behlendorf
Contributor

@dechamps Bad, bad, bad. Thank you for jumping on this right away. I should have noticed the bio_empty_barrier() check was wrong when I was reviewing another 2.6.37 change in this exact same area, 96801d2. We'll have to be more careful when adding support for new kernels. Your idea about doing a diff check against zfs_config.h is a good one. There's probably a case to be made that we should shed all the legacy pre-2.6.26 compatibility code too.

Both patches look good to me. I'm running them through the automated testing now then I'll do some manual testing to make sure it's working as expected before I merge it.

@dechamps
Contributor Author

Both patches look good to me. I'm running them through the automated testing now then I'll do some manual testing to make sure it's working as expected before I merge it.

Yep. As I said, I didn't have time to test the patch (I just made sure it builds successfully). I'm not entirely sure, but it seems Richard is having issues with msync() when this patch is in place. This resembles #907.

@behlendorf
Contributor

The automated testing went well and using blktrace I was able to manually verify the cache flushes are being issued correctly. See the following output from a txg sync where the labels are being written. You can clearly see a sync being issued before the labels are written and then one immediately after.

This fix shouldn't have a huge impact on msync() times, certainly nothing that could be measured in seconds. It sounds like there's something else to investigate there.

  ...
>>>  8,32   0      569   100.876623318  5127  Q FWFS [z_ioctl_iss/0]
  8,32   0      570   100.876624401  5127  G FWFS [z_ioctl_iss/0]
  8,32   0      571   100.876625724  5127  P   N [z_ioctl_iss/0]
  8,32   0      572   100.876626397  5127  I FWFS [z_ioctl_iss/0]
  8,32   0      573   100.917887425  5076  A   W 2488 + 8 <- (8,33) 440
  8,32   0      574   100.917888250  5076  Q   W 2488 + 8 [z_null_iss/0]
  8,32   0      575   100.917889624  5076  G   W 2488 + 8 [z_null_iss/0]
  8,32   0      576   100.917890964  5076  P   N [z_null_iss/0]
  8,32   0      577   100.917891599  5076  I   W 2488 + 8 [z_null_iss/0]
  8,32   0      578   100.917944378  5076  A   W 3000 + 8 <- (8,33) 952
  8,32   0      579   100.917944711  5076  Q   W 3000 + 8 [z_null_iss/0]
  8,32   0      580   100.917945551  5076  G   W 3000 + 8 [z_null_iss/0]
  8,32   0      581   100.917946237  5076  I   W 3000 + 8 [z_null_iss/0]
  8,32   0      582   100.917996901  5076  A   W 5860515256 + 8 <- (8,33) 5860513208
  8,32   0      583   100.917997174  5076  Q   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      584   100.917997830  5076  G   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      585   100.917998394  5076  I   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      586   100.918074369  5076  A   W 5860515768 + 8 <- (8,33) 5860513720
  8,32   0      587   100.918074673  5076  Q   W 5860515768 + 8 [z_null_iss/0]
  8,32   0      588   100.918075391  5076  G   W 5860515768 + 8 [z_null_iss/0]
  8,32   0      589   100.918076181  5076  I   W 5860515768 + 8 [z_null_iss/0]
  8,32   0      590   100.918077145  5076  D   W 2488 + 8 [z_null_iss/0]
  8,32   0      591   100.918084689  5076  D   W 3000 + 8 [z_null_iss/0]
  8,32   0      592   100.918087262  5076  D   W 5860515256 + 8 [z_null_iss/0]
  8,32   0      593   100.918089783  5076  D   W 5860515768 + 8 [z_null_iss/0]
>>>  8,32   2     1105   100.912735821     0  C WFS 0 [0] 
  8,32   2     1106   100.918551143     0  C   W 2488 + 8 [0]
  8,32   2     1107   100.918698724     0  C   W 3000 + 8 [0]
  8,32   2     1108   100.918964391     0  C   W 5860515256 + 8 [0]
  8,32   2     1109   100.918970391     0  C   W 5860515768 + 8 [0]
>>>  8,32   0      594   100.923494176  5127  Q FWFS [z_ioctl_iss/0]
  8,32   0      595   100.923495281  5127  G FWFS [z_ioctl_iss/0]
  8,32   0      596   100.923496531  5127  P   N [z_ioctl_iss/0]
  8,32   0      597   100.923497185  5127  I FWFS [z_ioctl_iss/0]
>>>  8,32   2     1110   100.961024645     0  C WFS 0 [0]

@ryao
Contributor

ryao commented Feb 25, 2013

@dechamps I doubt that missing write barriers caused phoronix's criticisms. As far as I can tell, those criticisms involve benchmarks that measure read performance. With that said, thanks for catching this. This should be fixed ASAP.

That said, the uberblock history was devised to enable ZFS to recover on hardware that does not properly honor barriers, so kernels where barriers are currently broken still have that protection. I have done numerous unclean reboots without problems using ashift=12 with kernels >2.6.37, so that feature appears to be working properly. However, this might explain why corruption has been observed at ashift values greater than 13. At ashift=14, we are limited to an uberblock history of 8, which is probably not big enough to recover when barriers are broken.
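
For reference, a back-of-the-envelope calculation, assuming a 128 KiB uberblock ring per label and one slot of max(1 << ashift, 1 KiB) per uberblock (my reading of the on-disk format), shows how the history shrinks as ashift grows:

```c
/* Assumed layout: 128 KiB uberblock ring per label, one slot of
 * max(1 << ashift, 1 KiB) per uberblock. */
#include <stdio.h>

int main(void)
{
	const unsigned ring = 128 * 1024;
	for (unsigned ashift = 9; ashift <= 14; ashift++) {
		unsigned slot = 1u << (ashift > 10 ? ashift : 10);
		printf("ashift=%2u -> %2u uberblocks of history\n", ashift, ring / slot);
	}
	return 0;	/* ashift=12 gives 32 slots, ashift=14 gives only 8 */
}
```

Under these assumptions, ashift=12 still leaves 32 uberblocks to roll back through, while ashift=14 leaves only the 8 mentioned above.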

@dechamps
Contributor Author

Sure, ZFS is able to recover from broken flushes with regard to file system consistency. It still breaks application data consistency, however, as recently synced data won't find its way to the on-disk ZIL in the event of a power failure. In other words, fsync() is broken, as Richard's test shows. This is a big issue for transactional systems (e.g. databases) that rely on durable commits.
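
To make the failure mode concrete, here is a minimal sketch of the commit pattern such applications rely on (the file name and record are made up):

```c
/* A commit path that trusts fsync(): once fsync() returns, the application
 * acknowledges the transaction.  If the storage stack never issues a disk
 * cache flush, the record may still sit in the drive's volatile write cache
 * and be lost on power failure, even though fsync() reported success. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) { perror("open"); return 1; }

	const char record[] = "COMMIT txn 42\n";
	if (write(fd, record, sizeof(record) - 1) != (ssize_t)(sizeof(record) - 1)) {
		perror("write");
		return 1;
	}
	if (fsync(fd) != 0) {	/* the durability guarantee being relied on */
		perror("fsync");
		return 1;
	}
	puts("transaction acknowledged as durable");
	close(fd);
	return 0;
}
```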

@behlendorf
Contributor

Merged.

d9b0ebb Remove the bio_empty_barrier() check.
d75af3c Use -Werror for all kernel configure tests.

@andys

andys commented Feb 26, 2013

Does this affect all users? Is there any workaround other than recompiling?

@dechamps
Contributor Author

Does this affect all users?

All users running a kernel >= 2.6.37 and using the write cache on disk vdevs (which is usually the case).

Is there any workaround other than recompiling?

You could disable the write cache on your disk vdevs.

unya pushed a commit to unya/zfs that referenced this pull request Dec 13, 2013