
Intel QAT support when root is on a ZFS filesystem #8323

Closed
geppi opened this issue Jan 21, 2019 · 11 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@geppi
Contributor

geppi commented Jan 21, 2019

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  9.5
Linux Kernel          4.15.18
Architecture          amd64
ZFS Version           0.7.12
SPL Version           0.7.12

This feature request is about enabling Intel Quick Assist Technology (QAT) support when booting from a ZFS filesystem.

Background

Intel Quick Assist Technology (QAT) support was initially introduced to ZFS for hardware-accelerated compression (#5846).
It was later extended to accelerate AES-GCM encryption (#7282)
and SHA256 and SHA512 checksums (#7295).

The current implementation initializes QAT support for ZFS when the ZFS kernel module is loaded.
A prerequisite for this initialization to succeed is that the Intel QAT driver has already been initialized.
If it has not, QAT support for ZFS remains unavailable, even if the QAT driver becomes available at a later stage.
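For illustration, the coupling described above looks roughly like this (a minimal sketch; the flag and helper names are illustrative, not the exact ZFS symbols):

static boolean_t qat_init_done = B_FALSE;

/*
 * Called once from the ZFS module-load path. If the QAT driver has
 * not been initialized yet, the flag stays B_FALSE and is never
 * re-evaluated, so QAT acceleration stays off for the whole session.
 */
static void
qat_init(void)
{
        if (qat_driver_is_up())         /* hypothetical readiness check */
                qat_init_done = B_TRUE;
}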

Issue

When booting a system from a ZFS filesystem, the ZFS kernel module is loaded at an early stage from the initramfs in order to mount the root filesystem.
Currently the initramfs does not perform the QAT driver initialization required by the QAT support initialization routine in the ZFS kernel module.
Therefore the initialization of QAT support for ZFS always fails when the root filesystem is ZFS.

Motivation

Using ZFS as the root filesystem is very desirable because its copy-on-write design guarantees filesystem integrity, and because redundancy can be implemented by mirroring the root pool across multiple disks.

Currently it is not possible to have both at the same time: a ZFS root filesystem and QAT support for ZFS.

@cfzhu
Contributor

cfzhu commented Jan 23, 2019

@wli5 Hi @geppi, thanks very much for your work.
For this issue, I changed when qat_dc_init() and qat_crypt_init() are called: they are now invoked lazily from qat_dc_use_accel() and qat_checksum_use_accel().

When the QAT driver is "DOWN", the data is processed in software.
When the QAT driver is "UP", QAT support can be enabled with
"echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable".

I have committed the changes to my repository https://github.com/cfzhu/zfs.git, on the branch "qat".

I do not know whether these changes are sufficient; could you run some tests on them? Thank you.

@wli5
Contributor

wli5 commented Jan 24, 2019

Thanks @cfzhu
@geppi please verify the changes; if they work, we can create a PR to merge them into the main branch.

@geppi
Contributor Author

geppi commented Jan 24, 2019

@cfzhu Thank you! That was quick.
@wli5 I ran a short test and it works.
I now get QAT-accelerated gzip compression on a system with root on ZFS.
I will run a few more tests over the weekend and report back.

@geppi
Contributor Author

geppi commented Feb 5, 2019

@cfzhu @wli5 Sorry for the delay in getting back to you, but I was fighting a strange issue while testing the new code. There seems to be a problem with the checksum acceleration. Let me demonstrate.

First, with checksum acceleration disabled, I create a pool:

root@zfs-dev:~# cat /sys/module/zfs/parameters/zfs_qat_checksum_disable
1
root@zfs-dev:~# zpool create Optane /dev/disk/by-id/nvme-INTEL_SSDPE21D280GA_PHM273910059280AGN
root@zfs-dev:~# zpool status Optane
  pool: Optane
 state: ONLINE
  scan: none requested
config:

        NAME                                           STATE     READ WRITE CKSUM
        Optane                                         ONLINE       0     0     0
          nvme-INTEL_SSDPE21D280GA_PHM273910059280AGN  ONLINE       0     0     0

errors: No known data errors

I can export the pool and re-import it without problems:

root@zfs-dev:~# zpool export Optane
root@zfs-dev:~# zpool import Optane
root@zfs-dev:~# zpool status Optane
  pool: Optane
 state: ONLINE
  scan: none requested
config:

        NAME                                           STATE     READ WRITE CKSUM
        Optane                                         ONLINE       0     0     0
          nvme-INTEL_SSDPE21D280GA_PHM273910059280AGN  ONLINE       0     0     0

errors: No known data errors

As soon as I enable checksum acceleration, something strange happens:
the pool can no longer be imported because it is reported as corrupted.

root@zfs-dev:~# zpool export Optane
root@zfs-dev:~# echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
root@zfs-dev:~# zpool import
   pool: Optane
     id: 3133514996230602287
  state: FAULTED
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
 config:

        Optane                                         FAULTED  corrupted data
          nvme-INTEL_SSDPE21D280GA_PHM273910059280AGN  UNAVAIL  corrupted data
root@zfs-dev:~# zpool import Optane
cannot import 'Optane': one or more devices is currently unavailable

At this stage I can still restore a sane state simply by disabling checksum acceleration:

root@zfs-dev:~# echo 1 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
root@zfs-dev:~# zpool import Optane
root@zfs-dev:~# zpool status
  pool: Optane
 state: ONLINE
  scan: none requested
config:

        NAME                                           STATE     READ WRITE CKSUM
        Optane                                         ONLINE       0     0     0
          nvme-INTEL_SSDPE21D280GA_PHM273910059280AGN  ONLINE       0     0     0

errors: No known data errors

But if I enable checksum acceleration now on the imported pool, it gets corrupted once and for all:

root@zfs-dev:~# echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
root@zfs-dev:~# zpool status
  pool: Optane
 state: ONLINE
  scan: none requested
config:

        NAME                                           STATE     READ WRITE CKSUM
        Optane                                         ONLINE       0     0     0
          nvme-INTEL_SSDPE21D280GA_PHM273910059280AGN  ONLINE       0     0     0

errors: No known data errors
root@zfs-dev:~# zpool export Optane
root@zfs-dev:~# zpool import Optane
cannot import 'Optane': one or more devices is currently unavailable
root@zfs-dev:~# echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
root@zfs-dev:~# zpool import Optane
cannot import 'Optane': one or more devices is currently unavailable
root@zfs-dev:~# zpool import
   pool: Optane
     id: 3133514996230602287
  state: FAULTED
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
 config:

        Optane                                         FAULTED  corrupted data
          nvme-INTEL_SSDPE21D280GA_PHM273910059280AGN  UNAVAIL  corrupted data

Disabling checksum acceleration no longer helps.
As you can see, I didn't actually write any user data to the pool.
The zpool status command alone was sufficient to corrupt it.

The same happens if the pool is created with checksum acceleration enabled from the start.

You can work with the pool as long as it remains in its initial imported state;
writing data to and reading data from the pool works.
But once the pool is exported, it is rendered corrupted.

It looks like something is wrong with the pool metadata: I didn't even expect the SHA256 acceleration to be used, because the default checksum algorithm is fletcher4, as far as I remember. This matches the fact that the QAT checksum counters in /proc/spl/kstat/zfs/qat do not increase when reading or writing data from/to the pool.
They only increase by a small amount when a zpool command is issued against the pool.

I would be very interested to understand what exactly is happening to the pool here, because I was a little incautious when testing the new code and now have a mirrored pool containing data I would love to get back.
I wonder whether recovery could be made possible with a tweaked ZFS build that ignores the seemingly corrupted checksums. Or at least I hope it is just the checksums and not the pool metadata itself.

@cfzhu
Contributor

cfzhu commented Feb 12, 2019

Hi @geppi, I tried to reproduce the problem on my server, but ZFS worked fine:

root@qat-server-210:~# cat /sys/module/zfs/parameters/zfs_qat_checksum_disable
1
root@qat-server-210:~# zpool create Optane /dev/disk/by-id/nvme-INTEL_SSDPEDMD016T4_CVFT4476002H1P6DGN-part1
root@qat-server-210:~# zpool status Optane
  pool: Optane
 state: ONLINE
  scan: none requested
config:

        NAME                                                 STATE     READ WRITE CKSUM
        Optane                                               ONLINE       0     0     0
          nvme-INTEL_SSDPEDMD016T4_CVFT4476002H1P6DGN-part1  ONLINE       0     0     0

errors: No known data errors
root@qat-server-210:~# zpool export Optane
root@qat-server-210:~# zpool import Optane
root@qat-server-210:~# zpool status Optane
  pool: Optane
 state: ONLINE
  scan: none requested
config:

        NAME                                                 STATE     READ WRITE CKSUM
        Optane                                               ONLINE       0     0     0
          nvme-INTEL_SSDPEDMD016T4_CVFT4476002H1P6DGN-part1  ONLINE       0     0     0

errors: No known data errors
root@qat-server-210:~# zpool export Optane
root@qat-server-210:~# echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
root@qat-server-210:~# zpool import
   pool: Optane
     id: 2504511189125541489
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        Optane                                               ONLINE
          nvme-INTEL_SSDPEDMD016T4_CVFT4476002H1P6DGN-part1  ONLINE
root@qat-server-210:~# zpool import Optane
root@qat-server-210:~# zpool status
  pool: Optane
 state: ONLINE
  scan: none requested
config:

        NAME                                                 STATE     READ WRITE CKSUM
        Optane                                               ONLINE       0     0     0
          nvme-INTEL_SSDPEDMD016T4_CVFT4476002H1P6DGN-part1  ONLINE       0     0     0

errors: No known data errors
root@qat-server-210:~# cat /sys/module/zfs/parameters/zfs_qat_checksum_disable
0

Is there something wrong with my steps?
Could you provide me with more detailed information?
Thanks very much.

@geppi
Contributor Author

geppi commented Feb 12, 2019

Hello @cfzhu, up to that point everything was OK on my side as well. As I described in my last post, the problem started after I exported the zpool while checksum acceleration was turned on.
So did you run zpool export Optane immediately after the steps listed above, and did you then try to do anything with the pool, like importing it again? That's where the problems started for me.
As long as acceleration was disabled, importing/exporting and working with the pool worked fine.
I listed those initial steps only to show that everything is fine as long as acceleration is disabled.
Your listing above is missing the crucial step of exporting the pool with acceleration enabled.
Unfortunately I don't currently have more details. To be frank, I have no idea why this is happening.
Looking at the code, I would not expect your changes to have any negative effect compared to the implementation that performs the QAT initialization only once, when the ZFS kernel modules are loaded.

Since I assume the checksum acceleration itself works well and has been thoroughly tested in the code version without these changes, I can only conclude that it is a problem with my particular setup.

I'm running this on a system with a Denverton C3758 SoC and will investigate further.
I have started looking at it with a debugger to see why the import fails once the zpool has been exported with acceleration enabled. I had hoped to identify which part of the pool metadata is corrupted, but the crucial steps of course happen in the ZFS kernel module, after the zfs ioctl to import the pool is issued.

So I have started to dive into kernel module debugging and am currently building the setup for it.
However, I have to do this in my private time and have never done it before, so progress is a little slow.

Nevertheless, maybe you could repeat the steps above and simply add another export and import of the pool at the end, to see whether you can replicate the problem? Preferably on a Denverton SoC?

@geppi
Contributor Author

geppi commented Feb 12, 2019

@cfzhu Sorry, I should read my own posts more carefully. What I said above is wrong: the procedure you performed should in fact already have shown the problem. As I wrote in my initial post, I cannot import the pool as soon as checksum acceleration is turned on, so the zpool import right after enabling acceleration should already have shown the pool as FAULTED.

Obviously you cannot replicate the problem on your side, so it must be something related to my particular setup. This problem is driving me crazy.

@wli5
Contributor

wli5 commented Feb 18, 2019

Hi @geppi, I just came back from vacation. Are there any updates from your side?

@geppi
Contributor Author

geppi commented Mar 8, 2019

@cfzhu @wli5 I now have a little more information, although it doesn't explain why you can't reproduce the problem on your side.
Still, maybe you can make something of the following.

First, I tried to figure out what is wrong with a zpool that was exported while checksum acceleration was turned on.
Since I used default pool properties in all my experiments, the checksum algorithm for all data is fletcher_4, which doesn't use QAT acceleration.
The problem must therefore be in the pool metadata, which uses SHA256 for labels and gang block headers.
I modified zdb as proposed in #2509 to also report checksum information for vdev labels and uberblocks.
It turns out that for a zpool that was just created and immediately exported, all uberblock checksums are OK but the checksums of all four vdev labels are wrong.
Looking at the code, I would assume the uberblocks were processed by the standard non-accelerated SHA256 routine because of the size constraint in qat_checksum_use_accel():

s_len >= QAT_MIN_BUF_SIZE &&
s_len <= QAT_MAX_BUF_SIZE

which requires a data size between 4K and 128K for QAT acceleration to be used.
The vdev label payload, on the other hand, is 112K and is therefore processed by qat_checksum().
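In numbers, a standalone restatement of that size gate (a compilable sketch; the 1K uberblock size is my assumption):

#include <assert.h>
#include <stddef.h>

#define QAT_MIN_BUF_SIZE (4 * 1024)     /* 4K, per the fragment above */
#define QAT_MAX_BUF_SIZE (128 * 1024)   /* 128K */

static int
size_gate(size_t s_len)
{
        return (s_len >= QAT_MIN_BUF_SIZE && s_len <= QAT_MAX_BUF_SIZE);
}

int
main(void)
{
        assert(!size_gate(1024));        /* uberblock (1K): software SHA256 */
        assert(size_gate(112 * 1024));   /* label payload (112K): qat_checksum() */
        return (0);
}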
The following is the output of my modified zdb for the four vdev labels. "Expected Cksum" is the value read from the on-disk label, as written by the qat_checksum() routine, and "Actual Cksum" is the 'proper' SHA256 checksum of the actual label data, calculated by the non-accelerated SHA256 routine.

LABEL CHECKSUM ERROR !!!
Expected Cksum: 903af20700c9ffff e756afc0ffffffff 38074d239088ffff 00c0010000000000
Actual Cksum:   cc531b885760cd9a 20f3ea89b3c5e17a 2c4b6de84e872ca5 8a526d886e16b4c7

LABEL CHECKSUM ERROR !!!
Expected Cksum: 903af20700c9ffff e756afc0ffffffff e8054d239088ffff 00c0010000000000
Actual Cksum:   91e07a83772773ef 2d9ae40b50852ee1 553e2ffd1c07bd0f 2c0ed55eccd5881b

LABEL CHECKSUM ERROR !!!
Expected Cksum: 903af20700c9ffff e756afc0ffffffff 48034d239088ffff 00c0010000000000
Actual Cksum:   4a1aa65216d5b186 d9e9f5818dc5fc43 63367ebd15dacd83 f6306ba2e97cbf8a

LABEL CHECKSUM ERROR !!!
Expected Cksum: 903af20700c9ffff e756afc0ffffffff 30094d239088ffff 00c0010000000000
Actual Cksum:   290ece31e0a05a14 86942a2aaca0dc6f 4b572dd325342706 5806b096bb419c92

That would explain why the pool can no longer be imported even when checksum acceleration is turned off.
To see what's going on while checksum acceleration is turned on, I ran the kernel under gdb/kgdb and set a breakpoint in zio_checksum_error_impl() at line 517, i.e. after the expected_cksum value has been extracted from the vdev label and the actual_cksum value has been calculated from the label data by the qat_checksum() routine.

(gdb) p expected_cksum
$20 = {zc_word = {903af20700c9ffff, e756afc0ffffffff, 38074d239088ffff, 00c0010000000000}}
(gdb) p actual_cksum
$21 = {zc_word = {f039d50600c9ffff, e72696c0ffffffff, f841292f9088ffff, 00c0010000000000}}

(gdb) p expected_cksum
$23 = {zc_word = {903af20700c9ffff, e756afc0ffffffff, e8054d239088ffff, 00c0010000000000}}
(gdb) p actual_cksum
$24 = {zc_word = {f039d50600c9ffff, e72696c0ffffffff, f841292f9088ffff, 00c0010000000000}}

(gdb) p expected_cksum
$26 = {zc_word = {903af20700c9ffff, e756afc0ffffffff, 48034d239088ffff, 00c0010000000000}}
(gdb) p actual_cksum
$27 = {zc_word = {f039d50600c9ffff, e72696c0ffffffff, f841292f9088ffff, 00c0010000000000}}

(gdb) p expected_cksum
$29 = {zc_word = {903af20700c9ffff, e756afc0ffffffff, 30094d239088ffff, 00c0010000000000}}
(gdb) p actual_cksum
$30 = {zc_word = {f039d50600c9ffff, e72696c0ffffffff, f841292f9088ffff, 00c0010000000000}}

As you can see, the checksum values calculated now are different from the ones calculated when the label was last written!

I couldn't drill down deeper into the actual checksum calculation routines, partly because I lack detailed knowledge of the QAT checksum internals, but also because the kernel didn't like being stopped and resumed inside those routines, so the system froze very quickly.

A notable observation is that "00c0010000000000" is a recurring pattern in the calculated checksum values. The values above are from a zpool created on a 1GB file.
For a pool created on an Optane device, my modified zdb delivered, for example, this:

LABEL CHECKSUM ERROR !!!
Expected Cksum: 00c0010000000000 00c0010000000000 e87af1a4c4a6ffff d07af1a4c4a6ffff
Actual Cksum:   ce02a1505dd3eb33 4932bbec8caa717e fbd1dbb6b865feec d809074a98f0ee1b

LABEL CHECKSUM ERROR !!!
Expected Cksum: 00c0010000000000 00c0010000000000 e87af1a4c4a6ffff d07af1a4c4a6ffff
Actual Cksum:   5cbd1fe20ef0223f a4ca852d6f526bdf 1430b40a2d7558e7 714111f473cc9e4f

LABEL CHECKSUM ERROR !!!
Expected Cksum: 00c0010000000000 00c0010000000000 e87af1a4c4a6ffff d07af1a4c4a6ffff
Actual Cksum:   5182063a1f81e8e5 d22688a7e77a2f0a 236f21fb3c889157 0ef889e4cb2e2c4e

LABEL CHECKSUM ERROR !!!
Expected Cksum: 00c0010000000000 00c0010000000000 e87af1a4c4a6ffff d07af1a4c4a6ffff
Actual Cksum:   1eae7c0a2c4b8993 6ea115bbac730567 0285522b97ca8beb 25a2e6dd6f0f424e

Not only does the "00c0010000000000" pattern appear again, the checksums for all four vdev labels are also identical!

As already noted, none of this explains why only I am seeing this problem.
Could it make a difference that I'm running all this on an Atom C3758 SoC on a Supermicro A2SDi-H-TF motherboard?

@wli5
Contributor

wli5 commented Mar 19, 2019

Hi @geppi, we eventually reproduced the problem and confirmed it is a real issue. The buffer "Cpa8U digest_buffer[sizeof (zio_cksum_t)]" on the qat_checksum() stack is not guaranteed to be physically contiguous or to stay within a single page, so the DMA of the digest result into this buffer can fail. We will create a PR to fix it. Thanks for reporting the problem!
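Sketched, the fix this implies looks something like the following (QAT_PHYS_CONTIG_ALLOC is named in the commit merged below; QAT_PHYS_CONTIG_FREE, the zcp output parameter and the surrounding control flow are assumptions):

/*
 * Before (problematic): the digest DMA target lives on the kernel
 * stack, which need not be physically contiguous nor within one page:
 *
 *      Cpa8U digest_buffer[sizeof (zio_cksum_t)];
 *
 * After: allocate the buffer in physically contiguous memory so the
 * device can safely DMA the digest into it.
 */
Cpa8U *digest_buffer = NULL;

if (QAT_PHYS_CONTIG_ALLOC(&digest_buffer, sizeof (zio_cksum_t)) !=
    CPA_STATUS_SUCCESS)
        return (CPA_STATUS_FAIL);

/* ... submit the checksum request; hardware writes the digest ... */

bcopy(digest_buffer, zcp, sizeof (zio_cksum_t));        /* copy digest out */
QAT_PHYS_CONTIG_FREE(digest_buffer);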

@behlendorf added the Type: Defect label Mar 19, 2019
@cfzhu
Contributor

cfzhu commented Apr 2, 2019

Hi @geppi, sorry for the late reply. I found some new problems and have committed the fixes to my repository: https://github.com/cfzhu/zfs/tree/qat. Could you try it again? Thank you.

behlendorf pushed a commit that referenced this issue Apr 16, 2019
1. Support QAT when ZFS is the root file-system:
   When the ZFS module is loaded before the QAT driver has been
   started, QAT support can be enabled later at runtime, e.g.:
   echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
2. Verify the Adler checksum of the decompression result.
3. Allocate the digest, IV and AAD buffers in physically contiguous
   memory via QAT_PHYS_CONTIG_ALLOC.
4. Update the documentation for zfs_qat_compress_disable,
   zfs_qat_checksum_disable and zfs_qat_encrypt_disable.

Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Weigang Li <[email protected]>
Signed-off-by: Chengfeix Zhu <[email protected]>
Closes #8323 
Closes #8610