
Slow write performance with zfs 0.8 #8836

Closed
mabod opened this issue May 30, 2019 · 55 comments
Labels
Type: Performance Performance improvement or performance problem

Comments

@mabod

mabod commented May 30, 2019

System information

Type Version/Name
Distribution Name Manjaro
Distribution Version Testing
Linux Kernel 4.19.46-1-MANJARO
Architecture x86_64
ZFS Version 0.8.0-1
SPL Version 0.8.0-1

Describe the problem you're observing

I do frequent fio benchmarks with my pool "zstore" and just realized that write performance is dropping with zfs version 0.8.

With zfs version 0.7.13 I typically got around 230-250 write IOPS:

fio-output-zstore-32G-2019-05-15@06:52:  read: IOPS=240, BW=240MiB/s (252MB/s)(32.0GiB/136347msec)
fio-output-zstore-32G-2019-05-15@06:52:  write: IOPS=233, BW=234MiB/s (245MB/s)(32.0GiB/140079msec); 0 zone resets
fio-output-zstore-32G-2019-04-06@19:53:  read: IOPS=280, BW=281MiB/s (294MB/s)(32.0GiB/116694msec)
fio-output-zstore-32G-2019-04-06@19:53:  write: IOPS=254, BW=254MiB/s (267MB/s)(32.0GiB/128766msec); 0 zone resets
fio-output-zstore-32G-2019-03-13@15:12:  read: IOPS=286, BW=286MiB/s (300MB/s)(32.0GiB/114442msec)
fio-output-zstore-32G-2019-03-13@15:12:  write: IOPS=269, BW=270MiB/s (283MB/s)(32.0GiB/121379msec); 0 zone resets
fio-output-zstore-32G-2019-03-09@11:02:  read: IOPS=296, BW=296MiB/s (311MB/s)(32.0GiB/110551msec)
fio-output-zstore-32G-2019-03-09@11:02:  write: IOPS=249, BW=249MiB/s (262MB/s)(32.0GiB/131339msec); 0 zone resets
fio-output-zstore-32G-2019-03-08@14:28:  read: IOPS=305, BW=305MiB/s (320MB/s)(32.0GiB/107366msec)
fio-output-zstore-32G-2019-03-08@14:28:  write: IOPS=243, BW=243MiB/s (255MB/s)(32.0GiB/134811msec); 0 zone resets

With zfs version 0.8 I only get 160-190 write IOPS:

fio-output-zstore-0.8-32G-2019-05-30@11:01:  read: IOPS=265, BW=265MiB/s (278MB/s)(32.0GiB/123489msec)
fio-output-zstore-0.8-32G-2019-05-30@11:01:  write: IOPS=191, BW=192MiB/s (201MB/s)(32.0GiB/170900msec); 0 zone resets
fio-output-zstore-0.8-32G-2019-05-30@10:45:  read: IOPS=278, BW=278MiB/s (292MB/s)(32.0GiB/117837msec)
fio-output-zstore-0.8-32G-2019-05-30@10:45:  write: IOPS=160, BW=161MiB/s (168MB/s)(32.0GiB/204095msec); 0 zone resets
fio-output-zstore-0.8-32G-2019-05-29@08:12:  read: IOPS=270, BW=270MiB/s (283MB/s)(32.0GiB/121249msec)
fio-output-zstore-0.8-32G-2019-05-29@08:12:  write: IOPS=181, BW=181MiB/s (190MB/s)(32.0GiB/180892msec); 0 zone resets

The read IOPS seem unchanged, staying in the range of 260-280. Where is this write performance difference coming from?

Here are the pool details:

zfs recordsize is 1M. No compression. No dedup

30# zpool status
  pool: zstore
 state: ONLINE
  scan: scrub repaired 0B in 0 days 16:57:34 with 0 errors on Mon Apr  1 23:52:01 2019
config:

	NAME                     STATE     READ WRITE CKSUM
	zstore                   ONLINE       0     0     0
	  mirror-0               ONLINE       0     0     0
	    sdb-WD-WCC4E5HF3P4S  ONLINE       0     0     0
	    sdc-WD-WCC4E1SSP28F  ONLINE       0     0     0
	  mirror-1               ONLINE       0     0     0
	    sdd-WD-WCC4E1SSP6NC  ONLINE       0     0     0
	    sda-WD-WCC7K7EK9VC4  ONLINE       0     0     0

errors: No known data errors

43# zfs get all zstore 
NAME    PROPERTY              VALUE                 SOURCE
zstore  type                  filesystem            -
zstore  creation              Di Jan 23 14:39 2018  -
zstore  used                  6,76T                 -
zstore  available             268G                  -
zstore  referenced            96K                   -
zstore  compressratio         1.03x                 -
zstore  mounted               yes                   -
zstore  quota                 none                  default
zstore  reservation           none                  default
zstore  recordsize            1M                    local
zstore  mountpoint            /mnt/zstore           local
zstore  sharenfs              off                   default
zstore  checksum              on                    default
zstore  compression           lz4                   local
zstore  atime                 on                    local
zstore  devices               on                    default
zstore  exec                  on                    default
zstore  setuid                on                    default
zstore  readonly              off                   default
zstore  zoned                 off                   default
zstore  snapdir               hidden                default
zstore  aclinherit            restricted            default
zstore  createtxg             1                     -
zstore  canmount              on                    default
zstore  xattr                 sa                    local
zstore  copies                1                     default
zstore  version               5                     -
zstore  utf8only              off                   -
zstore  normalization         none                  -
zstore  casesensitivity       sensitive             -
zstore  vscan                 off                   default
zstore  nbmand                off                   default
zstore  sharesmb              off                   default
zstore  refquota              none                  default
zstore  refreservation        none                  default
zstore  guid                  10936391047855543944  -
zstore  primarycache          all                   default
zstore  secondarycache        all                   default
zstore  usedbysnapshots       0B                    -
zstore  usedbydataset         96K                   -
zstore  usedbychildren        6,76T                 -
zstore  usedbyrefreservation  0B                    -
zstore  logbias               latency               default
zstore  objsetid              51                    -
zstore  dedup                 off                   default
zstore  mlslabel              none                  default
zstore  sync                  standard              default
zstore  dnodesize             legacy                default
zstore  refcompressratio      1.00x                 -
zstore  written               96K                   -
zstore  logicalused           6,99T                 -
zstore  logicalreferenced     42K                   -
zstore  volmode               default               default
zstore  filesystem_limit      none                  default
zstore  snapshot_limit        none                  default
zstore  filesystem_count      none                  default
zstore  snapshot_count        none                  default
zstore  snapdev               hidden                default
zstore  acltype               posixacl              local
zstore  context               none                  default
zstore  fscontext             none                  default
zstore  defcontext            none                  default
zstore  rootcontext           none                  default
zstore  relatime              on                    local
zstore  redundant_metadata    all                   default
zstore  overlay               off                   default
zstore  encryption            off                   default
zstore  keylocation           none                  default
zstore  keyformat             none                  default
zstore  pbkdf2iters           0                     default
zstore  special_small_blocks  0                     default

Describe how to reproduce the problem

I am using the following fio option files for read and write with a SIZE of 32G:

41# cat fio-bench-generic-seq-read.options 
[global]
bs=1m
ioengine=libaio
invalidate=1
refill_buffers
numjobs=1
fallocate=none
size=${SIZE}

[seq-read]
rw=read
stonewall

45# cat fio-bench-generic-seq-write.options 
[global]
bs=1m
ioengine=libaio
invalidate=1
refill_buffers
numjobs=1
fallocate=none
size=${SIZE}

[seq-write]
rw=write
stonewall
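
The option files are driven by fio itself; a typical invocation (the exact command line is an assumption here, not quoted from the issue) exports SIZE so that fio expands ${SIZE} in the job file:

SIZE=32G fio fio-bench-generic-seq-read.options
SIZE=32G fio fio-bench-generic-seq-write.options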
@gmelikov
Member

Did you test on the same kernel version? Looks like #8793.

@mabod
Author

mabod commented May 30, 2019

The values I am showing here are all from kernel 4.19. I have a few numbers for kernel 5.0 which basically confirm the kernel 4.19 numbers. There is no significant difference by kernel version.

But the zfs version makes a big difference: write IOPS are down to about 70% with zfs 0.8. The average write IOPS is 249 with zfs version 0.7.13 and 175 with version 0.8.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label May 30, 2019
@sjuxax

sjuxax commented May 30, 2019

Since you're using 4.19.46, this is probably #8793 as mentioned above. The symbol export that allowed SIMD-accelerated checksums was removed from the 4.19 branch with 4.19.38. Maybe set checksum=off for the duration of the benchmark and see if that changes things?
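
For reference, a minimal way to try that on the pool from this issue would be something like the following sketch (zfs inherit puts the property back to its default afterwards):

zfs set checksum=off zstore
# re-run the fio write benchmark here
zfs inherit checksum zstore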

@behlendorf
Contributor

If this is caused by the lack of SIMD support then you should be able to see the same drop in performance using 0.7.13 and the 4.19.46 kernel. It would be good to know either way.

@mabod
Author

mabod commented May 30, 2019

I did two runs with checksum=off and it does NOT make a difference. Write performance is still down to about 70%.

My benchmark numbers for version 0.7.13 are from kernels 4.19.42, 4.19.34, 4.19.28 and 4.19.26 (following the Manjaro Testing upgrades). The benchmark numbers for version 0.8 are only for kernel 4.19.46.

Are you suggesting that this is a kernel regression?

@behlendorf
Contributor

Since you achieved the expected performance using 0.7.13 and the 4.19.42 kernel that should rule out the kernel's SIMD changes as a cause. Further investigation is going to be needed to determine exactly why you're seeing a drop in write performance.

@johnnyjacq16

The zfs manpage suggests considering a change of the dnodesize property to auto.
Below is the relevant part of the dnodesize section of the manpage:
Consider setting dnodesize to auto if the dataset uses the xattr=sa property setting and the workload makes heavy use of extended attributes. This may be applicable to SELinux-enabled systems, Lustre servers, and Samba servers, for example. Literal values are supported for cases where the optimal size is known in advance and for performance testing.
Also, the recordsize of the dataset is 1M. I think this can cause issues depending on what you are storing in that dataset, since ZFS is a copy-on-write file system. Changing the recordsize of the dataset requires removing and re-copying all files so that every file on the dataset uses the new recordsize value.
ZFS supports a recordsize of up to 16 MiB; to get this, change the zfs_max_recordsize module parameter.
To view the current value: cat /sys/module/zfs/parameters/zfs_max_recordsize
To change it [I DO NOT RECOMMEND IT]: echo <your value> > /sys/module/zfs/parameters/zfs_max_recordsize
To get 16 MiB the value should be echo $((16 * 1024 * 1024)), which is 16777216.
Raising the default value of 1048576 (echo $((1 * 1024 * 1024))) to a bigger value causes issues when deleting files.
Note: if changing 16 MiB to another value, it would be echo $((<your value in MiB> * 1024 * 1024)).
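
Regarding the dnodesize suggestion above, a minimal sketch for the pool in this issue (note that a changed dnodesize only applies to newly created files) would be:

zfs set dnodesize=auto zstore
zfs get dnodesize,xattr,recordsize zstore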

@mabod
Author

mabod commented May 30, 2019

The system is always idle when I do the tests. I have been running these benchmarks for a while, but unfortunately I have only kept the logs since March of this year. As far as I remember the results have always been comparable, even with recordsize 128k. Of course there is always some variance in the values, but a performance decrease of 30% is a significant change.

@johnnyjacq16

Look at the history of the pool with zpool history <your pool name> | less and look for changes made around the time the performance decreased; it may help.

@mabod
Author

mabod commented May 31, 2019

There is nothing in the history other than the regular import or snapshot commands.

@mabod
Author

mabod commented May 31, 2019

I did some more tests, also with another pool. The other pool (zf1) is a raidz2 with 6 drives in an external USB case. The interesting finding for me is that this pool is NOT showing any performance difference, while I certainly do see write performance issues with the internal pool (zstore).

I compared the output of "zfs get all" for both zstore and zf1 and there is no important difference other than mountpoint and such. Basic parameters are all the same.

I also double-checked that checksum=on/off does not make a difference.

Once again some results for zstore:

old (good) values with zfs 0.7.13:

1  write: IOPS=255, BW=256MiB/s (268MB/s)(32.0GiB/128135msec); 0 zone resets
2  write: IOPS=238, BW=239MiB/s (250MB/s)(32.0GiB/137293msec); 0 zone resets
3  write: IOPS=245, BW=245MiB/s (257MB/s)(32.0GiB/133739msec); 0 zone resets
4  write: IOPS=243, BW=243MiB/s (255MB/s)(32.0GiB/134811msec); 0 zone resets
5  write: IOPS=249, BW=249MiB/s (262MB/s)(32.0GiB/131339msec); 0 zone resets
6  write: IOPS=269, BW=270MiB/s (283MB/s)(32.0GiB/121379msec); 0 zone resets
7  write: IOPS=254, BW=254MiB/s (267MB/s)(32.0GiB/128766msec); 0 zone resets
8  write: IOPS=233, BW=234MiB/s (245MB/s)(32.0GiB/140079msec); 0 zone resets

new (bad) values with zfs 0.8.0:

 1 write: IOPS=174, BW=175MiB/s (183MB/s)(32.0GiB/187521msec); 0 zone resets
 2 write: IOPS=188, BW=188MiB/s (197MB/s)(32.0GiB/174175msec); 0 zone resets
 3 write: IOPS=203, BW=204MiB/s (213MB/s)(32.0GiB/160953msec); 0 zone resets
 4 write: IOPS=205, BW=206MiB/s (216MB/s)(32.0GiB/159290msec); 0 zone resets
 5 write: IOPS=191, BW=192MiB/s (201MB/s)(32.0GiB/170795msec); 0 zone resets
 6 write: IOPS=159, BW=160MiB/s (168MB/s)(32.0GiB/204952msec); 0 zone resets
 7 write: IOPS=180, BW=181MiB/s (190MB/s)(32.0GiB/181212msec); 0 zone resets
 8 write: IOPS=194, BW=194MiB/s (204MB/s)(32.0GiB/168825msec); 0 zone resets
 9 write: IOPS=215, BW=216MiB/s (226MB/s)(32.0GiB/151945msec); 0 zone resets
10 write: IOPS=194, BW=195MiB/s (204MB/s)(32.0GiB/168349msec); 0 zone resets
11 write: IOPS=203, BW=204MiB/s (214MB/s)(32.0GiB/160770msec); 0 zone resets
12 write: IOPS=205, BW=206MiB/s (216MB/s)(32.0GiB/159360msec); 0 zone resets

@johnnyjacq16

Let ZFS report what is happening on the pool and on each vdev with zpool iostat -vl <your pool> .1. This auto-refreshes every 0.1 seconds (the interval can be changed to any value) and shows all I/O activity on each vdev together with latency info.

Also use zpool iostat -vq <your pool> .1, which shows queue info, i.e. I/O that is waiting to be issued to the disks.

With zpool iostat -vr <your pool> .1, the -r option shows request size histograms for the leaf vdevs' I/O, including individual I/Os (ind) and aggregate I/Os (agg). These stats can be useful for observing how well I/O aggregation is working.

zpool iostat -c can run a number of check scripts; you can query SMART, ATA and NVMe attributes, for example zpool iostat -c nvme_err.
If you see the error "Can't run -c with root privileges unless ZPOOL_SCRIPTS_AS_ROOT is set.",
run ZPOOL_SCRIPTS_AS_ROOT=1 zpool iostat -c nvme_err instead.

Monitor ZFS while it is working (cache info, memory status, etc.):
cat /proc/spl/kstat/zfs/arcstats
To make it auto-refresh:
watch -n .1 cat /proc/spl/kstat/zfs/arcstats

Also make sure your ashift value is accurate by checking blockdev --getpbsz /dev/sdX for all your disks;
the blockdev command reports the physical block (sector) size.

Ashift info below:
At pool creation, ashift=12 should always be used, except with SSDs that have 8k sectors where ashift=13 is correct. A vdev of 512 byte disks using 4k sectors will not experience performance issues, but a 4k disk using 512 byte sectors will. Since ashift cannot be changed after pool creation, even a pool with only 512 byte disks should use 4k because those disks may need to be replaced with 4k disks or the pool may be expanded by adding a vdev composed of 4k disks. Because correct detection of 4k disks is not reliable, -o ashift=12 should always be specified during pool creation. See the ZFS on Linux FAQ for more details.
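
A quick way to cross-check the sector sizes and the ashift actually in use is sketched below; the device names follow the zpool status output earlier in this thread, and zdb is one common way to read the pool's ashift:

for d in /dev/sd{a,b,c,d}; do echo -n "$d: "; blockdev --getpbsz "$d"; done
zdb -C zstore | grep ashift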

@richardelling
Contributor

NB, running zpool iostat with a short interval (eg < zfs_txg_timeout) is almost always a waste of effort.
Also, the output of a bunch of CLI collectors is difficult to grok.

A better solution is to use one of the telemetry collectors, telegraf or node_exporter, to collect the data and forward it to a TSDB like influxdb or prometheus, and then analyze it with tools like grafana.
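
For example, a minimal telegraf setup using its stock zfs input plugin could look like the sketch below (the plugin options come from telegraf's documentation, not from this thread; --test prints one round of collected metrics to stdout):

cat > /tmp/telegraf-zfs.conf <<'EOF'
[[inputs.zfs]]
  kstatPath   = "/proc/spl/kstat/zfs"
  poolMetrics = true

[[outputs.file]]
  files = ["stdout"]
EOF
telegraf --config /tmp/telegraf-zfs.conf --test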

@johnnyjacq16

johnnyjacq16 commented Jun 1, 2019

@richardelling Could a telemetry collector, a TSDB and an analysis tool be implemented in ZFS itself, since working with iostat is a waste of effort and difficult to grok? I would like to know that all the tools and features in ZFS are useful and can give me meaningful information about ZFS.

I have installed telegraf, which is just pulling information from /proc/spl/kstat/zfs; I believe a tool in ZFS itself could do that and display a graph-like representation of the information, including what is happening on all vdevs. That would also be useful for troubleshooting performance without the full weight of influxdb or prometheus and grafana.

@richardelling
Contributor

No, that is a really bad idea and runs counter to the UNIX philosophy. Today ZFS makes stats available, but reading them is not a free operation, and designing a monitoring system has to meet very different business requirements. For this reason it is best to have integration with the best-in-class monitoring systems. I only mentioned a few of the popular open source tools; there are many more on the market.

@Setsuna-Xero

For what it's worth, I've also seen huge performance decreases on my pool. Write speed has throttled down to 30MB/s from 600MB/s+.
0.8rc3 and kernel 4.9.16-gentoo.

If you've got a reasonable method for me to collect performance data I will also assist in this.

    NAME                                                  STATE     READ WRITE CKSUM
    zebras                                                ONLINE       0     0     0
      mirror-0                                            ONLINE       0     0     0
        ata-WDC_WD60EDAZ-11BMZB0_WD-WX61D88AZET6          ONLINE       0     0     0
        ata-WDC_WD60EFRX-68L0BN1_WD-WX51D88NL080          ONLINE       0     0     0
      mirror-1                                            ONLINE       0     0     0
        ata-WDC_WD60EDAZ-11BMZB0_WD-WX61DB72TP5S          ONLINE       0     0     0
        ata-WDC_WD60EFRX-68L0BN1_WD-WXB1HB4JKAM6          ONLINE       0     0     0
      mirror-2                                            ONLINE       0     0     0
        ata-WDC_WD60EFRX-68L0BN1_WD-WX71DB8KYUPY          ONLINE       0     0     0
        ata-WDC_WD60EFRX-68MYMN1_WD-WX21D9421XU3          ONLINE       0     0     0
    special
      mirror-3                                            ONLINE       0     0     0
        ata-KINGSTON_SA400S37240G_50026B76824BC4D8-part2  ONLINE       0     0     0
        ata-KINGSTON_SA400S37240G_50026B76824BC5F3-part2  ONLINE       0     0     0
    logs
      mirror-4                                            ONLINE       0     0     0
        ata-KINGSTON_SA400S37240G_50026B76824BC5F3-part4  ONLINE       0     0     0
        ata-KINGSTON_SA400S37240G_50026B76824BC4D8-part4  ONLINE       0     0     0

All direct attached from a Dell Perc h310 controller in IT mode.

@behlendorf
Contributor

@Setsuna-Xero do I understand correctly that you see this performance drop for both 0.8.0-rc3 and the 0.8.0 tag?

@Setsuna-Xero

@behlendorf
Sorry I forgot to include the previous kernel:
4.12.13 on 0.8rc3

However, I will be moving this array to another server with a 4.19.41 kernel as soon as the drive cages arrive.

@mabod
Author

mabod commented Jun 7, 2019

It is striking to me that @Setsuna-Xero is also seeing the performance drop with a RAID10-style pool of mirrors. Can it be that the RAID level makes the difference? I have another pool, a RAIDZ2, which is not showing a performance drop.

@amissus

amissus commented Jun 7, 2019

I also have a write performance problem after upgrading from 0.7.19 to 0.8.0. I tried with an older kernel to exclude the missing-SIMD problem, and my system is completely idle. Rsyncing the same VM image from a dedicated disk to the ZFS pool:
0.7.19, performance as expected: [screenshot]
0.8.0, performance bad: [screenshot]

@Setsuna-Xero

I'm getting 2-3MB/s with cp/cq and tar; rsync gets an order of magnitude more, right around 30MB/s.
Previously, on whatever 0.7.x pool I had built from two of these disks, I would get over 100MB/s write speed on a single mirror. Once I moved to this pool I got approximately 600MB/s, which then fell off to 20-30MB/s sometime after a kernel bump and the move to 0.8.0-rc3.

@herf

herf commented Jul 17, 2019

I benchmarked sequential writes on a 6-disk RAIDz2 (all HDD) using Proxmox 6 with ZFS 0.8.1 and Kernel 5.0. The array struggled to maintain the single-disk sequential speed, around 200MB/sec.

An older ZoL build (0.7.13 with older kernel) shows more than double the speed with the same configuration, around 450MB/sec.

@interduo

The 0.6.x branch was spinning like a tornado.
The 0.7.x branch dropped performance by about 30%.
And now there is the next performance drop.

Has somebody compiled and tested the master branch with commit e5db313?

@lukegalea

Another "me too" over here. After upgrading to the newest Proxmox (with Zol 0.8.1) I can't sustain write speeds for more than a few seconds before they tank and I get lockups.

[screenshot]

@herf

herf commented Jul 29, 2019

Documenting this in case it helps. Seems clear that this is related to lack of SIMD - higher RAID-Z levels use a lot of CPU and scalar perf isn't enough.

cat /proc/spl/kstat/zfs/vdev_raidz_bench ("scalar" row) on a Xeon 4108:

gen_p (RAID-Z) is 1.13GB/sec
gen_pq (RAID-Z2) is 290MB/sec
and gen_pqr (RAID-Z3) is 132MB/sec.

SIMD makes everything 5-7x faster, so restoring SIMD should help this problem.
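
For anyone who wants to check what their module selected, the relevant kstats and module parameters can be read directly (a sketch; the fletcher_4 entries are the checksum counterparts of the raidz benchmark quoted above):

cat /proc/spl/kstat/zfs/vdev_raidz_bench
cat /proc/spl/kstat/zfs/fletcher_4_bench
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
cat /sys/module/zfs/parameters/zfs_fletcher_4_impl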

@interduo

@amissus Which version did you test and show results for? 0.7.19 does not exist.

@amissus

amissus commented Jul 29, 2019

I'm sorry, version 0.7.13 has expected performance for me and >= 0.8 has degraded and unstable performance.

@msLinuxNinja

What exactly am I reading here?

[root@hostname~]# cat /proc/spl/kstat/zfs/vdev_raidz_bench
18 0 0x01 -1 0 5551518943 1459035087503366
implementation   gen_p           gen_pq          gen_pqr  
original         383443168       135674622       67712690 
scalar           1682391699      530611710       228126033
fastest          scalar          scalar          scalar   

@interduo

@behlendorf this issue was created on 30 May, and the fix for it landed in the master branch on 12 Jul. This is a very important case for us users. When do you plan to do the next ZFS release containing this commit?

What is the project policy for releases?
I didn't find any information about it on GitHub or the ZoL website.

@faern

faern commented Sep 29, 2019

It seems zfs 0.8.2 was released, but without the fix in e5db313 😢
I don't know the reason for it not being included. But it seems there will be a few more months with crawling performance.

@gmelikov
Member

@faern #9346

@DannCos

DannCos commented Oct 3, 2019

Does this issue concern the kernel 3.10.0-1062.1.1.el7.x86_64 as well?

@behlendorf
Contributor

@DannCos the 3.10.0-1062.1.1.el7.x86_64 kernel is not affected by this issue.

@lucasRolff

lucasRolff commented Oct 27, 2019

I decided to conduct some tests under CentOS 7 (with 3.10.0-1062.1.1.el7.x86_64). The reason was that I replaced a storage server running 0.7 with one running 0.8 and experienced slow read performance on the new system.

Old server:

  • Intel(R) Xeon(R) CPU E5-2420
  • 64GB DDR3 ECC RAM
  • 8x8TB spinning enterprise disks (In raidz2)
  • 4x 240GB SSD (2 for OS, and 2 split up for ZIL and l2arc)

New server:

  • Intel(R) Xeon(R) CPU E5-1650 v3
  • 128GB DDR4 ECC RAM
  • 10x10TB spinning enterprise disks (In raidz2)
  • 2x960GB Enterprise NVMe SSD (partitioned for OS, ZIL and l2arc)
  • 10g networking

Both servers use about 20TB of storage and store 280 million files.

The old server would restore a 1GB backup with 100k files in about 1.5 minutes, whereas the new one takes 17 minutes for the same folder.

Note: writes seem to be decent on both systems; reads are the main thing affected.

Both tests were performed on an idle system right after a reboot (to ensure that no cache or anything got hit).

atime turned off, lz4 compression turned on, dedup off.

It made me search and I found this thread regarding performance issues, so I wanted to test out various versions of ZoL as well as ZFS on FreeBSD 12.

For this I set up another machine:

  • Intel(R) Core(TM) i7-2600
  • 16GB DDR3 RAM
  • 4x4TB spinning enterprise disks

All tests below use the same zpool create parameters: atime=off, dedup=off, compression=lz4, ashift=12, with a reboot performed between every test.

The test directory structure is 11294 megabytes and 311153 inodes.

It's also worth noting that the only data being stored on the pool is the test directory structure, nothing else - whether performance becomes worse as the dataset grows, I don't know (Hopefully it doesn't).

Backup/restore is performed using rsync on a local network (1gigabit/s) with no other communication happening:

ZFS 0.6 (Installed via Ubuntu 16.04):

zfs striped mirror backup: 2 min 1 sec
zfs striped mirror restore: 3 min 39 sec
zfs raidz2 backup: 2 min 22 sec
zfs raidz2 restore: 3 min 36 sec
zfs striped backup: 2 min 15 sec
zfs striped restore: 3 min 26 sec

ZFS 0.7 (Installed via CentOS 7.7 using zfs-release.el7_6):

zfs striped mirror backup: 2 min 8 sec
zfs striped mirror restore: 3 min 16 sec
zfs raidz2 backup: 2 min 10 sec
zfs raidz2 restore: 3 min 18 sec
zfs striped backup: 2 min 8 sec
zfs striped restore: 3 min 23 sec

ZFS 0.8 (Installed via CentOS 7.7 using zfs-release.el7_7):

zfs striped mirror backup: 2 min 9 sec
zfs striped mirror restore: 4 min 45 sec
zfs raidz2 backup: 2 min 9 sec
zfs raidz2 restore: 5 min 54 sec
zfs striped backup: 2 min 8 sec
zfs striped restore: 5 min 24 sec

Backup times (writing to ZFS) seem to stay pretty consistent in my case, likely also being limited by the 1G link between the machines; the average is about 2 minutes and 10 seconds, or about 700mbps.

What surprises me about the drop between 0.7 and 0.8 is the read performance, especially for raidz2: from 3 minutes 18 seconds to 5 minutes 54 seconds. That's a 78% increase in restore time.

Just for fun, I gave FreeBSD 12 a try:

zfs striped mirror: 4 min 22 sec
zfs raidz2: 5 min 20 sec
zfs striped: 5 min 4 sec

Whether it performs better under FreeBSD 11.x I haven't had the time to test yet.

Now, I'd expect performance to be roughly the same on the same hardware.

The tests I conducted still do not explain the massive slowdown I experience between the two real systems with more powerful hardware; hopefully adding more memory to a system (64 vs 128GB) shouldn't make performance worse.

I know this issue is mainly related to write performance; however, I do find it important that read performance gets mentioned as well, especially under 3.10.0-1062.1.1.el7.x86_64, which should not be affected by the SIMD issue.

It makes me believe that there may be some other regression between 0.7 and 0.8, besides SIMD, that affects overall performance as well.

If people want me to test with other settings, I'm more than happy to do so. Ideally I want my backup server to remain snappy so that, if restores are needed, they can actually be performed rather quickly.

@interduo

@lucasRolff could you do a benchmark of the 0.8.3 version which was released a few days ago?
Is this issue fully resolved?

@mabod
Author

mabod commented Jan 27, 2020

I just did a test with 0.8.3 and kernel 5.4.14. I do see better IOPS.

Average of 7 runs:

read: 301 IOPS (lowest out of seven: 273)
write: 209 IOPS (lowest out of seven: 192)

This is certainly better than what I had before (#8836 (comment)).

The read speed is very good, at the same level as or better than 0.7.13. But the write speed is still behind 0.7.13.

@msLinuxNinja

@mabod out of curiosity, how did you run the benchmark? Just want to compare results.

@mabod
Author

mabod commented Jan 27, 2020

I explained it in this thread. It is a fio benchmark. The fio option files are in this thread too.

@lucasRolff

@interduo - I moved my backup servers to 100% SSD storage and am (sadly) using hardware RAID 6 :)

Eventually, I'll give ZFS a try again on spinning disks and see how it performs.

@FlorianHeigl

Does not sound like SIMD is the only problem here:

zfs striped mirror restore (0.6): 3 min 39 sec
zfs striped mirror restore (0.8): 4 min 45 sec

@interduo

@FlorianHeigl
@mabod

Did you do your tests on the 0.8.4 release? Could you post the results?

@mabod
Author

mabod commented May 20, 2020

I cannot compare my test results anymore because I have replaced all 4 HDDs in that RAID10 in the meantime. Sorry.

@FlorianHeigl

FlorianHeigl commented May 20, 2020

@interduo I was thinking that the SIMD issue would only really affect RAID-Z/compression/encryption but not a mirror, and so it might be something else.
Re-reading this now, I don't think that is actually the case.

I'm not sure if I can quickly run a few tests; if yes, I'll update.

@RJVB

RJVB commented Jul 19, 2020

I'm a bit late to this party...

For those of us building our own kernel for private use, is it possible to avoid "the SIMD issue" by reintroducing the symbols that are no longer exported and, if so, how would you do that?

I've been running 0.8.4 on 4.14.23 for about a week now (with the impression that reads seem a bit faster compared to 0.7.12, writing probably slower, judging from compiler job durations). I'm building kernel 4.19.133 as we speak so now would be a good time to restore those SIMD exports...

@RJVB

RJVB commented Jul 20, 2020 via email

@mskarbek
Contributor

@RJVB it is, but not in 4.14.0; the change was made as a backport to some later version, I can't remember which one right now.
There is also a second patch for the newer kernels: https://github.com/NixOS/nixpkgs/blob/master/pkgs/os-specific/linux/kernel/export_kernel_fpu_functions_5_3.patch
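
For reference, applying such a patch to a self-built kernel tree is the usual patch -p1 workflow, roughly as sketched below (the source path is an assumption and the right patch file depends on the kernel version, as noted above):

cd /usr/src/linux-4.19.133
patch -p1 < /path/to/export_kernel_fpu_functions_5_3.patch
# then rebuild and install the kernel and rebuild the ZFS kmods against it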

@RJVB

RJVB commented Jul 20, 2020 via email

@mskarbek
Contributor

@RJVB yes, OpenZFS checks each kernel capability individually during the build process, regardless of the kernel version.

@RJVB

RJVB commented Jul 20, 2020 via email

@RJVB

RJVB commented Jul 22, 2020

I take it this patch has been tested with ZFS?

After working around the build failure I could finally boot a VM into my new 4.19 kernel, with the ZFS 0.8.4 kmods ready to roll. The VM runs under VirtualBox, using "raw disk" access to actual external drives connected via USB3. When I imported a pool (created recently by splitting off a dedicated mirror vdev from my main Linux rig's root pool) I discovered it had a number of corrupted items.

I don't know if the corruption occurred during the previous time I'd used that pool, or during import. Curiously, the identified items were all directories (in a dataset that has copies=1 because it has its own registry that doubles as an online backup), and the errors could be cleared by making an identical copy (cp -prd /path/to/foo{,.bak}) and then replacing the original with that clone. I don't have the impression I lost anything... The remaining items don't seem to correspond to existing files; some are of the type "metadata:".

Can I suppose that every single directory on (at least) every single dataset with copies=1 would have been affected if this were due to an issue with my kernel patches *) or the workarounds I applied to get the ZFS kmods to build?

*): I also use the ConKolivas patches (which I had to refactor for 4.19.133) and a patch to make zswap use B-Trees.

@TimLand

TimLand commented Nov 24, 2020

mark

@mabod
Author

mabod commented May 21, 2021

I am closing this issue. It was for version 0.8.0, which has been obsolete for a long time.

@mabod mabod closed this as completed May 21, 2021