Poor performance on 0.6.3 #3261

Closed
chjohnst opened this issue Apr 6, 2015 · 11 comments
Labels
Type: Documentation, Type: Performance

Comments


chjohnst commented Apr 6, 2015

Currently running 0.6.3 on CentOS 6.6, where I am seeing pretty poor performance with various workloads on my ZFSoL setup. I am using slightly older hardware here: a 60-bay JBOD with 6Gb/s connectors and all NL-SAS 3.5" 7200 RPM drives. Two SAS cards are connected to the JBOD in an active/active multipath setup. This host previously ran Solaris-based ZFS; since we converted it to ZFSoL our performance has not been very good.

A few things I have tried (rough commands are sketched after the list):

  • setting my ARC cache to between 1/4 and 2/3 of memory
  • installed cache drives for reads
  • tried sync=disabled and sync=standard
  • bumped my NFS thread counts up to 32 from the defaults
  • clients are mounting async and NFSv4
  • compression=lz4
  • dedup off
  • played with enabling and disabling prefetch, no change
  • adjusting the record size has helped somewhat, but this will vary depending on my workloads; I am trying to improve NFS client access for users doing lots of small reads/writes
  • five 8+3 raid-z vdevs in a single pool for a single namespace
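
For reference, a rough sketch of the commands behind the tweaks above (the pool/dataset name tank/data and the 128GiB ARC value are placeholders for this example, not the actual configuration):

# cap the ARC (value in bytes; 128GiB shown as an example)
echo 137438953472 > /sys/module/zfs/parameters/zfs_arc_max
# per-dataset tunables
zfs set sync=disabled tank/data      # or sync=standard
zfs set compression=lz4 tank/data
zfs set dedup=off tank/data
zfs set recordsize=16K tank/data     # varies per workload
# toggle prefetch (1 = disabled)
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
# nfsd threads on CentOS 6: set RPCNFSDCOUNT=32 in /etc/sysconfig/nfs
# client side, async NFSv4 mount
mount -t nfs4 -o async server:/data /mnt/data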

Below is the fio command I have tried, varying the rw mode (sequential, randrw, etc.), with a 4k block size to match the kernel page size. My confusion here stems from the fact that my local RAID-1 mirror outperforms 60 spindles by a factor of 4-5x.

fio --directory=/foo --name=abc --rw=write --bs=4k --size=10G --numjobs=1 --time_based --runtime=60

Any suggested tuning in ZFS that I am overlooking?

@GregorKopka
Contributor

  • Since there are some shortcomings in memory handling, it is currently recommended to set the ARC lower than 1/2 of memory.
  • Cache drives will only help if the data is still somewhat hot when it is evicted from ARC into L2ARC.
  • Cache performance will (given that access patterns stay somewhat consistent) increase with uptime.
  • For random r/w performance, mirrors work better, because a raidZ vdev behaves like a single drive for random IOPS (while mirrors can at least serve reads from every drive in the vdev).
  • CoW has overhead compared to other filesystems; overwrites and fsync are more expensive than on traditional filesystems (because all metadata up to the top has to be updated/flushed to disk).
  • Synthetic benchmarks can't tell you about real-world performance in multi-user scenarios.
  • A good place to ask such general questions is the mailing list, not the defect tracker.
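
A quick way to sanity-check what the ARC is actually doing on ZoL (a sketch using the standard kstat path; arcstat.py ships with the ZoL userland):

grep -E '^(size|c|c_max|hits|misses) ' /proc/spl/kstat/zfs/arcstats
arcstat.py 5    # live hit/miss and size figures every 5 seconds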

@chjohnst
Author

Thanks for the reply. I would agree synthetic benchmarks are not always real world; however, my reason for posting here is that my real-world (in this case NFS) performance is very bad on top of ZFS.

I am convinced I am hitting some fundamental flaw or bug in ZFS that I was hoping to get some clarity on. When my users access the filesystem over NFS, and even locally, I am seeing long pauses while the txg_sync thread spins, which eventually blocks my nfsd threads. I have seen a few other people post similar issues against previous versions of ZFS.

Are you saying that because I am using raidz this could be causing my performance issues for random data access patterns? Moving to a RAID10 setup is not totally ideal here but worth experimenting with. Or should I just accept that ZFS is not fit for my NFS workloads and move on?

@GregorKopka
Contributor

A raidZ vdev exposes roughly the read IOPS of a single disk, so the layout of your pool will give you ~600 random IOPS (given that they are evenly distributed over all 5 vdevs). Read seeks for cold data (like directory listings) can saturate the available seek capacity of your system. Maybe take a look at your drives using dstat --disk-util (or something else that gives you insight into how busy your drives are).
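
A few ways to watch how busy the individual spindles are while the workload runs (the pool name tank is a placeholder):

dstat --disk-util 5      # per-disk %util
iostat -x 5              # per-device await and %util (sysstat package)
zpool iostat -v tank 5   # per-vdev / per-disk IOPS and bandwidth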

Mirror setups are faster for reads, since every disk can service reads independently; raidz most likely needs to read from several spindles to get the data.

So if you had structured your pool as 3-way mirrors (nearly the same redundancy as raidZ3; and since mirror resilvers are faster, because all remaining drives can supply data while the new drive is most likely 100% busy writing, my stance is that you can get away with one redundant drive less), you would only have 18 drives' worth of space (compared to 40 in your current setup), but ~2160 writes/s or ~6480 reads/s instead of the ~600 you have now (roughly 3.5-10.5 times more IOPS).
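
The arithmetic behind those numbers, assuming ~120 random IOPS per 7200 RPM NL-SAS drive:

current layout:  5 raidZ3 vdevs x ~120 IOPS   ~  600 random IOPS
3-way mirrors (18 vdevs, 54 drives):
  writes:       18 vdevs  x ~120 IOPS         ~ 2160 IOPS
  reads:        54 drives x ~120 IOPS         ~ 6480 IOPS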

Since you didn't specify how big the disks are: if you have <1TB drives and use mirrors, you could add one SSD (a relatively cheap Samsung or the like) to each mirror to boost reads/s by several orders of magnitude, since ZFS will favour faster devices for reads. Sadly the full potential of this isn't achieved, because so far ZFS doesn't care whether the disk is spinning media or not, so it will fetch metadata from the slow drives too, which delays reads that need information from that metadata.
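
For illustration only, a mixed HDD+SSD mirror would be built roughly like this (pool name and device paths are placeholders):

zpool create tank mirror /dev/disk/by-id/hdd-1 /dev/disk/by-id/ssd-1 \
                  mirror /dev/disk/by-id/hdd-2 /dev/disk/by-id/ssd-2
# or attach an SSD to an existing mirror vdev (a 2-way mirror becomes 3-way)
zpool attach tank /dev/disk/by-id/hdd-1 /dev/disk/by-id/ssd-3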

Another option is to put the hot data on a mirror (or HDD+SSD mirror vdev) pool and use a raidZ pool for archival (send/recv).

Adding SLOG devices (fast SSDs) could help if you're blocked by fsync calls.
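
Adding a mirrored SLOG is a one-liner; the pool name and device paths here are placeholders:

zpool add tank log mirror /dev/disk/by-id/slog-ssd-1 /dev/disk/by-id/slog-ssd-2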

Cheap (per TB of storage) vs. fast (IOPS) vs. reliable (redundancy):
pick two. Currently you have cheap & reliable (raidZ3).

Nevertheless, I would suggest you upgrade to the newly released 0.6.4; the release notes are longer than both my arms (in terms of improvements/fixes), so some bottlenecks will most likely have been removed compared to 0.6.3.

@behlendorf
Contributor

@chjohnst if your issue is NFS performance over ZFS then you'll definitely want to upgrade to the recently released 0.6.4. Performance was improved substantially by the implementation of AIO support in ZFS. See commit cd3939c for additional details.

@behlendorf added the Type: Documentation and Type: Performance labels on Apr 10, 2015
@chjohnst
Author

@behlendorf thanks for the suggestion. I upgraded my server to it last night and started running some tests today. Mind you, this is my enterprise tier-3 storage NAS box for 150 of my users, so I was a bit cautious about upgrading; I tested on another server first and had no issues with the upgrade on CentOS 6.6 (zpool upgrade worked fine as well).
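
For anyone following along, checking the running module version and upgrading the pool format looks roughly like this (the pool name is a placeholder):

modinfo zfs | grep -iw version   # version of the loaded kernel module
zpool upgrade                    # lists pools not using all supported features
zpool upgrade tank               # enables new feature flags (older releases can then no longer import the pool)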

I just did some quick tests on the largish zpool I have set up with 3TB 7200RPM drives, (5) 8+3 raidz vdevs, with the following settings. It seems @GregorKopka is indicating that 3-way mirrors would handle random IO patterns better than what I am currently getting, which is an interesting setup, but I am not sure I can take the hit on space with only 18 usable drives' worth vs 40. It might be something worth testing to see if it fills some use cases down the road.

  • compression=lz4
  • dedupe=off
  • noatime=off
  • relatime=off
  • xattr=sa
  • primarycache=all
  • secondarycache=all
  • 500GB LOG on SSD
  • arc_max set to 60GB on a system with 512GB of memory (should I go smaller?); see the sketch after this list
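
A persistent way to pin that 60GB ARC cap across reboots (a sketch; 64424509440 bytes = 60GiB):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=64424509440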

With the default of sharenfs=on, it took over 27 minutes to untar a source copy of gcc-4.9.2.tar.gz. What I noticed was that the untar kept the LOG device very busy, so my first thought was that even though my NFS client is mounted async, ZFS was still doing sync writes as the data came in.

I then re-exported my dataset with sharenfs="async,rw", remounted my client, and I am seeing better NFS client performance for obvious reasons: the untar took no more than a minute or so. I am just surprised at how poor sync writes are in ZFS compared to other filesystems I have tested against. I guess I could also run with sync=disabled.
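
For reference, the re-export and the sync override look roughly like this (the dataset name tank/data is a placeholder):

zfs set sharenfs="async,rw" tank/data   # server acks writes before they reach stable storage
zfs set sync=disabled tank/data         # skips the ZIL entirely; in-flight data can be lost on a crash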

@chjohnst
Author

@kpande Intel SSD, behind a CCISS controller in a RAID-1 mirror. A simple randrw test in fio pushes around 5k IOPS with a 4k block size, and a sequential writer around 250k IOPS without an issue. I wouldn't say this is the fastest SSD available to me, but it is still pretty decent performance. As a comparison, with other (newer) SSDs in systems across my environment I can easily hit 5-10k IOPS, so I can see how this could bottleneck my sync writes a little, but I am not totally convinced. I guess the fastest way to prove your theory would be to disable the ZIL (sync=disabled) or find faster disks to try and see if that helps.
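
The kind of fio run being compared here would look roughly like this (the directory and job name are made up for the example):

fio --directory=/local-ssd-mirror --name=randrw-test --rw=randrw --bs=4k --size=10G --numjobs=1 --time_based --runtime=60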

@chjohnst
Author

@kpande actually in this system I am using OCZ Talos 2 drives.

http://ocz.com/enterprise/talos-2-sas-ssd

@GregorKopka
Contributor

500GB of SLOG is a waste.
SLOG size should be (writes per second) x (txg time), so usually a few GB are enough.
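
A worked example of that rule of thumb, assuming ~1 GB/s of incoming sync writes and the default 5-second txg interval:

SLOG size ~ write rate x txg time = 1 GB/s x 5 s = 5 GB (so a mirrored 8-16 GB device leaves plenty of headroom)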

About fsync performance:
#2431
#1012

Since you're on 0.6.4 now and still have problems with fsync performance: could you update the OP title to reflect the problem, for easier tracking (so this isn't closed by accident, since 0.6.3 is history now)?

@chjohnst
Author

Hey Gregor,

Thanks for the info. I ended up putting an 8GB Zeus drive, in a mirror, into my JBOD; sacrificing two spares seems worth it.

I'm seeing much better performance now on single-threaded sync writes from NFS. Still can't beat the async performance, of course.

I think this is slightly better, except that the metadata operations over NFS are killing me.

@chjohnst closed this as completed on Dec 5, 2015