NVMe Read Performance Issues with ZFS (submit_bio to io_schedule) #8381
Comments
When testing ZFS, what was your block devices IO scheduler set to?
Normally it should be set to noop (which ZFS should set on its own if you've created your vdevs using full drives, which is recommended) so as to let ZFS do the heavy lifting instead of the regular in-kernel I/O schedulers. In some cases, however, it can be necessary to set the scheduler yourself. I'm not saying it would necessarily explain or fix your issues, but it might be a relevant piece of the puzzle here.
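For reference, a quick way to check both settings (device names are illustrative, and commands assume root):

```sh
# Check the kernel I/O scheduler for each NVMe namespace
for d in /sys/block/nvme*n1/queue/scheduler; do
    echo "$d: $(cat "$d")"
done

# ZFS 0.7.x also exposes its own scheduler hint as a module parameter
cat /sys/module/zfs/parameters/zfs_vdev_scheduler
echo noop > /sys/module/zfs/parameters/zfs_vdev_scheduler
```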
In the data I have presented, all the NVMe drive schedulers were set to none under /sys/block/nvme#n1/queue/scheduler, and the ZFS module parameter zfs_vdev_scheduler was set to noop. I think only achieving half of the available bandwidth for synchronous reads is more than just underperformance at this point. I could see 75-80% as underperformance; 50% is just way too low.

I have been continuing to try to track this issue down. I have taken my timestamp approach and placed timestamps in ZFS dmu.c (dmu_buf_hold_array_by_dnode), vdev_disk.c (vdev_submit_bio_impl), and SPL spl-condvar.c (cv_wait_common). My idea is to narrow down what is causing such drastic latencies in ZFS between when a read enters the ZFS software stack, when the bio request is submitted to the Linux block multiqueue layer, and finally when the io_schedule call occurs.

I am, however, seeing an issue where I am not collecting the same number of timestamps at the vdev_submit_bio_impl call site as at the other two call sites for 1 MB requests over a total of 128 GB of data with 24 I/O threads. I added a variable inside the zio_t struct to flag the zio as a read and set this variable to true, for reads, inside the dmu_buf_hold_array_by_dnode call right after the zio_root call. I also updated the zio_create function to copy the flag from the parent zio to the child if a parent is passed. I am not sure why I am seeing fewer timestamps collected in just this one function, but I am working to figure that out now.

I am sticking with this timestamp approach because nothing is blatantly obvious, when reading the source code, as the cause of the latencies. I thought the issue might be occurring in the SPL taskqs; however, for synchronous reads, there are 8 queues with 12 threads each to service the checksum verify in the read zio pipeline as well as to finally call zio_wait and hit the io_schedule call. Hopefully collecting the timestamps and narrowing down the area of collection will lead to more insight into why this issue is present. Any advice on places to look in the source code, or on why I am only getting a fraction of the timestamps collected at the vdev_submit_bio_impl call, would be greatly appreciated.
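As a rough cross-check that does not require rebuilding the modules, something like the following bpftrace sketch times how long each reading thread stays blocked inside io_schedule. This is adjacent to, but not identical to, the submit_bio-to-io_schedule gap measured by the in-tree timestamps, and the process name is an assumption:

```sh
# Histogram of time spent blocked in io_schedule() per call, in microseconds.
# "xdd" is an assumed process name; change it to match the workload generator.
bpftrace -e '
kprobe:io_schedule /comm == "xdd"/ { @t[tid] = nsecs; }
kretprobe:io_schedule /@t[tid]/ {
    @blocked_us = hist((nsecs - @t[tid]) / 1000);
    delete(@t[tid]);
}'
```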
Have you tried increasing the maximum number of I/O requests active for each device?
Yes, I have done this for zfs_vdev_sync_read_max_active and zfs_vdev_max_active. Currently the amount of queued work does not seem to be the cause of the poor synchronous read performance. ZFS seems more than capable of handing off enough work to the NVMe SSDs (even with default module parameter values). The real issue is that the requests are sitting in the devices' hardware queues for far too long (i.e. between when ZFS issues the requests and when it finally asks for the data). This is why I am trying to figure out what is causing the large latencies in the source code between the calls to submit_bio and io_schedule for a synchronous read request.
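For anyone following along, these limits are runtime-tunable module parameters; the values below are purely illustrative, not recommendations:

```sh
# Raise the per-vdev queue depth limits
echo 256  > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
echo 2048 > /sys/module/zfs/parameters/zfs_vdev_max_active

# Confirm what is currently in effect
grep -H . /sys/module/zfs/parameters/zfs_vdev_*_active
```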
NVMe devices still have a queue depth; maybe it's some issue with that (or another controller issue).
@bwatkinson Did you also try increasing zfs_vdev_sync_read_min_active?
@jwittlincohen I did not increase the min. Is there a reason why you think this would reduce the latency issue I am seeing with read requests sitting in the NVMe device queues?

@prometheanfire The data (on the Google Drive) I presented in the first post shows that the hardware and hardware paths seem perfectly fine. XFS easily gets the full NVMe SSDs' bandwidth for reads with the exact same workloads; of course, I am using Direct I/O with XFS. Also, when reading directly from the devices I do not experience the same issue I am seeing in ZFS. That is also why I believe this has something to do with the ZFS software stack. I have tried using the ZFS 0.8.0-rc, which allows for Direct I/O, but this same issue is still present.
I just wanted to give an update on this and see if there are any other suggestions for where I should narrow my focus in the ZFS source code to resolve the latency issues I am seeing with NVMe SSDs for synchronous reads.

I was able to work out where the missing timestamps for the submit_bio calls were going. I had to trace the zio_t's from the children to the parent to get the correct matching PIDs. I found that some of the submit_bio calls were being handled by the kernel threads in the SPL taskqs. I have attached the new results in hrtime_dmu_submit_bio_io_schedule_data.zip. These timestamps were collected using a single ZPool in a striped (RAID0-style) configuration using 4 NVMe SSDs. The three call sites where the timestamps were collected were the functions dmu_buf_hold_array_by_dnode, vdev_submit_bio_impl, and cv_wait_common. The total data read was 128 GB using 24 I/O threads, with each read request being 1 MB. I set primarycache=none and the recordsize to 1 MB for the ZPool.

In general there is a larger mean and median latency between the submit_bio and io_schedule calls. Surprisingly, it is not as significant as I was expecting in comparison with the gap between the dmu_buf_hold_array_by_dnode call site and the submit_bio call site. One area I am currently focusing on is the SPL taskqs. I have noticed the read queue is dynamic. I am planning on statically allocating all the kernel threads for this queue when the ZFS/SPL modules are loaded to see if this has any effect. Any suggestions or advice on places to focus on in the source code would be greatly appreciated.
Also, I meant to say having the kernel threads in the read taskq statically allocated when the ZPool is created.
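For context, a sketch of the kind of pool layout and read workload described above; the device names, pool name, and the use of fio in place of the instrumented harness are all assumptions:

```sh
# 4-wide striped pool of whole NVMe devices
zpool create -o ashift=12 tank nvme0n1 nvme1n1 nvme2n1 nvme3n1
zfs set recordsize=1M tank
zfs set primarycache=none tank

# ~128 GiB of 1 MiB sequential reads across 24 threads
fio --name=seqread --directory=/tank --rw=read --bs=1M --numjobs=24 \
    --size=5461m --ioengine=psync --group_reporting
```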
@bwatkinson the taskqs are a good place to start investigating. I'd suggest setting the spl_taskq_thread_dynamic module parameter to 0 to rule out the overhead of dynamically spawning taskq threads. I'd also suggest looking at the number of threads per taskq in the I/O pipeline. You may be encountering increased contention on the taskq locks due to the very low latency of your devices. Increasing the number of taskqs and decreasing the number of threads per taskq may help reduce this contention. There is no tuning for this, but it's straightforward to make the change in the source here.
@behlendorf that is the exact source file I am currently modifying for further testing, so I am glad to know I wasn't that far off. At the moment I am just searching for the z_rd_int_* taskqs in spa_taskqs_init (spa.c) and removing the TASKQ_DYNAMIC flag as a quick test. I wasn't sure if I should mess around with the other taskqs and their dynamic settings at the moment, but I will also disable the spl_taskq_thread_dynamic module parameter in further testing to see if this resolves the issue. The contention issue also makes sense; I will explore adjusting the number of queues as well as the threads per queue. Thank you for your help with this.
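A minimal sketch of the non-source-code half of this experiment, assuming the parameter is read-only once the SPL module is loaded (paths and names as commonly shipped with 0.7.x):

```sh
# Disable dynamically-spawned taskq threads at module load time
cat > /etc/modprobe.d/spl.conf <<'EOF'
options spl spl_taskq_thread_dynamic=0
EOF

# After reloading the modules and re-importing the pool, verify the setting
cat /sys/module/spl/parameters/spl_taskq_thread_dynamic

# Inspect the read-interrupt taskqs and their thread counts
grep z_rd_int /proc/spl/taskq
```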
Here are a couple of comments from my observations over time, gleaned mainly from running various tests and benchmarks on tmpfs-based pools.

First off, on highly concurrent workloads, as @behlendorf alluded to, the overhead of managing dynamic taskqs can become rather significant. Also, as the devices used for vdevs become ever faster and lower latency, I think the overhead of the entire zio pipeline in general can start to become a bottleneck w.r.t. overall latency. I've also discovered that builds containing 1ce23dc (which is not in 0.7.x) can experience a sort-of fixed amount of added latency.

In general, however, it seems that as devices get ever faster and lower latency, some of the rationale for pipelining starts to go away and the overhead of doing so becomes an ever larger contributor to user-visible latency. I think this is an area that's ripe for much more extensive performance analysis, and it seems that @bwatkinson's work is a good start.
primarycache=none will kill your read performance. ZFS normally brings in a fair amount of metadata from aggregated reads, and with caching disabled you lose any benefit from that. Use primarycache=all and set a smaller ARC max if you don't like how much memory it takes. Other settings of primarycache are only to be used in dire situations; they are major warning signs that something is wrong.

If you don't have a SLOG, you likely have a pool full of indirect writes. These will also kill read performance, as data and metadata end up fragmented from each other. Get zpool iostat -r output while you are running a zfs send and while you are running a zfs receive, and post your kernel config and zfs get all output; these will help significantly. I suspect you will see a lot of unaggregated 4K reads. If you set the vdev aggregation max to 1.5M or so and they still don't merge, then you have fragmented metadata. zfs send | receive the data to another dataset and repeat the test; if that helps significantly then you know what the problem is.

Hope this helps.
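A sketch of the two checks suggested above (pool name is illustrative; whether the module clamps the aggregation value depends on the release):

```sh
# Per-vdev request-size histograms while the workload runs
zpool iostat -r tank 5

# Raise the vdev aggregation limit to ~1.5 MiB (default is 128 KiB on 0.7.x)
echo $((1536 * 1024)) > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
```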
@janetcampbell I am in no way implying that we are not using the ARC when we run our normal ZPool configurations. Read requests serviced from the ARC are obviously preferable; however, the issue is that not all reads will be serviced from the ARC (i.e. some will hit the underlying NVMe SSD vdevs). When reads do have to go down to the vdev device queues, only half of the total available bandwidth of the underlying devices is achieved. Inevitably not all read requests will be serviced by the ARC, and that is where the primary concern is. This is why I set primarycache=none: to help track down where in the ZFS/SPL code path such large latencies are occurring.
I have also added more data points to the Google Drive: https://drive.google.com/drive/u/2/folders/1t7tLXxyhurcxGjry4WYkw_wryh6eMqdh Unfortunately, adjusting the SPL interrupt queues (number of threads vs. number of queues, and static allocation) did not have a significant impact. I have moved on to trying to identify other possible sources of contention within the ZFS pipeline source. It has been brought to my attention by others experiencing this same issue that read performance dips significantly as more vdevs are added to the ZPool. I have started to explore why this might be the case, as it might lead to uncovering what causes the performance penalties with NVMe vdevs.
@bwatkinson I've been looking at this as well, though not yet at the level of your investigation. We would like to use the NVMe as an L2ARC device, but when that didn't give us the performance we were expecting I started experimenting with the performance of the NVMe on its own. I'm also seeing only 50% or so of expected performance. I'm curious whether you've found any additional leads since your last post?
Something I've been meaning to test for quite a while on pools with ultra low-latency vdevs, but haven't had the time to do so yet, is to try various combinations of:
It would be interesting to see whether any of these sufficiently lower latency to measurably impact performance on these types of pools. See the second and third paragraphs in this comment for rationale. I'm mentioning it here as a suggestion to anyone currently performing this type of testing. |
@dweeezil I've been doing read testing with fio using ZFS on an NVMe, and it actually looks like read bandwidth dropped about 15-20% when I added those options to the boot config. I tried a couple of different combinations (including max_cstate=1) and performance either didn't change or dropped 15-20%.
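For readers following along, the boot-time options under discussion look roughly like the fragment below. The exact set tried above is not recorded in this thread, so these specific flags are assumptions, and intel_idle.* only applies to Intel CPUs:

```sh
# Example /etc/default/grub fragment
GRUB_CMDLINE_LINUX="... processor.max_cstate=1 intel_idle.max_cstate=0 idle=poll"
# Regenerate the config and reboot, e.g. on CentOS/RHEL:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```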
@dweeezil @joebauman I have tried multiple kernel settings myself and found they either had negligible effects or actually reduced ZFS NVMe read performance.

@joebauman I am still working on solving this issue. I looked into the space map allocation code to see if I could find anything that would be causing this issue, but nothing in particular stood out. I started hunting in this part of the source after discovering that NVMe read performance levels out completely when more than 2 vdevs are in a single ZPool. I am working on collecting individual timings between each of the ZIO pipeline stages at the moment to see if I can narrow down exactly where in the pipeline read requests are stalling out, leading to these low performance numbers. Hopefully I will be sharing some of those results soon, but I am working on getting a new testbed up and running before giving credence to any results I collect going forward.
IMHO, the jury is still out on polling; this is an area of relatively intense research as more polling drivers become available (DPDK, SPDK, et al.).
@bwatkinson Have you made any progress in finding the cause? How are the NVMe SSDs running for you now with the newer version of ZFS? |
@recklessnl we actually discovered that the bottleneck has to do with the overhead of memory copies. I am working with @behlendorf and @mmaybee on getting Direct IO to work in ZFS 0.8. The plan is to get things ironed out and make an official pull request against zfsonlinux/zfs master. I was planning on updating this ticket once an official pull request has been made.
@bwatkinson That's very interesting, thanks for sharing. Wasn't Direct I/O already added in ZFS 0.8? See #7823. Or have you found a bug with the current implementation?
@recklessnl the addition in 0.8 was just to accept the O_DIRECT flag; the I/O path for both reads and writes remained the same (i.e. they still travel through the ARC). If you run ZFS 0.8 and use Direct IO you will see the ARC is still in play (watch the output of arcstat). This means memory copies are still being performed, limiting performance. We are actually working on implementing Direct IO by mapping the user's pages into kernel space and reading/writing to them directly. This allows us to bypass the ARC altogether.
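A quick way to see this for yourself; the file path and job count are illustrative, and the script name varies by release (arcstat.py on 0.7/0.8 installs, arcstat on newer ones):

```sh
# Watch ARC activity while issuing O_DIRECT reads; on 0.8 the hit/miss counters
# still move even though --direct=1 was requested.
arcstat.py 1 &
fio --name=directread --filename=/tank/testfile --rw=read --bs=1M \
    --numjobs=4 --direct=1 --ioengine=psync --group_reporting
```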
Curious, why would you want to bypass the ARC? There are other problems if accessing through it is slower. Please never make that the default.
@h1z1 the ARC bypass is a side effect of the Direct IO support (although one could argue that it is implied by "direct"). The primary goal was to eliminate data copies to improve performance. Because the Direct IO path maps the user buffer directly into the kernel to avoid the data copy, there is no kernel copy of the data made so it cannot be cached in the ARC. That said, there are good reasons to avoid the ARC sometimes: If the data being read/written is not going to be accessed again (known from the application level), then avoiding the cache may be a benefit because other, more relevant, data can be kept in cache. Direct IO will not be the default. It must be explicitly requested from the application layer via the O_DIRECT flag. |
@h1z1 as @mmaybee said, this will not be the default behavior in ZFS going forward; it has to be explicitly requested from the application layer. There has been quite a bit of concern in this issue about bypassing the ARC or avoiding it altogether. I think it is also important to remember one of the key purposes of the ARC, which is to hide disk latency for read requests. That is what is unique about NVMe SSDs: the latency is no longer a giant bottleneck. Essentially, by trying to mask the latency of an NVMe device, we are actually inducing a much higher latency overhead. Direct IO reads from these devices easily outpace any performance gains from caching or prefetching reads. Direct IO will give ZFS another path for working with very low latency devices while still providing all the data protection, reduction, and consistency that one expects from ZFS.
Well, not quite that fast. Today, the most common NVMe devices write at 60-90 usec and read at 80-120 usec. That is 2-3 orders of magnitude slower than RAM. The bigger win will be more efficient lookup and block management in the ARC.
Any new info on this front? |
I guess I'll add myself more explicitly to this; I have a storage server that's all NVMe that I'd prefer to use ZFS on as well.
@fricmonkey, |
@bwatkinson Thanks for the update! With the state of zfs and nvme as it is now, I will likely be holding off putting the machine into our production environment. Whenever you have something ready to test, I'd be happy to be involved in testing and experimentation if one of you could help walk me through it. |
Just wanted to add that I've been seeing similar performance issues with spinny drives. I have tried numerous experiments with various file systems and ZFS tweaks on my hardware, including 8 WD Blue 6TB drives and a second set of 8 WD Gold 10TB drives (to eliminate the possibility of SMR being the culprit), but nothing has solved the problem. With 8 drives at 120-260 MB/s each, I would expect around 2 GB/s throughput in a striped configuration (or for mirrored reads), but I never get over around 1 GB/s. Almost exactly half the expected performance. I also suspect a software issue and found this bug in my search for possible answers. Anyway, plus one to this; I hope a fix here helps my setup as well! Would love to see better read performance from my ZFS NAS. Some details here: https://www.reddit.com/r/freenas/comments/fax1wl/reads_12_the_speed_of_writes/
@randomsamples it is unlikely that you are facing this specific problem with spinning disks. Rather, single-threaded sequential access does not scale linearly with stripe count, since in that case the only way to extract performance is via read-ahead. You can try increasing your read-ahead by tuning the relevant prefetch module parameter.
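A minimal sketch of this kind of prefetch tuning; zfetch_max_distance (the maximum read-ahead per stream) is named here only as a plausible candidate, since the specific parameter in the original suggestion was not preserved, and the value is illustrative:

```sh
# List the prefetch-related tunables currently in effect
grep -H . /sys/module/zfs/parameters/zfetch_*

# Raise the per-stream read-ahead distance to 64 MiB
echo $((64 * 1024 * 1024)) > /sys/module/zfs/parameters/zfetch_max_distance
```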
Hi @shodanshok, thank you for the tip. I did try this out today and saw a nice gain, but writes are still way faster than reads on my array, which I find quite surprising. FWIW this is 8 WD Gold 10 TB drives in 4 sets of mirrors; the drives will do up to 230 MB/s each, so I would expect about 2 GB/s reads and 1 GB/s writes in this configuration (raw device reads/writes demonstrate this). The dataset is standard sync, no compression, atime off, dedup off, 1M records, case insensitive (SMB share), and in this case I have a SLOG SSD and L2ARC attached (it's a working pool so I didn't want to change too much, but I used huge files to make sure cache is not making perf magic happen). I started with
Does your DirectIO still do zfs checksums? |
Checksums will still be supported; however, when checksums are enabled the buffer will be temporarily copied on writes but not cached. This will affect DirectIO requests going through the normal ZPL path (the write system call), but we can avoid the temporary memory copy in the case of DirectIO with Lustre. I have attached a link (https://docs.google.com/document/d/1C8AgqoRodxutvYIodH39J8hRQ2xcP9poELU6Z6CXmzY/edit#heading=h.r28mljyq7a2b) to the document @ahrens created discussing the semantics that we settled on for DirectIO in ZFS. I am currently in the process of updating PR #10018 to use these new semantics.
I want to suggest the following to the author of the topic: I actually ran into this problem on an old Samsung 2TB hard drive. Pure linear reading starts at 118 MB/s at the beginning of the disk; by the end of the disk it is down to 52 MB/s.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
We have a new server that's all NVMe (12 disks of 8TB each) that I'd like to use ZFS on for its on-the-fly compression, to squeeze more storage out of disks that are quite expensive. With ZFS we are considering one RaidZ2 vdev with all 12 disks, or 2 RaidZ1 vdevs in stripe mode. Our other alternative is hybrid RAID with VROC built into the Intel CPU, using LVM (2 raid-5 arrays in stripe mode) and XFS. We are on CentOS 7. So what is the current NVMe support state with ZFS on CentOS 7, in terms of performance, drive longevity (write amplification on SSD), and data integrity (the RAID write hole in case of power failure)? Thanks.
Writes are just as slow. Increasing zfs_dirty_data_max (4294967296 -> 10737418240 -> 21474836480 -> 42949672960) compensates for the performance penalties, but the background writing is still slow, at ~10k IOPS per NVMe device:
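The progression above corresponds to tuning a single runtime-writable module parameter, shown here as a minimal sketch:

```sh
# zfs_dirty_data_max progression: 4 GiB (default here) -> 10 -> 20 -> 40 GiB
echo 10737418240 > /sys/module/zfs/parameters/zfs_dirty_data_max
echo 21474836480 > /sys/module/zfs/parameters/zfs_dirty_data_max
echo 42949672960 > /sys/module/zfs/parameters/zfs_dirty_data_max
```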
After the test, background writing is still in progress:
Whereas a single NVMe device has a raw speed of 700k IOPS:
Also, increasing some further tunables increased IOPS:
Increasing Changing |
As I understand it, we are discussing performance on nvme ssd
Increasing
I used it, and I also increased it further. From these results, the conclusion suggests itself: ZFS groups 4 KB blocks into 30-50 KB blocks when writing to a device in raidz.
Increasing |
For example? (Note: I read your article about ZIO scheduling.)
Can you show exactly where in the code this might have an effect?
I created a mirror instead of raidz; same result without tuning:
and one more time:
Your test workload is a write-only workload with no overwrites. So you won't see the penalty of raising the _min. |
The file /mnt/zfs/g-fio.test is not recreated when running multiple tests (except when creating a new fs dataset). Each fio run overwrites the data in this file. The tests were run many times.
For those who have been working on this for years: is there some consensus on the best settings, or are they too configuration-specific to state? As I understand it, we should not expect ZFS to automatically set the best defaults based on what hardware is attached (HDD, SSD, NVMe), correct? I'm curious whether most ZFS devs are using ZFS on their PCs, considering all of their PCs are likely to have NVMe drives by now. Dogfooding can be inspiring :) I was reading through this today and found it interesting: https://zfsonlinux.topicbox.com/groups/zfs-discuss/T5122ffd3e191f75f/zfs-cache-speed-vs-os-cache-speed
Any updates? Thanks. |
@0xTomDaniel It is not exactly Direct IO, but a nice small improvement was just integrated: #14243. It can dramatically improve performance in the case of primarycache=metadata.
My current settings for servers with 2 TiB RAM and 2x 7.68 TB (SSDPE2KE076T8) NVMe drives in a mirrored pool:
|
System information
Describe the problem you're observing
We are currently seeing poor read performance with ZFS 0.7.12 with our Samsung PM1725a devices in a Dell PowerEdge R7425.
Describe how to reproduce the problem
Briefly describing our setup: we currently have four PM1725a devices attached to the PCIe root complex in NUMA domain 1 on an AMD EPYC 7401 processor. In order to measure the read throughput of ZFS, XFS, and the raw devices, the XDD tool was used, which is available at:
git@github.com:bwatkinson/xdd.git
In all cases I am presenting, kernel 4.18.20-100 was used and I disabled all CPUs not on socket 0 within the kernel. I also issued asynchronous sequential reads to the file systems/devices while pinning all XDD threads to NUMA domain 1 and socket 0's memory banks. I conducted four tests, measuring throughput for the raw devices, XFS, and ZFS 0.7.12. For the raw device tests, I had 6 I/O threads per device with request sizes of 1 MB and a total of 32 GB read from each device using Direct I/O. In the XFS case, I created a single XFS file system on each of the 4 devices; in each of the XFS file systems, I read a 32 GB file of random data using 6 I/O threads per file with request sizes of 1 MB, using Direct I/O. In the ZFS single-ZPool case, I created a single ZPool composed of 4 vdevs and read a 128 GB file of random data using 24 I/O threads with request sizes of 1 MB. In the ZFS multiple-ZPool case, I created 4 separate ZPools, each consisting of a single vdev; in each of the ZPools, I read a 32 GB file of random data using 6 I/O threads per file with request sizes of 1 MB. In both the single-ZPool and multiple-ZPool cases, I set the recordsize for all pools to 1 MB and set primarycache=none. We decided to disable the ARC in all cases because we were reading 128 GB of data, which was exactly equal to 2x the available memory on socket 0; even with the ARC enabled we saw no performance benefits. Below are the throughput measurements I collected for each of these cases.
Performance Results:
Raw Device - 12,746.724 MB/s
XFS Direct I/O - 12,734.633 MB/s
ZFS Single ZPool - 6,452.946 MB/s
ZFS Multiple ZPool - 6,344.915 MB/s
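For readers without XDD handy, an approximately equivalent harness for the single-ZPool case described above might look like the sketch below; the NUMA node number, mountpoint, and the substitution of fio for XDD are all assumptions:

```sh
# Pin the workload's CPUs and memory to one NUMA node, then issue
# 24 threads of 1 MiB sequential reads (~128 GiB aggregate)
numactl --cpunodebind=1 --membind=1 \
    fio --name=zfs-read --directory=/tank --rw=read --bs=1M --numjobs=24 \
        --size=5461m --ioengine=psync --group_reporting
```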
In order to try and solve what was cutting ZFS read performance in half, I generated flame graphs using the following tool:
http://www.brendangregg.com/flamegraphs.html
In general, I found most of the perf samples were occurring in zio_wait; it is in this call that io_schedule is called. Comparing the ZFS flame graphs to the XFS flame graphs, I found that the number of samples between submit_bio and io_schedule was significantly larger in the ZFS case. I decided to take timestamps of each call to io_schedule for both ZFS and XFS to measure the latency between the calls. Below is a link to histograms, as well as the total elapsed time in microseconds between io_schedule calls, for the tests I described above. In total I collected 110,000 timestamps; in the plotted data the first 10,000 timestamps were ignored to allow the file systems to reach a steady state.
https://drive.google.com/drive/folders/1t7tLXxyhurcxGjry4WYkw_wryh6eMqdh?usp=sharing
In general, ZFS has a significant latency between io_schedule calls. I have also verified that the output from iostat shows a larger r_await value for ZFS than for XFS in these tests. It seems ZFS is letting requests sit in the hardware queues longer than XFS and the raw devices do, causing a huge performance penalty for ZFS reads (effectively cutting the available device bandwidth in half).
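For reference, the iostat comparison can be reproduced with something like the following while each test is running (device names are illustrative; the relevant column is r_await):

```sh
iostat -xm 2 nvme0n1 nvme1n1 nvme2n1 nvme3n1
```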
In general, has this issue been noticed with NVMe SSDs and ZFS, and is there a current fix? If there is no current fix, is this issue being worked on?
Also, I have tried to duplicate the XFS results using ZFS 0.8.0-rc2 with Direct I/O, but the 0.8.0 read performance almost exactly matched ZFS 0.7.12 read performance without Direct I/O.
Include any warning/errors/backtraces from the system logs