
Split dmu_zfetch() speculation and execution parts. #11652

Merged (3 commits) on Mar 20, 2021

Conversation

@amotin (Member) commented Feb 25, 2021

To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads at the same time on completion, which can significantly
reorder the requests, making the stream look random. But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.

This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run(). The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued. It even
tracks the activity of other concurrent threads, issuing the prefetch
only when all on-demand requests are issued.

For many years this was a big problem for storage servers handling
deep request queues from their clients: they had to either serialize
consecutive reads to make the ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients. The benefits of those
approaches varied, but neither was perfect. With this patch, deeper-queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to a FreeBSD target show me much better throughput with an almost
100% prefetcher hit rate, compared to almost zero before.

While there, I also removed the per-stream zs_lock as useless, being completely
covered by the parent zf_lock. Also I reused the zs_blocks refcount to track
zf_stream linkage of the stream, since I believe the previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.
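
To illustrate the intended call pattern, here is a minimal, hypothetical sketch of a caller like dmu_buf_hold_array_by_dnode(); variable names such as read, missed, dbp and tag, and the exact argument lists of dmu_zfetch_prepare()/dmu_zfetch_run(), are assumptions rather than the literal code from this PR:

```c
/*
 * Hypothetical, simplified caller. Prediction happens before any
 * dbuf_hold() can sleep on indirect blocks; prefetch I/O is issued
 * only after all on-demand holds were started.
 */
zstream_t *zs;

/* 1. Update stats and make the prediction; no I/O is issued here. */
zs = dmu_zfetch_prepare(&dn->dn_zfetch, blkid, nblks, read, B_TRUE);

/* 2. Hold the on-demand dbufs; this is the part that may sleep. */
for (int i = 0; i < nblks; i++)
	dbp[i] = dbuf_hold(dn, blkid + i, tag);

/* 3. Only now issue the predicted prefetch I/O. */
if (zs != NULL)
	dmu_zfetch_run(&dn->dn_zfetch, zs, missed, B_TRUE);
```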

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Feb 25, 2021
@amotin amotin requested review from pcd1193182 and ahrens February 25, 2021 16:13
@amotin amotin self-assigned this Feb 25, 2021
@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Feb 25, 2021
@amotin (Member Author) commented Feb 28, 2021

I hit a few rare panics while testing this more deeply, so I've added some more locking.

@amotin (Member Author) commented Mar 1, 2021

While working in the speculative prefetcher area I got two more ideas,
pushed in a separate commit:

Delete prefetch streams when they reach the end of their files. It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.

Block data prefetch (speculation and indirect block prefetch are still
done since they are cheaper) if all dbufs of the stream are already
in the DMU cache. The first cache miss immediately fires all the prefetch
that would have been done for the stream by that time. It saves some CPU
time if the same files, fitting within DMU cache capacity, are read over and over.
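
A rough sketch of how the second idea can be wired into the caller; dbuf_was_cache_miss() is a made-up stand-in for however the caller detects a miss, and only the boolean gating is the point here:

```c
boolean_t missed = B_FALSE;

for (int i = 0; i < nblks; i++) {
	dbp[i] = dbuf_hold(dn, blkid + i, tag);
	if (dbuf_was_cache_miss(dbp[i]))	/* hypothetical helper */
		missed = B_TRUE;		/* first miss unlocks data prefetch */
}

/*
 * With missed == B_FALSE only speculation and indirect block prefetch
 * happen; the more expensive data prefetch is deferred until the stream
 * sees its first real miss, at which point everything accumulated for
 * the stream is issued at once.
 */
if (zs != NULL)
	dmu_zfetch_run(&dn->dn_zfetch, zs, missed, B_TRUE);
```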

@ahrens (Member) commented Mar 1, 2021

These ideas all sound good to me. I'll take a look at the code this week. Do you have any performance test results to show the impact of the locking changes, etc?

@tonynguien could we run our performance tests with this PR to determine if there's any impact on our test cases? I imagine the relevant ones would be sequential read and (maybe) write, but it would be good to run the whole zfs performance test suite to see if the change in CPU overhead has any effect.

@amotin (Member Author) commented Mar 1, 2021

Do you have any performance test results to show the impact of the locking changes, etc?

I did many tests (sequential and deep random) on zvols with 16KB volblocksize to stress it more, but haven't noticed a significant locking difference. If there is a prediction miss, then the overhead of the code is pretty low, and I'd sooner blame the stream reclamation code, getting the current time, using division and incrementing the global miss counter with atomics. We definitely should start using more SMP-aware counters like FreeBSD's counter(9) for statistics rather than atomics on global variables. If there is a prediction hit, then the zfetch code is barely visible compared to dbuf_prefetch_impl() and its descendants, which is why I've implemented skipping it in case of cache hits.
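
As a generic illustration of the kind of SMP-friendly counter meant here (a standalone userspace sketch; this is neither FreeBSD's counter(9) nor an existing OpenZFS API, and the caller is assumed to know its current CPU index):

```c
#include <stdint.h>

#define	NSLOTS	64	/* at least as many slots as CPUs we care about */

/* One cache-line-padded slot per CPU: no false sharing, no atomics. */
typedef struct percpu_counter {
	struct {
		uint64_t	val;
		char		pad[64 - sizeof (uint64_t)];
	} slot[NSLOTS];
} percpu_counter_t;

static inline void
percpu_counter_add(percpu_counter_t *c, unsigned cpu, uint64_t n)
{
	c->slot[cpu % NSLOTS].val += n;	/* hot path: plain store, no atomic */
}

static uint64_t
percpu_counter_read(const percpu_counter_t *c)
{
	uint64_t sum = 0;
	for (int i = 0; i < NSLOTS; i++)
		sum += c->slot[i].val;	/* cold path: approximate sum is fine for stats */
	return (sum);
}
```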

What we should definitely improve, I think, is the stream reclamation mechanism. With a fixed 2-second timeout for every stream created on a single random access and only 8 streams, it requires I/O to be extremely sequential for prefetch to work. That is OK for simple files, but very restrictive for file-backed block storage or, I guess, databases. But I haven't touched this area intentionally now, since it is endless. At the very least I think we could reclaim faster the streams that got no hits, or streams that saw no hits for a certain number of requests, to not depend on the arbitrary 2-second limit.

@amotin (Member Author) commented Mar 5, 2021

@ahrens polite ping.

@ahrens (Member) left a comment

Thanks for the review reminder

Comment on lines 329 to 330
if (nblks == 0)
goto done; /* Already prefetched this before. */
Member:

the goto here seems potentially fragile. Could we handle this by changing the if on line 319 to handle nblks==0 as well? i.e.

if (end_of_access_blkid >= maxblkid || nblks == 0) {
    ...
    return (NULL);
}

Member Author:

It would also require if (zs != NULL && end_of_access_blkid >= maxblkid), not helping readability. So I don't think so.

Member:

Even with that additional change, I think that avoiding the goto is worth it.

Member Author:

It would tangle the logic too much. I duplicated the code once more instead of using the goto.

@amotin amotin force-pushed the zfetch_prepare branch 2 times, most recently from 064a8a6 to 99ad035 on March 8, 2021 18:40
@amotin (Member Author) commented Mar 9, 2021

Any more comments?

@tonynguien (Contributor) commented:

These ideas all sound good to me. I'll take a look at the code this week. Do you have any performance test results to show the impact of the locking changes, etc?

@tonynguien could we run our performance tests with this PR to determine if there's any impact on our test cases? I imagine the relevant ones would be sequential read and (maybe) write, but it would be good to run the whole zfs performance test suite to see if the change in CPU overhead has any effect.

@ahrens
I'll work on this and post results in a day or two.

@tonynguien (Contributor) commented:

These ideas all sound good to me. I'll take a look at the code this week. Do you have any performance test results to show the impact of the locking changes, etc?
@tonynguien could we run our performance tests with this PR to determine if there's any impact on our test cases? I imagine the relevant ones would be sequential read and (maybe) write, but it would be good to run the whole zfs performance test suite to see if the change in CPU overhead has any effect.

@ahrens
I'll work on this and post results in a day or two.

We actually get a 20-25% gain in the cached reads workload. From what I can tell, the improvement comes from dmu_zfetch() no longer being called with this change. The other test results are within the margin of error.

Very nicely done and thanks for making this change @amotin!

| Local ZFS test | IO size KB | thread/fs cnt | baseline run1 | baseline run2 | dmu_zfetch run1 | dmu_zfetch run2 | %diff |
|---|---|---|---|---|---|---|---|
| random_reads | 8 | 16 | 75178 | 72871 | 70121 | 71092 | -4.62% |
| random_reads | 8 | 32 | 119197 | 118987 | 117356 | 116933 | -1.64% |
| random_writes | 8 | 32 | 35352 | 35065 | 34562 | 35315 | -0.77% |
| random_writes | 8 | 128 | 34391 | 34686 | 34553 | 34524 | 0.00% |
| random_writes ZIL | 9 | 16/16 | 29086 | 29069 | 28759 | 29188 | -0.36% |
| random_writes ZIL | 10 | 16/1 | 30825 | 30374 | 30865 | 30702 | 0.60% |
| random_writes ZIL | 11 | 1/1 | 5648 | 5590 | 5518 | 5578 | -1.26% |
| random_writes ZIL | 12 | 4/1 | 15571 | 15077 | 15575 | 15282 | 0.68% |
| random_writes ZIL | 13 | 4/4 | 16304 | 16284 | 16222 | 15857 | -1.56% |
| random_writes ZIL | 14 | 64/1 | 36394 | 36443 | 36359 | 36405 | -0.10% |
| random_writes ZIL | 15 | 64/64 | 26934 | 26770 | 26371 | 26538 | -1.48% |
| sequential_reads_arc_cached | 128 | 64 | 33930 | 34005 | 42722 | 42437 | 25.35% |
| sequential_reads_arc_cached | 128 | 128 | 33507 | 33547 | 42031 | 41966 | 25.27% |
| sequential_reads_arc_cached | 1024 | 64 | 3882 | 3889 | 4735 | 4725 | 21.73% |
| sequential_reads_arc_cached | 1024 | 128 | 3855 | 3839 | 4578 | 4679 | 20.31% |
| sequential_reads | 128 | 8 | 12286 | 12955 | 12081 | 12288 | -3.45% |
| sequential_reads | 128 | 16 | 13011 | 12955 | 13069 | 12771 | -0.49% |
| sequential_reads | 1024 | 8 | 1558 | 1543 | 1552 | 1541 | -0.26% |
| sequential_reads | 1024 | 16 | 1573 | 1573 | 1613 | 1611 | 2.48% |
| sequential_writes | 128 | 16 | 47535 | 47425 | 47396 | 47794 | 0.24% |
| sequential_writes | 128 | 32 | 62820 | 62171 | 62138 | 61694 | -0.93% |
| sequential_writes | 1024 | 16 | 6312 | 6308 | 6246 | 6205 | -1.34% |
| sequential_writes | 1024 | 32 | 6267 | 6192 | 6271 | 6192 | 0.03% |
| sequential_writes | 8 | 16 | 827 | 818 | 822 | 825 | 0.12% |
| sequential_writes | 8 | 32 | 824 | 813 | 811 | 799 | -1.65% |

@behlendorf (Contributor) left a comment

Very nice!

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 12, 2021
@adamdmoss (Contributor) commented Mar 12, 2021

I was giving this branch a try-out and I hit some Oopses pretty quickly doing a bunch of heavy parallel reads (mostly from ARC):

[  190.710552] BUG: kernel NULL pointer dereference, address: 00000000000000a8
[  190.710556] #PF: supervisor write access in kernel mode
[  190.710557] #PF: error_code(0x0002) - not-present page
[  190.710558] PGD 0 P4D 0 
[  190.710560] Oops: 0002 [#4] SMP NOPTI
[  190.710562] CPU: 0 PID: 9469 Comm: cp Tainted: P      D    OE     5.4.0-58-generic #64~18.04.1-Ubuntu
[  190.710563] Hardware name: Gigabyte Technology Co., Ltd. Z68MA-D2H-B3/Z68MA-D2H-B3, BIOS F10 02/23/2012
[  190.710568] RIP: 0010:mutex_lock+0x1d/0x40
[  190.710569] Code: 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 9e e6 ff ff 65 48 8b 14 25 c0 6b 01 00 31 c0 <f0> 48 0f b1 13 75 03 5b 5d c3 48 89 df e8 b1 ff ff ff 5b 5d c3 0f
[  190.710570] RSP: 0018:ffffb9d9733abc10 EFLAGS: 00010246
[  190.710572] RAX: 0000000000000000 RBX: 00000000000000a8 RCX: 0000000000000000
[  190.710573] RDX: ffff9bc1a1a51740 RSI: ffffffffffffffff RDI: 00000000000000a8
[  190.710573] RBP: ffffb9d9733abc18 R08: 00000000801a0018 R09: ffffffffc25d3201
[  190.710574] R10: ffffb9d9733abb48 R11: 0000000000000000 R12: 00000000000000a8
[  190.710575] R13: 0000000000000000 R14: ffff9bc1511a0a80 R15: 0000000000000008
[  190.710576] FS:  00007f867c7e0800(0000) GS:ffff9bc2efe00000(0000) knlGS:0000000000000000
[  190.710577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  190.710578] CR2: 00000000000000a8 CR3: 0000000269562004 CR4: 00000000000606f0
[  190.710579] Call Trace:
[  190.710660]  dmu_buf_hold_array_by_dnode+0x3a5/0x500 [zfs]
[  190.710689]  dmu_read_uio_dnode+0x49/0xf0 [zfs]
[  190.710692]  ? _cond_resched+0x27/0x40
[  190.710719]  dmu_read_uio_dbuf+0x45/0x60 [zfs]
[  190.710763]  zfs_read+0x139/0x3b0 [zfs]
[  190.710805]  zpl_iter_read+0xe0/0x180 [zfs]
[  190.710808]  new_sync_read+0x122/0x1b0
[  190.710811]  __vfs_read+0x29/0x40
[  190.710812]  vfs_read+0x8e/0x130
[  190.710813]  ksys_read+0xa7/0xe0
[  190.710815]  __x64_sys_read+0x1a/0x20
[  190.710818]  do_syscall_64+0x57/0x190
[  190.710820]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  190.710822] RIP: 0033:0x7f867bceb151
[  190.710824] Code: fe ff ff 48 8d 3d 3f 9b 0a 00 48 83 ec 08 e8 66 4a 02 00 66 0f 1f 44 00 00 48 8d 05 91 08 2e 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  190.710825] RSP: 002b:00007ffe28eb90a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  190.710826] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f867bceb151
[  190.710827] RDX: 0000000000020000 RSI: 00007f867c809000 RDI: 0000000000000003
[  190.710828] RBP: 0000000000000000 R08: 0000000000020000 R09: 0000000000000000
[  190.710829] R10: 0000000000020000 R11: 0000000000000246 R12: 0000000028eb9700
[  190.710830] R13: 0000000000000000 R14: 00007f867c829000 R15: 0000000028eb9700
[  190.710832] Modules linked in: nls_iso8859_1 wireguard ip6_udp_tunnel udp_tunnel iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter snd_hrtimer rfcomm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_lpss_acpi intel_lpss idma64 virt_dma aufs overlay vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) bnep dm_crypt algif_skcipher af_alg binfmt_misc intel_rapl_msr zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) gpio_ich mei_hdcp intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate btusb nvidia_drm(POE) nvidia_modeset(POE) btrtl btbcm btintel mxm_wmi serio_raw joydev nvidia(POE) bluetooth ecdh_generic ecc input_leds xpad(OE) ff_memless snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mac_hid snd_hda_core snd_hwdep
[  190.710872]  snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich soundcore mei_me mei sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage i915 video i2c_algo_bit nvme drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt r8169 fb_sys_fops nvme_core realtek drm wmi
[  190.710903] CR2: 00000000000000a8
[  190.710907] ---[ end trace 11ed2c922e582238 ]---
[  190.710910] RIP: 0010:mutex_lock+0x1d/0x40
[  190.710912] Code: 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 9e e6 ff ff 65 48 8b 14 25 c0 6b 01 00 31 c0 <f0> 48 0f b1 13 75 03 5b 5d c3 48 89 df e8 b1 ff ff ff 5b 5d c3 0f
[  190.710913] RSP: 0018:ffffb9d971f3fc10 EFLAGS: 00010246
[  190.710914] RAX: 0000000000000000 RBX: 00000000000000a8 RCX: 0000000000000000
[  190.710915] RDX: ffff9bc1bc791740 RSI: ffffffffffffffff RDI: 00000000000000a8
[  190.710916] RBP: ffffb9d971f3fc18 R08: 0000000000000000 R09: ffffffffc25d3201
[  190.710916] R10: ffffb9d971f3fb48 R11: 0000000000000000 R12: 00000000000000a8
[  190.710917] R13: 0000000000000000 R14: ffff9bc1511a0b80 R15: 0000000000000008
[  190.710918] FS:  00007f867c7e0800(0000) GS:ffff9bc2efe00000(0000) knlGS:0000000000000000
[  190.710919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  190.710920] CR2: 00000000000000a8 CR3: 0000000269562004 CR4: 00000000000606f0
[  190.887511] ------------[ cut here ]------------
[  190.887515] kernel BUG at /build/linux-hwe-5.4-Y7FW4I/linux-hwe-5.4-5.4.0/mm/slub.c:306!
[  190.887519] invalid opcode: 0000 [#5] SMP NOPTI
[  190.887522] CPU: 3 PID: 4722 Comm: z_rd_int Tainted: P      D    OE     5.4.0-58-generic #64~18.04.1-Ubuntu
[  190.887523] Hardware name: Gigabyte Technology Co., Ltd. Z68MA-D2H-B3/Z68MA-D2H-B3, BIOS F10 02/23/2012
[  190.887526] RIP: 0010:kfree+0x230/0x240
[  190.887528] Code: fe ff ff 41 0f b6 5c 24 51 eb ad 4d 89 e9 41 b8 01 00 00 00 48 89 d9 48 89 da 4c 89 e6 4c 89 f7 e8 e5 f2 ff ff e9 d3 fe ff ff <0f> 0b 4c 8b 25 c7 ce 36 01 e9 04 fe ff ff 66 90 66 66 66 66 90 4c
[  190.887529] RSP: 0018:ffffb9d94d9a7bc0 EFLAGS: 00010246
[  190.887530] RAX: ffff9bc2c0bb1e00 RBX: ffff9bc2c0bb1e00 RCX: ffff9bc2c0bb1e00
[  190.887531] RDX: 000000000001994a RSI: 0000000000000068 RDI: ffff9bc2c0bb1e00
[  190.887532] RBP: ffffb9d94d9a7be0 R08: ffff9bc11b5ce4e0 R09: 0000000000000000
[  190.887533] R10: ffff9bc0f8c2900d R11: 0000000000000000 R12: ffffe2429002ec40
[  190.887534] R13: ffffffffc25d23d3 R14: ffff9bc2ee0072c0 R15: ffff9bc1200ace00
[  190.887535] FS:  0000000000000000(0000) GS:ffff9bc2efec0000(0000) knlGS:0000000000000000
[  190.887536] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  190.887537] CR2: 0000555e4f247000 CR3: 0000000347a0a006 CR4: 00000000000606e0
[  190.887538] Call Trace:
[  190.887546]  spl_kmem_free+0x33/0x40 [spl]
[  190.887600]  dmu_zfetch_stream_done+0x61/0xd0 [zfs]
[  190.887624]  ? arc_buf_alloc_impl.isra.48+0x284/0x350 [zfs]
[  190.887649]  dbuf_prefetch_fini+0x1f/0x30 [zfs]
[  190.887674]  dbuf_issue_final_prefetch_done+0x1f/0x40 [zfs]
[  190.887697]  arc_read_done+0x26f/0x490 [zfs]
[  190.887720]  l2arc_read_done+0x987/0xb80 [zfs]
[  190.887721]  ? kmem_cache_free+0x294/0x2b0
[  190.887722]  ? kmem_cache_free+0x294/0x2b0
[  190.887726]  ? _cond_resched+0x19/0x40
[  190.887727]  ? mutex_lock+0x12/0x40
[  190.887770]  zio_done+0x43f/0x1020 [zfs]
[  190.887813]  zio_execute+0x99/0xf0 [zfs]
[  190.887817]  taskq_thread+0x2fc/0x4e0 [spl]
[  190.887820]  ? wake_up_q+0x80/0x80
[  190.887862]  ? zio_taskq_member.isra.12.constprop.18+0x70/0x70 [zfs]
[  190.887865]  kthread+0x121/0x140
[  190.887869]  ? task_done+0xb0/0xb0 [spl]
[  190.887870]  ? kthread_park+0x90/0x90
[  190.887872]  ret_from_fork+0x1f/0x40
[  190.887873] Modules linked in: nls_iso8859_1 wireguard ip6_udp_tunnel udp_tunnel iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter snd_hrtimer rfcomm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_lpss_acpi intel_lpss idma64 virt_dma aufs overlay vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) bnep dm_crypt algif_skcipher af_alg binfmt_misc intel_rapl_msr zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) gpio_ich mei_hdcp intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate btusb nvidia_drm(POE) nvidia_modeset(POE) btrtl btbcm btintel mxm_wmi serio_raw joydev nvidia(POE) bluetooth ecdh_generic ecc input_leds xpad(OE) ff_memless snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mac_hid snd_hda_core snd_hwdep
[  190.887898]  snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich soundcore mei_me mei sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage i915 video i2c_algo_bit nvme drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt r8169 fb_sys_fops nvme_core realtek drm wmi
[  190.887917] ---[ end trace 11ed2c922e582239 ]---

x86_64, Linux version 5.4.0-58-generic (buildd@lgw01-amd64-040) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #64~18.04.1-Ubuntu SMP Wed Dec 9 17:11:11 UTC 2020

@amotin (Member Author) commented Mar 12, 2021

@adamdmoss I suppose it starts working fine if you disable prefetch? And if you revert that single commit it is also working fine? Debug build does not give anything interesting? I see l2arc_read_done() call, are you using L2ARC? Can you say anything more about your setup/workload?

@adamdmoss (Contributor) commented Mar 12, 2021

@adamdmoss I suppose it starts working fine if you disable prefetch? And if you revert that single commit it is also working fine? Debug build does not give anything interesting? I see l2arc_read_done() call, are you using L2ARC? Can you say anything more about your setup/workload?

Haven't tried disabling prefetch or doing a debug build, though yes, if I undo this merge then the same case appears to work fine. I am indeed using L2ARC. Not sure what else is terribly interesting about this setup (happy to answer specifics).
The test case is to read 9 x 1GB files at the same time (4 physical cores); the files in this dataset happen to be zstd-8 compressed with an 8k recordsize and compress at around 2:1. Not sure if any of those variables are relevant; this was just the first benchmark I tried, which I had used for other purposes to compare optimizations.

@amotin (Member Author) commented Mar 13, 2021

@adamdmoss I appreciate your report, but I haven't seen anything like this myself on FreeBSD, despite running different tests for a couple of weeks now. I've just tried to add L2ARC, but it hasn't caused anything so far. I'll continue the investigation, but I'd appreciate it if you could try some variations to collect more data. I may even guess that the problem is not in my code, but for example in an incorrect number of callbacks from dbuf_prefetch_impl() or something like that, causing a use-after-free, which may be triggered by some specific code path; that is why I thought about your L2ARC.

@amotin (Member Author) commented Mar 15, 2021

I've added some more assertions in a separate commit to catch some hypothetical stream use-after-free cases.

@adamdmoss It would be good to try the code built with debug to see whether it gives anything more useful. I am still unable to reproduce any issues after running a mix of sequential and random reads from 9 compressible files with zstd-8 compression and 8KB recordsize, with partial ARC and L2ARC hits, for most of the weekend, so I'd appreciate either more information/scripts to reproduce the issue or some more debugging input.

@amotin (Member Author) commented Mar 17, 2021

@adamdmoss The assertion I've added in dmu_zfetch_stream_fini() is supposed to explicitly catch the case of a stream being freed while it is still on the zf_stream list. I'd ask you to make sure that you do have the module built with ZFS_DEBUG/--enable-debug, or whatever the absence of NDEBUG is called on Linux.

@adamdmoss (Contributor) commented:

I'd ask you to make sure that you do have the module built with ZFS_DEBUG/--enable-debug, or whatever the absence of NDEBUG is called on Linux.

I suspect you're right, and even though I'm specifying ./configure --enable-debug etc., those config options aren't making it all the way through to the final DKMS build. I'll try again in my VM where I don't use an srpm/DKMS config and asserts definitely work.

@adamdmoss (Contributor) commented:

I was having trouble repro'ing any issue on my VM until I raised /sys/module/zfs/parameters/zfetch_max_distance from ~8MB to 1GB (!); then it started hitting this assertion repeatedly:

[ 2236.349261] VERIFY3(ipf_start <= ipf_end) failed (547606 <= 433728)
[ 2236.349265] PANIC at dmu_zfetch.c:447:dmu_zfetch_run()
[ 2236.349266] Showing stack for process 765013
[ 2236.349269] CPU: 2 PID: 765013 Comm: cp Tainted: P           OE     5.8.0-45-generic #51~20.04.1-Ubuntu
[ 2236.349270] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 2236.349271] Call Trace:
[ 2236.349278]  dump_stack+0x74/0x92
[ 2236.349289]  spl_dumpstack+0x29/0x2b [spl]
[ 2236.349294]  spl_panic+0xd4/0xfc [spl]
[ 2236.349362]  ? rwsem_is_locked+0x9/0x20 [zfs]
[ 2236.349465]  ? queued_spin_unlock+0x9/0x10 [zfs]
[ 2236.349517]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 2236.349567]  ? __raw_spin_unlock+0x9/0x10 [zfs]
[ 2236.349616]  ? zfs_refcount_remove_many+0x116/0x1d0 [zfs]
[ 2236.349659]  dmu_zfetch_run+0x348/0x3b0 [zfs]
[ 2236.349699]  dmu_buf_hold_array_by_dnode+0x187/0x640 [zfs]
[ 2236.349738]  dmu_read_uio_dnode+0x4b/0x130 [zfs]
[ 2236.349798]  ? queued_spin_unlock+0x9/0x10 [zfs]
[ 2236.349855]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 2236.349894]  dmu_read_uio_dbuf+0x47/0x60 [zfs]
[ 2236.349951]  zfs_read+0x136/0x3a0 [zfs]
[ 2236.350009]  zpl_iter_read+0xa3/0x110 [zfs]
[ 2236.350012]  new_sync_read+0x10c/0x1a0
[ 2236.350014]  vfs_read+0x161/0x190
[ 2236.350015]  ksys_read+0x67/0xe0
[ 2236.350017]  __x64_sys_read+0x1a/0x20
[ 2236.350019]  do_syscall_64+0x49/0xc0
[ 2236.350021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2236.350023] RIP: 0033:0x7fd3272c0142

ipf_start = MAX(zs->zs_pf_blkid1, zs->zs_ipf_blkid1);
ipf_end = zs->zs_ipf_blkid1 = zs->zs_ipf_blkid;
mutex_exit(&zf->zf_lock);
ASSERT3S(pf_start, <=, pf_end);
ASSERT3S(ipf_start, <=, ipf_end);

^ this is the assert

epbs = zf->zf_dnode->dn_indblkshift - SPA_BLKPTRSHIFT;
ipf_start = P2ROUNDUP(ipf_start, 1 << epbs) >> epbs;
ipf_end = P2ROUNDUP(ipf_end, 1 << epbs) >> epbs;
ASSERT3S(ipf_start, <=, ipf_end);
issued = pf_end - pf_start + ipf_end - ipf_start;
	if (issued > 1) {
		/* More references on top of taken in dmu_zfetch_prepare(). */
		zfs_refcount_add_many(&zs->zs_refs, issued - 1, NULL);

^ if ipf_start > ipf_end then ipf_end - ipf_start is negative, though pf_end - pf_start might still be positive, and so pf_end - pf_start + ipf_end - ipf_start might also still be > 1, meaning that we're doing zfs_refcount_add_many() with fewer refs than there will be dbuf_prefetch_impl() calls.

(That logic itself isn't necessarily wrong, but I wanted to explain how ipf_start>ipf_end would end up taking too few refs - hence a premature destruction of zs.)
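
For example (hypothetical numbers): with pf_start = 100 and pf_end = 116 but ipf_start = 50 and ipf_end = 40, issued = (116 - 100) + (40 - 50) = 6, so only 5 extra references are added on top of the one taken in dmu_zfetch_prepare() (6 total), while the data prefetch loop can still start 16 dbuf_prefetch_impl() calls, leaving the stream's refcount 10 short.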

Comment on lines +52 to +53
uint64_t zs_pf_blkid1; /* first block to prefetch */
uint64_t zs_pf_blkid; /* block to prefetch up to */
Contributor:

Not a fan of the blkid1->blkid naming convention for a range; how about something like zs_pf_first_blkid and zs_pf_last_blkid ?

} else {
pf_start = pf_end = 0;
}
ipf_start = MAX(zs->zs_pf_blkid1, zs->zs_ipf_blkid1);
Contributor:

Could you explain why this ipf_start makes sense, please?
I notice the original dmu_zfetch() code has the same line, but there zs_pf_blkid has already been rewritten with zs->zs_pf_blkid = pf_start + pf_nblks first.

Member Author:

The idea here is that we do not need indirect block prefetch for blocks for which we are prefetching the data. But it can be that we've already prefetched some indirect blocks ahead, and we don't need to prefetch them again. That is why there is a MAX(). It is done here, slightly duplicating the logic, because we do not reach this code for every dmu_zfetch_prepare() call, so the code makes some adjustment to decide for which blocks to still fetch metadata and for which to prefetch data.

@amotin (Member Author) commented Mar 18, 2021

I was having trouble repro'ing any issue on my VM until I raised /sys/module/zfs/parameters/zfetch_max_distance from ~8MB to 1GB (!); then it started hitting this assertion repeatedly:

Do you have tunings like that (or any) applied on your original test system reproducing the problem? That would totally explain the issue. It makes no sense to have data prefetch go further than indirect block prefetch, and I think it causes a negative max_blks for indirect blocks, so zs->zs_ipf_blkid starts to go backward from its previous position instead of forward.
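
As a concrete illustration (hypothetical numbers): with zfetch_max_distance raised to 1 GB while zfetch_max_idistance stays at its much smaller default, the data prefetch front may be placed roughly 1 GB ahead of the demand position while the indirect block prefetch limit is still computed only tens of MB ahead of that same position, so the newly computed zs_ipf_blkid can land behind the previously saved zs_ipf_blkid1 and the range runs backward.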

I wanted to explain how ipf_start>ipf_end would end up taking too few refs - hence a premature destruction of zs.

Yes. That is exactly the reason I've added those assertions. I hadn't anticipated the arithmetic trick you are showing, expecting some race condition instead, but this is interesting too. I'll take another look at the prefetch distance math tomorrow, but if you are using some prefetcher tunables on your original system, could you please remove them and confirm that the problem goes away? Or at very least set zfetch_max_idistance bigger than zfetch_max_distance.

@adamdmoss (Contributor) commented Mar 18, 2021

Do you have tunings like that (or any) applied on your original test system reproducing the problem?

I do! Though much less extreme. :)

Or at very least set zfetch_max_idistance bigger than zfetch_max_distance.

I'm testing that next reboot - idistance was indeed defaulting to less than my tuned zfetch_max_distance. (In my defense, idistance didn't exist as a tunable when I instituted the tuning. ;) )

@adamdmoss (Contributor) commented:

That is exactly the reason I've added those assertions

Good thinking then..! Would you mind turning this ASSERT() (or at least one other good diagnostic assert on this problem path) into a VERIFY()? It'd prevent a fumbled tuning from cascading into (more-or-less) memory corruption on a non-debug system.
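
For context on that request (both macro families exist in the ZFS code base): ASSERT* checks are compiled out of non-debug builds, while VERIFY* checks are always evaluated and panic on failure, so the latter would stop a bad tunable combination before it can turn into memory corruption. A minimal sketch:

```c
ASSERT3S(ipf_start, <=, ipf_end);	/* no-op unless built with --enable-debug */
VERIFY3S(ipf_start, <=, ipf_end);	/* always checked; panics if it fails */
```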

@adamdmoss (Contributor) commented:

(Assuming there's not just a simpler better way to preserve good behavior across all cases anyway!)

@adamdmoss (Contributor) commented:

Or at very least set zfetch_max_idistance bigger than zfetch_max_distance.

I've verified - that does solve it!

This removes the dependency between the data and metadata prefetch distance
tunables, which previously allowed data prefetch to go beyond metadata prefetch;
that made no sense and also caused a stream reference counting problem.

Signed-off-by: Alexander Motin <[email protected]>
@amotin (Member Author) commented Mar 18, 2021

@adamdmoss I've just committed a one-line patch, making the maximum indirect block prefetch distance be measured from the last data prefetch block instead of the demand read position. It should untie the two tunables, solving this problem. I'd appreciate it if you could test it.
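
A sketch of the kind of change being described (hypothetical variable names; this is not the literal diff from the PR):

```c
/*
 * Before (sketch): indirect block prefetch distance measured from the
 * demand read position, so a large zfetch_max_distance could push the
 * data prefetch front past the indirect prefetch front.
 */
zs->zs_ipf_blkid = end_of_access_blkid + max_idist_blks;

/*
 * After (sketch): measured from the last data prefetch block instead,
 * so the indirect prefetch front can never fall behind the data front,
 * whatever the two tunables are set to.
 */
zs->zs_ipf_blkid = zs->zs_pf_blkid + max_idist_blks;
```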

@adamdmoss (Contributor) commented:

@adamdmoss I've just committed a one-line patch, making the maximum indirect block prefetch distance be measured from the last data prefetch block instead of the demand read position. It should untie the two tunables, solving this problem. I'd appreciate it if you could test it.

Great, thanks, I'll give it a shot.

@adamdmoss (Contributor) commented:

0b65542 seems robust!

@behlendorf (Contributor) commented:

@amotin @adamdmoss nice job getting to the root cause and fixing it. Then this should be ready to merge.

@amotin (Member Author) commented Mar 19, 2021

Then this should be ready to merge.

@behlendorf Yes, I am satisfied with the code at this stage. In the future we may think about how to move dmu_zfetch_prepare() even earlier for writes, since right now we read indirect blocks before opening the transaction, so the prefetcher still sees significant reordering (or maybe axe out that double read?), but that's for next time.

@behlendorf behlendorf merged commit 891568c into openzfs:master Mar 20, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads same time on completion, that can significantly
reorder the requests, making the stream look like random.  But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.

This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run().  The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os.  The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued.  It even
tracks the activity of other concurrent threads, issuing the prefetch
only when _all_ on-demand requests are issued.

For many years it was a big problem for storage servers, handling
deeper request queues from their clients, having to either serialize
consequential reads to make ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients.  Benefits of those ways
varied, but neither was perfect.  With this patch deeper queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to FreeBSD target show me much better throughput with almost
100% prefetcher hit rate, comparing to almost zero before.

While there, I also removed per-stream zs_lock as useless, completely
covered by parent zf_lock.  Also I reused zs_blocks refcount to track
zf_stream linkage of the stream, since I believe previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.

Delete prefetch streams when they reach ends of files.  It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.

Block data prefetch (speculation and indirect block prefetch is still
done since they are cheaper) if all dbufs of the stream are already
in DMU cache.  First cache miss immediately fires all the prefetch
that would be done for the stream by that time.  It saves some CPU
time if same files within DMU cache capacity are read over and over.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Adam Moss <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes openzfs#11652
ghost pushed a commit to truenas/zfs that referenced this pull request Apr 5, 2021
adamdmoss pushed a commit to adamdmoss/zfs that referenced this pull request Apr 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 7, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 19, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 25, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 27, 2021
@amotin amotin deleted the zfetch_prepare branch August 24, 2021 20:18