
Split dmu_zfetch() speculation and execution parts. #11652

Merged (3 commits) on Mar 20, 2021

Conversation

@amotin (Member) commented Feb 25, 2021

To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads at the same time on completion, which can significantly
reorder the requests, making the stream look random. But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.

This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run(). The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued. It even
tracks the activity of other concurrent threads, issuing the prefetch
only when all on-demand requests are issued.

For many years this was a big problem for storage servers handling
deep request queues from their clients: they had to either serialize
consecutive reads to make the ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients. The benefits of those
approaches varied, but neither was perfect. With this patch, deeper-queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to a FreeBSD target show me much better throughput with an almost
100% prefetcher hit rate, compared to almost zero before.

While there, I also removed the per-stream zs_lock as useless, being completely
covered by the parent zf_lock. Also I reused the zs_blocks refcount to track
zf_stream linkage of the stream, since I believe the previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.
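
To illustrate the intended call pattern, here is a minimal, hypothetical sketch of a caller like dmu_buf_hold_array_by_dnode(); variable names such as read, missed, dbp and tag, and the exact argument lists of dmu_zfetch_prepare()/dmu_zfetch_run(), are assumptions rather than the literal code from this PR:

```c
/*
 * Hypothetical, simplified caller. Prediction happens before any
 * dbuf_hold() can sleep on indirect blocks; prefetch I/O is issued
 * only after all on-demand holds were started.
 */
zstream_t *zs;

/* 1. Update stats and make the prediction; no I/O is issued here. */
zs = dmu_zfetch_prepare(&dn->dn_zfetch, blkid, nblks, read, B_TRUE);

/* 2. Hold the on-demand dbufs; this is the part that may sleep. */
for (int i = 0; i < nblks; i++)
	dbp[i] = dbuf_hold(dn, blkid + i, tag);

/* 3. Only now issue the predicted prefetch I/O. */
if (zs != NULL)
	dmu_zfetch_run(&dn->dn_zfetch, zs, missed, B_TRUE);
```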

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Feb 25, 2021
@amotin amotin requested review from pcd1193182 and ahrens February 25, 2021 16:13
@amotin amotin self-assigned this Feb 25, 2021
@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Feb 25, 2021
@amotin (Member Author) commented Feb 28, 2021

I hit a few rare panics while testing this more deeply, so I've added some more locking.

@amotin (Member Author) commented Mar 1, 2021

While working in the speculative prefetcher area I got two more ideas,
pushed in a separate commit:

Delete prefetch streams when they reach the end of their files. It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.

Block data prefetch (speculation and indirect block prefetch are still
done since they are cheaper) if all dbufs of the stream are already
in the DMU cache. The first cache miss immediately fires all the prefetch
that would have been done for the stream by that time. It saves some CPU
time if the same files, fitting within DMU cache capacity, are read over and over.
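
A rough sketch of how the second idea can be wired into the caller; dbuf_was_cache_miss() is a made-up stand-in for however the caller detects a miss, and only the boolean gating is the point here:

```c
boolean_t missed = B_FALSE;

for (int i = 0; i < nblks; i++) {
	dbp[i] = dbuf_hold(dn, blkid + i, tag);
	if (dbuf_was_cache_miss(dbp[i]))	/* hypothetical helper */
		missed = B_TRUE;		/* first miss unlocks data prefetch */
}

/*
 * With missed == B_FALSE only speculation and indirect block prefetch
 * happen; the more expensive data prefetch is deferred until the stream
 * sees its first real miss, at which point everything accumulated for
 * the stream is issued at once.
 */
if (zs != NULL)
	dmu_zfetch_run(&dn->dn_zfetch, zs, missed, B_TRUE);
```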

@ahrens (Member) commented Mar 1, 2021

These ideas all sound good to me. I'll take a look at the code this week. Do you have any performance test results to show the impact of the locking changes, etc?

@tonynguien could we run our performance tests with this PR to determine if there's any impact on our test cases? I imagine the relevant ones would be sequential read and (maybe) write, but it would be good to run the whole zfs performance test suite to see if the change in CPU overhead has any effect.

@amotin (Member Author) commented Mar 1, 2021

Do you have any performance test results to show the impact of the locking changes, etc?

I did many tests (sequential and deep random) on zvols with 16KB volblocksize to stress it more, but haven't noticed a significant locking difference. If there is a prediction miss, then the overhead of the code is pretty low, and I'd sooner blame the stream reclamation code, getting the current time, using division and incrementing the global miss counter with atomics. We definitely should start using more SMP-aware counters like FreeBSD's counter(9) for statistics rather than atomics on global variables. If there is a prediction hit, then the zfetch code is barely visible compared to dbuf_prefetch_impl() and its descendants, which is why I've implemented skipping it in case of cache hits.
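
As a generic illustration of the kind of SMP-friendly counter meant here (a standalone userspace sketch; this is neither FreeBSD's counter(9) nor an existing OpenZFS API, and the caller is assumed to know its current CPU index):

```c
#include <stdint.h>

#define	NSLOTS	64	/* at least as many slots as CPUs we care about */

/* One cache-line-padded slot per CPU: no false sharing, no atomics. */
typedef struct percpu_counter {
	struct {
		uint64_t	val;
		char		pad[64 - sizeof (uint64_t)];
	} slot[NSLOTS];
} percpu_counter_t;

static inline void
percpu_counter_add(percpu_counter_t *c, unsigned cpu, uint64_t n)
{
	c->slot[cpu % NSLOTS].val += n;	/* hot path: plain store, no atomic */
}

static uint64_t
percpu_counter_read(const percpu_counter_t *c)
{
	uint64_t sum = 0;
	for (int i = 0; i < NSLOTS; i++)
		sum += c->slot[i].val;	/* cold path: approximate sum is fine for stats */
	return (sum);
}
```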

What we should definitely improve, I think, is the stream reclamation mechanism. With a fixed 2-second timeout for every stream created on a single random access and only 8 streams, it requires I/O to be extremely sequential for prefetch to work. That is OK for simple files, but very restrictive for file-backed block storage or, I guess, databases. But I haven't touched this area intentionally now, since it is endless. At the very least I think we could reclaim faster the streams that got no hits, or streams that saw no hits for a certain number of requests, to not depend on the arbitrary 2-second limit.

@amotin (Member Author) commented Mar 5, 2021

@ahrens polite ping.

@ahrens (Member) left a comment

Thanks for the review reminder

Comment on lines 329 to 330
if (nblks == 0)
goto done; /* Already prefetched this before. */
Member:

the goto here seems potentially fragile. Could we handle this by changing the if on line 319 to handle nblks==0 as well? i.e.

if (end_of_access_blkid >= maxblkid || nblks == 0) {
    ...
    return (NULL);
}

Member Author:

It would also require if (zs != NULL && end_of_access_blkid >= maxblkid), not helping readability. So I don't think so.

Member:

Even with that additional change, I think that avoiding the goto is worth it.

Member Author:

It would tangle the logic too much. I duplicated the code once more instead of using the goto.

@amotin amotin force-pushed the zfetch_prepare branch 2 times, most recently from 064a8a6 to 99ad035 on March 8, 2021 18:40
@amotin (Member Author) commented Mar 9, 2021

Any more comments?

@tonynguien (Contributor) commented:

These ideas all sound good to me. I'll take a look at the code this week. Do you have any performance test results to show the impact of the locking changes, etc?

@tonynguien could we run our performance tests with this PR to determine if there's any impact on our test cases? I imagine the relevant ones would be sequential read and (maybe) write, but it would be good to run the whole zfs performance test suite to see if the change in CPU overhead has any effect.

@ahrens
I'll work on this and post results in a day or two.

@tonynguien (Contributor) commented:

These ideas all sound good to me. I'll take a look at the code this week. Do you have any performance test results to show the impact of the locking changes, etc?
@tonynguien could we run our performance tests with this PR to determine if there's any impact on our test cases? I imagine the relevant ones would be sequential read and (maybe) write, but it would be good to run the whole zfs performance test suite to see if the change in CPU overhead has any effect.

@ahrens
I'll work on this and post results in a day or two.

We actually get a 20-25% gain in the cached reads workload. From what I can tell, the improvement comes from dmu_zfetch() no longer being called with this change. The other test results are within the margin of error.

Very nicely done and thanks for making this change @amotin!

| Local ZFS test | IO size KB | thread/fs cnt | baseline run1 | baseline run2 | dmu_zfetch run1 | dmu_zfetch run2 | %diff |
|---|---|---|---|---|---|---|---|
| random_reads | 8 | 16 | 75178 | 72871 | 70121 | 71092 | -4.62% |
| random_reads | 8 | 32 | 119197 | 118987 | 117356 | 116933 | -1.64% |
| random_writes | 8 | 32 | 35352 | 35065 | 34562 | 35315 | -0.77% |
| random_writes | 8 | 128 | 34391 | 34686 | 34553 | 34524 | 0.00% |
| random_writes ZIL | 9 | 16/16 | 29086 | 29069 | 28759 | 29188 | -0.36% |
| random_writes ZIL | 10 | 16/1 | 30825 | 30374 | 30865 | 30702 | 0.60% |
| random_writes ZIL | 11 | 1/1 | 5648 | 5590 | 5518 | 5578 | -1.26% |
| random_writes ZIL | 12 | 4/1 | 15571 | 15077 | 15575 | 15282 | 0.68% |
| random_writes ZIL | 13 | 4/4 | 16304 | 16284 | 16222 | 15857 | -1.56% |
| random_writes ZIL | 14 | 64/1 | 36394 | 36443 | 36359 | 36405 | -0.10% |
| random_writes ZIL | 15 | 64/64 | 26934 | 26770 | 26371 | 26538 | -1.48% |
| sequential_reads_arc_cached | 128 | 64 | 33930 | 34005 | 42722 | 42437 | 25.35% |
| sequential_reads_arc_cached | 128 | 128 | 33507 | 33547 | 42031 | 41966 | 25.27% |
| sequential_reads_arc_cached | 1024 | 64 | 3882 | 3889 | 4735 | 4725 | 21.73% |
| sequential_reads_arc_cached | 1024 | 128 | 3855 | 3839 | 4578 | 4679 | 20.31% |
| sequential_reads | 128 | 8 | 12286 | 12955 | 12081 | 12288 | -3.45% |
| sequential_reads | 128 | 16 | 13011 | 12955 | 13069 | 12771 | -0.49% |
| sequential_reads | 1024 | 8 | 1558 | 1543 | 1552 | 1541 | -0.26% |
| sequential_reads | 1024 | 16 | 1573 | 1573 | 1613 | 1611 | 2.48% |
| sequential_writes | 128 | 16 | 47535 | 47425 | 47396 | 47794 | 0.24% |
| sequential_writes | 128 | 32 | 62820 | 62171 | 62138 | 61694 | -0.93% |
| sequential_writes | 1024 | 16 | 6312 | 6308 | 6246 | 6205 | -1.34% |
| sequential_writes | 1024 | 32 | 6267 | 6192 | 6271 | 6192 | 0.03% |
| sequential_writes | 8 | 16 | 827 | 818 | 822 | 825 | 0.12% |
| sequential_writes | 8 | 32 | 824 | 813 | 811 | 799 | -1.65% |

@behlendorf (Contributor) left a comment

Very nice!

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 12, 2021
@adamdmoss (Contributor) commented Mar 12, 2021

I was giving this branch a try-out and I hit some Oopses pretty quickly doing a bunch of heavy parallel reads (mostly from ARC):

[  190.710552] BUG: kernel NULL pointer dereference, address: 00000000000000a8
[  190.710556] #PF: supervisor write access in kernel mode
[  190.710557] #PF: error_code(0x0002) - not-present page
[  190.710558] PGD 0 P4D 0 
[  190.710560] Oops: 0002 [#4] SMP NOPTI
[  190.710562] CPU: 0 PID: 9469 Comm: cp Tainted: P      D    OE     5.4.0-58-generic #64~18.04.1-Ubuntu
[  190.710563] Hardware name: Gigabyte Technology Co., Ltd. Z68MA-D2H-B3/Z68MA-D2H-B3, BIOS F10 02/23/2012
[  190.710568] RIP: 0010:mutex_lock+0x1d/0x40
[  190.710569] Code: 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 9e e6 ff ff 65 48 8b 14 25 c0 6b 01 00 31 c0 <f0> 48 0f b1 13 75 03 5b 5d c3 48 89 df e8 b1 ff ff ff 5b 5d c3 0f
[  190.710570] RSP: 0018:ffffb9d9733abc10 EFLAGS: 00010246
[  190.710572] RAX: 0000000000000000 RBX: 00000000000000a8 RCX: 0000000000000000
[  190.710573] RDX: ffff9bc1a1a51740 RSI: ffffffffffffffff RDI: 00000000000000a8
[  190.710573] RBP: ffffb9d9733abc18 R08: 00000000801a0018 R09: ffffffffc25d3201
[  190.710574] R10: ffffb9d9733abb48 R11: 0000000000000000 R12: 00000000000000a8
[  190.710575] R13: 0000000000000000 R14: ffff9bc1511a0a80 R15: 0000000000000008
[  190.710576] FS:  00007f867c7e0800(0000) GS:ffff9bc2efe00000(0000) knlGS:0000000000000000
[  190.710577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  190.710578] CR2: 00000000000000a8 CR3: 0000000269562004 CR4: 00000000000606f0
[  190.710579] Call Trace:
[  190.710660]  dmu_buf_hold_array_by_dnode+0x3a5/0x500 [zfs]
[  190.710689]  dmu_read_uio_dnode+0x49/0xf0 [zfs]
[  190.710692]  ? _cond_resched+0x27/0x40
[  190.710719]  dmu_read_uio_dbuf+0x45/0x60 [zfs]
[  190.710763]  zfs_read+0x139/0x3b0 [zfs]
[  190.710805]  zpl_iter_read+0xe0/0x180 [zfs]
[  190.710808]  new_sync_read+0x122/0x1b0
[  190.710811]  __vfs_read+0x29/0x40
[  190.710812]  vfs_read+0x8e/0x130
[  190.710813]  ksys_read+0xa7/0xe0
[  190.710815]  __x64_sys_read+0x1a/0x20
[  190.710818]  do_syscall_64+0x57/0x190
[  190.710820]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  190.710822] RIP: 0033:0x7f867bceb151
[  190.710824] Code: fe ff ff 48 8d 3d 3f 9b 0a 00 48 83 ec 08 e8 66 4a 02 00 66 0f 1f 44 00 00 48 8d 05 91 08 2e 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  190.710825] RSP: 002b:00007ffe28eb90a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  190.710826] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f867bceb151
[  190.710827] RDX: 0000000000020000 RSI: 00007f867c809000 RDI: 0000000000000003
[  190.710828] RBP: 0000000000000000 R08: 0000000000020000 R09: 0000000000000000
[  190.710829] R10: 0000000000020000 R11: 0000000000000246 R12: 0000000028eb9700
[  190.710830] R13: 0000000000000000 R14: 00007f867c829000 R15: 0000000028eb9700
[  190.710832] Modules linked in: nls_iso8859_1 wireguard ip6_udp_tunnel udp_tunnel iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter snd_hrtimer rfcomm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_lpss_acpi intel_lpss idma64 virt_dma aufs overlay vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) bnep dm_crypt algif_skcipher af_alg binfmt_misc intel_rapl_msr zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) gpio_ich mei_hdcp intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate btusb nvidia_drm(POE) nvidia_modeset(POE) btrtl btbcm btintel mxm_wmi serio_raw joydev nvidia(POE) bluetooth ecdh_generic ecc input_leds xpad(OE) ff_memless snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mac_hid snd_hda_core snd_hwdep
[  190.710872]  snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich soundcore mei_me mei sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage i915 video i2c_algo_bit nvme drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt r8169 fb_sys_fops nvme_core realtek drm wmi
[  190.710903] CR2: 00000000000000a8
[  190.710907] ---[ end trace 11ed2c922e582238 ]---
[  190.710910] RIP: 0010:mutex_lock+0x1d/0x40
[  190.710912] Code: 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 9e e6 ff ff 65 48 8b 14 25 c0 6b 01 00 31 c0 <f0> 48 0f b1 13 75 03 5b 5d c3 48 89 df e8 b1 ff ff ff 5b 5d c3 0f
[  190.710913] RSP: 0018:ffffb9d971f3fc10 EFLAGS: 00010246
[  190.710914] RAX: 0000000000000000 RBX: 00000000000000a8 RCX: 0000000000000000
[  190.710915] RDX: ffff9bc1bc791740 RSI: ffffffffffffffff RDI: 00000000000000a8
[  190.710916] RBP: ffffb9d971f3fc18 R08: 0000000000000000 R09: ffffffffc25d3201
[  190.710916] R10: ffffb9d971f3fb48 R11: 0000000000000000 R12: 00000000000000a8
[  190.710917] R13: 0000000000000000 R14: ffff9bc1511a0b80 R15: 0000000000000008
[  190.710918] FS:  00007f867c7e0800(0000) GS:ffff9bc2efe00000(0000) knlGS:0000000000000000
[  190.710919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  190.710920] CR2: 00000000000000a8 CR3: 0000000269562004 CR4: 00000000000606f0
[  190.887511] ------------[ cut here ]------------
[  190.887515] kernel BUG at /build/linux-hwe-5.4-Y7FW4I/linux-hwe-5.4-5.4.0/mm/slub.c:306!
[  190.887519] invalid opcode: 0000 [#5] SMP NOPTI
[  190.887522] CPU: 3 PID: 4722 Comm: z_rd_int Tainted: P      D    OE     5.4.0-58-generic #64~18.04.1-Ubuntu
[  190.887523] Hardware name: Gigabyte Technology Co., Ltd. Z68MA-D2H-B3/Z68MA-D2H-B3, BIOS F10 02/23/2012
[  190.887526] RIP: 0010:kfree+0x230/0x240
[  190.887528] Code: fe ff ff 41 0f b6 5c 24 51 eb ad 4d 89 e9 41 b8 01 00 00 00 48 89 d9 48 89 da 4c 89 e6 4c 89 f7 e8 e5 f2 ff ff e9 d3 fe ff ff <0f> 0b 4c 8b 25 c7 ce 36 01 e9 04 fe ff ff 66 90 66 66 66 66 90 4c
[  190.887529] RSP: 0018:ffffb9d94d9a7bc0 EFLAGS: 00010246
[  190.887530] RAX: ffff9bc2c0bb1e00 RBX: ffff9bc2c0bb1e00 RCX: ffff9bc2c0bb1e00
[  190.887531] RDX: 000000000001994a RSI: 0000000000000068 RDI: ffff9bc2c0bb1e00
[  190.887532] RBP: ffffb9d94d9a7be0 R08: ffff9bc11b5ce4e0 R09: 0000000000000000
[  190.887533] R10: ffff9bc0f8c2900d R11: 0000000000000000 R12: ffffe2429002ec40
[  190.887534] R13: ffffffffc25d23d3 R14: ffff9bc2ee0072c0 R15: ffff9bc1200ace00
[  190.887535] FS:  0000000000000000(0000) GS:ffff9bc2efec0000(0000) knlGS:0000000000000000
[  190.887536] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  190.887537] CR2: 0000555e4f247000 CR3: 0000000347a0a006 CR4: 00000000000606e0
[  190.887538] Call Trace:
[  190.887546]  spl_kmem_free+0x33/0x40 [spl]
[  190.887600]  dmu_zfetch_stream_done+0x61/0xd0 [zfs]
[  190.887624]  ? arc_buf_alloc_impl.isra.48+0x284/0x350 [zfs]
[  190.887649]  dbuf_prefetch_fini+0x1f/0x30 [zfs]
[  190.887674]  dbuf_issue_final_prefetch_done+0x1f/0x40 [zfs]
[  190.887697]  arc_read_done+0x26f/0x490 [zfs]
[  190.887720]  l2arc_read_done+0x987/0xb80 [zfs]
[  190.887721]  ? kmem_cache_free+0x294/0x2b0
[  190.887722]  ? kmem_cache_free+0x294/0x2b0
[  190.887726]  ? _cond_resched+0x19/0x40
[  190.887727]  ? mutex_lock+0x12/0x40
[  190.887770]  zio_done+0x43f/0x1020 [zfs]
[  190.887813]  zio_execute+0x99/0xf0 [zfs]
[  190.887817]  taskq_thread+0x2fc/0x4e0 [spl]
[  190.887820]  ? wake_up_q+0x80/0x80
[  190.887862]  ? zio_taskq_member.isra.12.constprop.18+0x70/0x70 [zfs]
[  190.887865]  kthread+0x121/0x140
[  190.887869]  ? task_done+0xb0/0xb0 [spl]
[  190.887870]  ? kthread_park+0x90/0x90
[  190.887872]  ret_from_fork+0x1f/0x40
[  190.887873] Modules linked in: nls_iso8859_1 wireguard ip6_udp_tunnel udp_tunnel iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter snd_hrtimer rfcomm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_lpss_acpi intel_lpss idma64 virt_dma aufs overlay vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) bnep dm_crypt algif_skcipher af_alg binfmt_misc intel_rapl_msr zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) gpio_ich mei_hdcp intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate btusb nvidia_drm(POE) nvidia_modeset(POE) btrtl btbcm btintel mxm_wmi serio_raw joydev nvidia(POE) bluetooth ecdh_generic ecc input_leds xpad(OE) ff_memless snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mac_hid snd_hda_core snd_hwdep
[  190.887898]  snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich soundcore mei_me mei sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage i915 video i2c_algo_bit nvme drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt r8169 fb_sys_fops nvme_core realtek drm wmi
[  190.887917] ---[ end trace 11ed2c922e582239 ]---

x86_64, Linux version 5.4.0-58-generic (buildd@lgw01-amd64-040) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #64~18.04.1-Ubuntu SMP Wed Dec 9 17:11:11 UTC 2020

@amotin (Member Author) commented Mar 12, 2021

@adamdmoss I suppose it starts working fine if you disable prefetch? And if you revert that single commit it is also working fine? Debug build does not give anything interesting? I see l2arc_read_done() call, are you using L2ARC? Can you say anything more about your setup/workload?

@adamdmoss (Contributor) commented Mar 12, 2021

@adamdmoss I suppose it starts working fine if you disable prefetch? And if you revert that single commit it is also working fine? Debug build does not give anything interesting? I see l2arc_read_done() call, are you using L2ARC? Can you say anything more about your setup/workload?

Haven't tried disabling prefetch or doing a debug build, though yes, if I undo this merge then the same case appears to work fine. I am indeed using L2ARC. Not sure what else is terribly interesting about this setup (happy to answer specifics).
The test case is to read 9 x 1GB files at the same time (4 physical cores); the files in this dataset happen to be zstd-8 compressed with an 8k recordsize and compress at around 2:1. Not sure if any of those variables are relevant; this was just the first benchmark I tried, which I had used for other purposes to compare optimizations.

@amotin (Member Author) commented Mar 13, 2021

@adamdmoss I appreciate your report, but I haven't seen anything like this myself on FreeBSD, despite running different tests for a couple of weeks now. I've just tried to add L2ARC, but it hasn't caused anything so far. I'll continue the investigation, but I'd appreciate it if you could try some variations to collect more data. I may even guess that the problem is not in my code, but for example in an incorrect number of callbacks from dbuf_prefetch_impl() or something like that, causing a use-after-free, which may be triggered by some specific code path; that is why I thought about your L2ARC.

@amotin (Member Author) commented Mar 15, 2021

I've added some more assertions in a separate commit to catch some hypothetical stream use-after-free cases.

@adamdmoss It would be good to try the code built with debug to see whether it gives anything more useful. I am still unable to reproduce any issues after running a mix of sequential and random reads from 9 compressible files with zstd-8 compression and 8KB recordsize, with partial ARC and L2ARC hits, for most of the weekend, so I'd appreciate either more information/scripts to reproduce the issue or some more debugging input.

@amotin (Member Author) commented Mar 17, 2021

@adamdmoss The assertion I've added in dmu_zfetch_stream_fini() is supposed to explicitly catch the case of a stream being freed while it is still on the zf_stream list. I'd ask you to make sure that you do have the module built with ZFS_DEBUG/--enable-debug, or whatever the absence of NDEBUG is called on Linux.

@adamdmoss (Contributor) commented:

I'd ask you to make sure that you do have the module built with ZFS_DEBUG/--enable-debug, or whatever the absence of NDEBUG is called on Linux.

I suspect you're right, and even though I'm specifying ./configure --enable-debug etc., those config options aren't making it all the way through to the final DKMS build. I'll try again in my VM where I don't use an srpm/DKMS config and asserts definitely work.

@adamdmoss (Contributor) commented:

I was having trouble repro'ing any issue on my VM until I raised /sys/module/zfs/parameters/zfetch_max_distance from ~8MB to 1GB (!); then it started hitting this assertion repeatedly:

[ 2236.349261] VERIFY3(ipf_start <= ipf_end) failed (547606 <= 433728)
[ 2236.349265] PANIC at dmu_zfetch.c:447:dmu_zfetch_run()
[ 2236.349266] Showing stack for process 765013
[ 2236.349269] CPU: 2 PID: 765013 Comm: cp Tainted: P           OE     5.8.0-45-generic #51~20.04.1-Ubuntu
[ 2236.349270] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 2236.349271] Call Trace:
[ 2236.349278]  dump_stack+0x74/0x92
[ 2236.349289]  spl_dumpstack+0x29/0x2b [spl]
[ 2236.349294]  spl_panic+0xd4/0xfc [spl]
[ 2236.349362]  ? rwsem_is_locked+0x9/0x20 [zfs]
[ 2236.349465]  ? queued_spin_unlock+0x9/0x10 [zfs]
[ 2236.349517]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 2236.349567]  ? __raw_spin_unlock+0x9/0x10 [zfs]
[ 2236.349616]  ? zfs_refcount_remove_many+0x116/0x1d0 [zfs]
[ 2236.349659]  dmu_zfetch_run+0x348/0x3b0 [zfs]
[ 2236.349699]  dmu_buf_hold_array_by_dnode+0x187/0x640 [zfs]
[ 2236.349738]  dmu_read_uio_dnode+0x4b/0x130 [zfs]
[ 2236.349798]  ? queued_spin_unlock+0x9/0x10 [zfs]
[ 2236.349855]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 2236.349894]  dmu_read_uio_dbuf+0x47/0x60 [zfs]
[ 2236.349951]  zfs_read+0x136/0x3a0 [zfs]
[ 2236.350009]  zpl_iter_read+0xa3/0x110 [zfs]
[ 2236.350012]  new_sync_read+0x10c/0x1a0
[ 2236.350014]  vfs_read+0x161/0x190
[ 2236.350015]  ksys_read+0x67/0xe0
[ 2236.350017]  __x64_sys_read+0x1a/0x20
[ 2236.350019]  do_syscall_64+0x49/0xc0
[ 2236.350021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2236.350023] RIP: 0033:0x7fd3272c0142

ipf_start = MAX(zs->zs_pf_blkid1, zs->zs_ipf_blkid1);
ipf_end = zs->zs_ipf_blkid1 = zs->zs_ipf_blkid;
mutex_exit(&zf->zf_lock);
ASSERT3S(pf_start, <=, pf_end);
ASSERT3S(ipf_start, <=, ipf_end);

^ this is the assert

epbs = zf->zf_dnode->dn_indblkshift - SPA_BLKPTRSHIFT;
ipf_start = P2ROUNDUP(ipf_start, 1 << epbs) >> epbs;
ipf_end = P2ROUNDUP(ipf_end, 1 << epbs) >> epbs;
ASSERT3S(ipf_start, <=, ipf_end);
issued = pf_end - pf_start + ipf_end - ipf_start;
	if (issued > 1) {
		/* More references on top of taken in dmu_zfetch_prepare(). */
		zfs_refcount_add_many(&zs->zs_refs, issued - 1, NULL);

^ if ipf_start > ipf_end then ipf_end - ipf_start is negative, though pf_end - pf_start might still be positive, and so pf_end - pf_start + ipf_end - ipf_start might also still be > 1, meaning that we're doing zfs_refcount_add_many() with fewer refs than there will be dbuf_prefetch_impl() calls.

(That logic itself isn't necessarily wrong, but I wanted to explain how ipf_start>ipf_end would end up taking too few refs - hence a premature destruction of zs.)
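
For example (hypothetical numbers): with pf_start = 100 and pf_end = 116 but ipf_start = 50 and ipf_end = 40, issued = (116 - 100) + (40 - 50) = 6, so only 5 extra references are added on top of the one taken in dmu_zfetch_prepare() (6 total), while the data prefetch loop can still start 16 dbuf_prefetch_impl() calls, leaving the stream's refcount 10 short.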

Comment on lines +52 to +53
uint64_t zs_pf_blkid1; /* first block to prefetch */
uint64_t zs_pf_blkid; /* block to prefetch up to */
Contributor:

Not a fan of the blkid1->blkid naming convention for a range; how about something like zs_pf_first_blkid and zs_pf_last_blkid ?

} else {
pf_start = pf_end = 0;
}
ipf_start = MAX(zs->zs_pf_blkid1, zs->zs_ipf_blkid1);
Contributor:

Could you explain why this ipf_start makes sense, please?
I notice the original dmu_zfetch() code has the same line, but there zs_pf_blkid has already been rewritten with zs->zs_pf_blkid = pf_start + pf_nblks first.

Member Author:

The idea here is that we do not need indirect block prefetch for blocks for which we are prefetching the data. But it can be that we've already prefetched some indirect blocks ahead, and we don't need to prefetch them again. That is why there is a MAX(). It is done here, slightly duplicating the logic, because we do not reach this code for every dmu_zfetch_prepare() call, so the code makes some adjustment to decide for which blocks to still fetch metadata and for which to prefetch data.

@amotin (Member Author) commented Mar 18, 2021

I was having trouble repro'ing any issue on my VM until I raised /sys/module/zfs/parameters/zfetch_max_distance from ~8MB to 1GB (!); then it started hitting this assertion repeatedly:

Do you have tunings like that (or any) applied on your original test system reproducing the problem? That would totally explain the issue. It makes no sense to have data prefetch go further than indirect block prefetch, and I think it causes a negative max_blks for indirect blocks, so zs->zs_ipf_blkid starts to go backward from its previous position instead of forward.
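
As a concrete illustration (hypothetical numbers): with zfetch_max_distance raised to 1 GB while zfetch_max_idistance stays at its much smaller default, the data prefetch front may be placed roughly 1 GB ahead of the demand position while the indirect block prefetch limit is still computed only tens of MB ahead of that same position, so the newly computed zs_ipf_blkid can land behind the previously saved zs_ipf_blkid1 and the range runs backward.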

I wanted to explain how ipf_start>ipf_end would end up taking too few refs - hence a premature destruction of zs.

Yes. That is exactly the reason I've added those assertions. I hadn't anticipated the arithmetic trick you are showing, expecting some race condition instead, but this is interesting too. I'll take another look at the prefetch distance math tomorrow, but if you are using some prefetcher tunables on your original system, could you please remove them and confirm that the problem goes away? Or at very least set zfetch_max_idistance bigger than zfetch_max_distance.

@adamdmoss (Contributor) commented Mar 18, 2021

Do you have tunings like that (or any) applied on your original test system reproducing the problem?

I do! Though much less extreme. :)

Or at very least set zfetch_max_idistance bigger than zfetch_max_distance.

I'm testing that next reboot - idistance was indeed defaulting to less than my tuned zfetch_max_distance. (In my defense, idistance didn't exist as a tunable when I instituted the tuning. ;) )

@adamdmoss (Contributor) commented:

That is exactly the reason I've added those assertions

Good thinking then..! Would you mind turning this ASSERT() (or at least one other good diagnostic assert on this problem path) into a VERIFY()? It'd prevent a fumbled tuning from cascading into (more-or-less) memory corruption on a non-debug system.
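
For context on that request (both macro families exist in the ZFS code base): ASSERT* checks are compiled out of non-debug builds, while VERIFY* checks are always evaluated and panic on failure, so the latter would stop a bad tunable combination before it can turn into memory corruption. A minimal sketch:

```c
ASSERT3S(ipf_start, <=, ipf_end);	/* no-op unless built with --enable-debug */
VERIFY3S(ipf_start, <=, ipf_end);	/* always checked; panics if it fails */
```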

@adamdmoss (Contributor) commented:

(Assuming there's not just a simpler better way to preserve good behavior across all cases anyway!)

@adamdmoss (Contributor) commented:

Or at very least set zfetch_max_idistance bigger than zfetch_max_distance.

I've verified - that does solve it!

This removes the dependency between the data and metadata prefetch distance
tunables, which previously allowed data prefetch to go beyond metadata prefetch;
that made no sense and also caused a stream reference counting problem.

Signed-off-by: Alexander Motin <[email protected]>
@amotin (Member Author) commented Mar 18, 2021

@adamdmoss I've just committed a one-line patch, making the maximum indirect block prefetch distance be measured from the last data prefetch block instead of the demand read position. It should untie the two tunables, solving this problem. I'd appreciate it if you could test it.
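
A sketch of the kind of change being described (hypothetical variable names; this is not the literal diff from the PR):

```c
/*
 * Before (sketch): indirect block prefetch distance measured from the
 * demand read position, so a large zfetch_max_distance could push the
 * data prefetch front past the indirect prefetch front.
 */
zs->zs_ipf_blkid = end_of_access_blkid + max_idist_blks;

/*
 * After (sketch): measured from the last data prefetch block instead,
 * so the indirect prefetch front can never fall behind the data front,
 * whatever the two tunables are set to.
 */
zs->zs_ipf_blkid = zs->zs_pf_blkid + max_idist_blks;
```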

@adamdmoss (Contributor) commented:

@adamdmoss I've just committed a one-line patch, making the maximum indirect block prefetch distance be measured from the last data prefetch block instead of the demand read position. It should untie the two tunables, solving this problem. I'd appreciate it if you could test it.

Great, thanks, I'll give it a shot.

@adamdmoss (Contributor) commented:

0b65542 seems robust!

@behlendorf (Contributor) commented:

@amotin @adamdmoss nice job getting to the root cause and fixing it. Then this should be ready to merge.

@amotin (Member Author) commented Mar 19, 2021

Then this should be ready to merge.

@behlendorf Yes, I am satisfied with the code at this stage. In the future we may think about how to move dmu_zfetch_prepare() even earlier for writes, since right now we read indirect blocks before opening the transaction, so the prefetcher still sees significant reordering (or maybe axe out that double read?), but that's for next time.

@behlendorf behlendorf merged commit 891568c into openzfs:master Mar 20, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads same time on completion, that can significantly
reorder the requests, making the stream look like random.  But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.

This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run().  The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os.  The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued.  It even
tracks the activity of other concurrent threads, issuing the prefetch
only when _all_ on-demand requests are issued.

For many years it was a big problem for storage servers, handling
deeper request queues from their clients, having to either serialize
consequential reads to make ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients.  Benefits of those ways
varied, but neither was perfect.  With this patch deeper queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to FreeBSD target show me much better throughput with almost
100% prefetcher hit rate, comparing to almost zero before.

While there, I also removed per-stream zs_lock as useless, completely
covered by parent zf_lock.  Also I reused zs_blocks refcount to track
zf_stream linkage of the stream, since I believe previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.

Delete prefetch streams when they reach ends of files.  It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.

Block data prefetch (speculation and indirect block prefetch is still
done since they are cheaper) if all dbufs of the stream are already
in DMU cache.  First cache miss immediately fires all the prefetch
that would be done for the stream by that time.  It saves some CPU
time if same files within DMU cache capacity are read over and over.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Adam Moss <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes openzfs#11652
ghost pushed a commit to truenas/zfs that referenced this pull request Apr 5, 2021
adamdmoss pushed a commit to adamdmoss/zfs that referenced this pull request Apr 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 7, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 19, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 25, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 27, 2021
@amotin amotin deleted the zfetch_prepare branch August 24, 2021 20:18