Temporary system-wide lockups with ZSTD when writing to dataset. #13409

Closed
ghost opened this issue May 3, 2022 · 16 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@ghost

ghost commented May 3, 2022

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version N/A
Kernel Version 5.17.5-zen1-1-zen
Architecture x86_64
OpenZFS Version zfs-2.1.4-1

ZFS installed with the zfs-linux-zen package from the archzfs repository.

RAM: 20GB no ECC
CPU: i5-4440
No hypervisor.
zfs.zfs_arc_max=4294967296 zfs.zfs_arc_min=1073741824 zfs.l2arc_trim_ahead=1

Describe the problem you're observing

When compression=zstd is set, writing to the dataset causes the whole system, including the mouse cursor, to freeze for a short moment, inconsistently, about every 5 seconds for as long as the write is in progress.

Describe how to reproduce the problem

Create a dataset with compression=zstd recordsize=1M atime=off xattr=sa and copy a large amount of data to it.

The properties listed after compression=zstd are also set on the dataset; I'm not sure whether they contribute to the issue.

The system should temporarily lock up multiple times until the write finishes. Executing sync immediately afterwards also tends to trigger a temporary lockup.
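
For reference, a rough reproduction sketch along these lines (the pool/dataset name and source path are placeholders, not from the report):

$ zfs create -o compression=zstd -o recordsize=1M -o atime=off -o xattr=sa tank/zstdtest
$ rsync -a /path/to/large/data/ /tank/zstdtest/
$ sync   # issuing sync right after the copy tends to trigger another stall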

Include any warning/errors/backtraces from the system logs

dmesg shows no problems or anything relating to ZFS.

@ghost ghost added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 3, 2022
@jittygitty

@jmb6 Someone other than me might know if your CPU can make use of all the hardware acceleration available for compression.

But how about posting some metrics from top, iotop, etc. captured on the system while writing to the dataset when the freezes happen?
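
(For example, something along these lines run in a second terminal while the copy is in progress; the exact flags are just one way to capture it:)

$ top -b -d 1 -o %CPU | head -n 40    # batch-mode snapshot sorted by CPU usage
$ sudo iotop -o -b -d 1               # only show threads actually doing I/O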

@rincebrain
Contributor

Yeah, zstd is not ideal for interactive responsiveness at higher {compression, recordsize} levels. A few people have floated proposals to make this less bad, but none of them have been merged yet. (I'd be curious whether mine is of any use to you, for instance, but that only helps if the data is incompressible - compressible data still costs you uninterrupted CPU time.)

As far as CPU-specific optimizations go, other than some strategic compilation hints in the code, the only thing it can currently use is the BMI2 instructions, which Haswell only just supports. (Newer zstd revisions can do one or two things with other instructions in some strategic places, but when I benchmarked them for ZFS, it didn't seem to be much of a gain, if any. I can go wire up my Haswell system later and see if that's any different, though.)

@jittygitty

@rincebrain Thanks for the effort on that; it looks like it was quite a bit of work with all that testing, and it's definitely something I'd want to try, especially on a "desktop" system with zstd. It would be really great if @jmb6 could try your patch and report back : )

@rincebrain
Contributor

Even if it works for incompressible data, actually compressible data isn't going to be any better.

Unfortunately, I don't have a great answer for that at the moment, though I've some ideas. @adamdmoss has a patch one could try for it - I believe #11709, though I haven't tested it recently.

@ghost
Author

ghost commented May 6, 2022

Sorry for the delay.

Update: I just did a system update (ZFS didn't update though, and neither did the kernel), and the problem seems to be less severe, but it's still there. Or at least I'm having trouble reproducing the problem at the severity I experienced when I opened this issue. I can still get it to freeze for a moment if I execute a sync during the write. I'm not sure what's changed, but this has been a persistent problem for me ever since zstd support was released for ZFS.

@jmb6 Someone other than me might know if your CPU can make use of all the hardware acceleration available for compression.

lscpu says that my CPU does support the BMI2 extension.
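
(For anyone else checking, the flag can also be read straight from /proc/cpuinfo:)

$ grep -m1 -o bmi2 /proc/cpuinfo
bmi2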

But how about posting some metrics from top, iotop, etc. captured on the system while writing to the dataset when the freezes happen?

(h)top shows that either z_wr_iss or z_wr_iss_h jumps to around 30% CPU usage close to the freeze, followed by the rsync process doing the write, and everything else is mostly idle. Memory usage (excluding cache) goes up by about a gigabyte during the write and is stable there throughout.

iotop shows disk write activity between 40-90 MB/s during the write (no other processes are writing on the system), jumping up to 100-200 MB/s when the freezes happen.

Yeah, zstd is not ideal for interactive responsiveness at higher {compression, recordsize} levels.

I am using the default level though (zstd-3). I am using a recordsize of 1M, but reducing it to 128K didn't seem to have much of an effect; I still got the freezes. It was my impression, though, that zstd was supposed to be really good at realtime compression.
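
(For reference, that test amounts to something like the following; the dataset name is a placeholder, and changing recordsize only affects blocks written after the change, so the data has to be copied again:)

$ zfs set recordsize=128K tank/zstdtest   # existing blocks keep their original record size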

I just tested gzip, and the freezing is worse with gzip. That's to be expected, I guess, so it does indicate that this is a CPU usage issue. But I would have thought that the kernel would preempt ZFS, so I'm not sure why there are freezes even with high CPU usage.

I was also able to get a very short freeze with lz4 and a well timed sync.

So it seems that the problem isn't actually zstd-specific, I just only noticed it first with zstd.

Anyway I did a benchmark with the zstd command on one of the files I tested with:

3#.07.01-x86_64.iso : 817180672 -> 810134953 (x1.009), 1657.6 MB/s, 4277.5 MB/s
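
(That looks like the output of zstd's built-in benchmark mode at level 3, i.e. presumably something like:)

$ zstd -b3 /path/to/file.iso   # -bN: benchmark level N, reporting ratio and compress/decompress speed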

@rincebrain
Contributor

rincebrain commented May 6, 2022

Shockingly, if you force the system to stop everything it's doing and flush IO, you get increased latency while it does that.

The problem, I believe, is that the writers doing the compression basically don't get preempted while running, so larger blocks or more expensive compression makes it markedly more exciting. The patch from @adamdmoss adds forced reschedules, IIRC, but that just makes it less bad, not fixes it.

I have some experiments I need to get back to about having different binning of IO thread priority (beyond what we already have) so that you don't end up stalling everyone with this, but haven't yet.

@adamdmoss
Contributor

adamdmoss commented May 6, 2022

The patch from @adamdmoss adds forced reschedules, IIRC

It once did, but I removed those parts. I learned to relax and love the PREEMPT kernels; haven't looked back since.

@adamdmoss
Contributor

I have some experiments I need to get back to about having different binning of IO thread priority (beyond what we already have) so that you don't end up stalling everyone with this, but haven't yet.

Heh, FWIW I run with local patches which do similar things, coarsely. It's a good idea IMHO for other reasons, but it only mildly helps responsiveness; the problem is that although it may make a compression task less likely to be scheduled when other stuff is going on, once it is scheduled you still get that kerTHUNK of uninterruptibility.

@ghost
Author

ghost commented May 6, 2022

The problem, I believe, is that the writers doing the compression basically don't get preempted while running

I'm not very familiar with Linux internals, but can the kernel not preempt itself at almost any point, like it can with userspace processes?

I learned to relax and love the PREEMPT kernels

I'm running a PREEMPT kernel too.

@rincebrain
Contributor

I believe the common modes are "more or less voluntary-only preemption" and "no, really, we might preempt you for any reason at any time outside preempt_disable blocks".

@ghost
Author

ghost commented May 6, 2022

Is there a reason why ZFS compression code runs in the former mode?

@rincebrain
Contributor

I should be more clear - I believe those are the two modes for the kernel, not ones you can swap at runtime.

The compression code wouldn't randomly schedule a preemption because A) it's mostly stock code that has no idea it would need to care, and B) you would wreck performance at the high end by unconditionally rescheduling every op.

I had a few ideas in writing this reply though, maybe I'll try them...

@ghost
Author

ghost commented May 6, 2022

So if I understand correctly, the preemption mode is a compile-time option.

My kernel has preemption at any point enabled:

$ uname -a
Linux arch 5.17.5-zen1-1-zen #1 ZEN SMP PREEMPT Wed, 27 Apr 2022 20:56:14 +0000 x86_64 GNU/Linux
$ zcat /proc/config.gz | grep PREEMPT
CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

So why is the ZFS compressor not being preempted on my system? Is the compressor wrapped in a preempt_disable block or is this a bug?

@adamdmoss
Contributor

It's baffling to me why you're not getting good results with a PREEMPT kernel. Unless it's a Zen / 5.17 thing, though I doubt it.

I think 5.17 has runtime-switchable preemption support via kernel self-patching, and I guess it's possible that it's not currently switched on for you, but I can't find any clear documentation for this.
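
(If that's the CONFIG_PREEMPT_DYNAMIC mechanism visible in the config dump above, then as far as I know the active mode can be inspected at runtime through debugfs, or chosen at boot with the preempt= kernel parameter; paths and availability may vary:)

$ sudo cat /sys/kernel/debug/sched/preempt   # the entry in parentheses is the active mode, e.g. "none voluntary (full)"
$ # or select it at boot with the kernel command-line parameter preempt=full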

Alternatively, stock ZFS may run compression tasks at high priority (I'm looking at the code but I can't immediately remember if higher numbers are higher or lower priority 🤷) and the scheduler is just deciding not to interrupt it even though it could; you could try setting the spl module parameter spl_taskq_thread_priority to 0 to leave all task queue priorities as default. That could be interesting...
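
(For anyone who wants to try this, a sketch, assuming the parameter is writable on your build: it can be flipped through sysfs, but it only affects taskq threads created afterwards, so reloading the module or rebooting is the surer test. A modprobe.d entry makes it persistent:)

$ echo 0 | sudo tee /sys/module/spl/parameters/spl_taskq_thread_priority
$ echo "options spl spl_taskq_thread_priority=0" | sudo tee /etc/modprobe.d/spl.conf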

@rincebrain
Contributor

rincebrain commented May 6, 2022

Lower numbers win on Linux, I believe.

Note that that tunable won't adjust existing thread priorities.

@ghost
Author

ghost commented May 6, 2022

you could try setting the spl module parameter spl_taskq_thread_priority to 0 to leave all task queue priorities as default

I tried it out, and it looks like it completely solved the problem. Even if I set the compression to gzip or zstd-9 there are no lockups at all; it's completely smooth. So yeah, it looks like the kernel was deciding not to preempt the thread. I checked the priorities before setting spl_taskq_thread_priority: z_wr_iss_h had a niceness of -20, which is the highest priority, and z_wr_iss was at -19.
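
(Those priorities can be listed with something like:)

$ ps -eLo ni,comm | grep z_wr_iss | sort -u   # NI column is the niceness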

Could leaving that spl parameter set to 0 on a desktop system cause other problems?

I just tried the stock non-zen kernel as well, but that didn't solve it. Only setting that parameter does.

This issue was closed.