Deadlock under heavy I/O using xattrs and deduplication #1657
Which version of ZFSOnLinux are you using and which distribution?
Latest source release download - 0.6.1. Linux is 3.9.10 with Infiniband and a 1000hz kernel, not a stock kernel.
@wbrown It appears the system is blocked reading deduplication table entries from disk. When it hangs like this, are you still seeing lots of I/O being issued or does the system appear idle?
When it hangs like this, I/O comes to a near standstill; zero writes and reads across the board to the entire pool, except for a few twitches.
@wbrown So what's happening here is that a critical I/O is not completing (either successfully or failing). There have been a handful of reports of this sort of problem, and the root cause in most instances is misbehaving hardware, typically under heavy load. However, similar problems do occasionally get reported for the other ZFS platforms, so there may be a common bug here. What's never been easy to determine in these situations is exactly what state the I/O is in. If you're running fairly recent code for ZoL it would be helpful to dump the state of the blocked threads the next time this happens.
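For reference, a minimal way to capture that state on Linux (assuming sysrq is enabled) is to dump the stacks of blocked tasks to the kernel log:

```sh
# Dump stacks of all uninterruptible (blocked) tasks into dmesg
echo w > /proc/sysrq-trigger
dmesg | tail -n 200          # look for txg_sync, z_wr_iss and other ZFS threads

# Or inspect a single hung thread directly (replace <pid> with the stuck task)
cat /proc/<pid>/stack
```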
Next time I see this hang, I'll do that. Some data points:
One thing that I'd altered was disabling one of the tunables.
Some further news here -- after studying the issue, I realized that despite attempting to force all writes to go through my ZeusRAM ZIL SLOG by setting the pool's sync property, that wasn't actually happening. What it turned out was that despite my attempts to use the ZIL as the write throttle, writes larger than 1MB were going straight to the pool. GlusterFS does write coalescing, so writes can often be larger than 1MB; the combination of Gluster's write-back cache and the default ZIL SLOG size limit meant most of my writes bypassed the SLOG. With the ZIL being written straight to the pool, on top of the heavy deduplicated write load, the pool couldn't keep up. Increasing the ZIL SLOG limit addressed this.
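A quick way to confirm whether sync writes are actually landing on the SLOG is to watch per-vdev I/O during a test run (the pool name here is hypothetical):

```sh
# If the log device sits idle while the data vdevs absorb the sync writes,
# the SLOG is being bypassed
zpool iostat -v tank 1
```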
@wbrown Thank you for following up on the issue. This completely explains why you were observing issues with your workload. Can you summarize what non-standard tunings you needed to make on your system to get Gluster behaving well?
Sure, @behlendorf. This is for a GlusterFS setup layered on top of ZFS with heavy deduplication of content that deduplicates well. Hardware:
The pool is set to:
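As a rough sketch of properties matching this description (the pool name 'tank' and the exact property list are assumptions):

```sh
zfs set compression=lz4 tank   # cheap compression, less I/O to disk
zfs set dedup=on tank          # the content deduplicates well
zfs set xattr=sa tank          # assumption, since the issue title mentions xattrs
```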
Compression is good, especially cheap LZ4 compression as it reduces the amount of I/O to disk.
Deduplication is also good if you have a dataset that works really well for this. But at the scale of terabytes, you really need a maxed-out system.
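zdb can estimate how well a dataset will dedup and how big the DDT will get before committing to it (the pool name is hypothetical):

```sh
# Simulate dedup on an existing pool and print the projected ratio
zdb -S tank

# On a pool that already has dedup enabled, show the DDT histogram;
# each DDT entry costs roughly 320 bytes of ARC/L2ARC
zdb -DD tank
```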
Best results here for my data set.
This is to force all writes to be committed via the ZIL SSD. This device has both low latency and high throughput, so the penalty for doing this is small. The pool's L2ARC is set to metadata only, for the deduplication tables:
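A minimal sketch of those two settings (the pool name is an assumption):

```sh
zfs set sync=always tank              # every write is committed via the ZeusRAM SLOG
zfs set secondarycache=metadata tank  # L2ARC holds metadata (including the DDT) only
```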
Set the ZIL SLOG limit to 256MB so that there's a near dead certainty that writes are going to the ZIL. If this doesn't happen, I risk a daily system implosion from the hangs described in this issue.
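Assuming this refers to the zil_slog_limit module parameter (its 1MB default is why the >1MB writes above bypassed the SLOG), 256MB would be set like this:

```sh
# /etc/modprobe.d/zfs.conf
options zfs zil_slog_limit=268435456   # 256MB: keep large sync writes on the SLOG

# Or on a running system
echo 268435456 > /sys/module/zfs/parameters/zil_slog_limit
```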
I have had to set this to get any sort of decent performance from my workload -- the write throttle algorithm seems to really be off in the case of a heavy synchronous dedup load, and stalls my systems a lot.
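The 0.6.x code exposed the old write throttle through module parameters like the ones below; which of them this comment refers to is an assumption:

```sh
# /etc/modprobe.d/zfs.conf
options zfs zfs_no_write_throttle=1         # disable the per-txg write throttle
# options zfs zfs_write_limit_override=...  # alternatively, pin the per-txg write limit
```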
Prefetch is nearly worthless in the context of many, many small files -- in the range of hundreds of millions.
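Prefetch is disabled with a module parameter:

```sh
# /etc/modprobe.d/zfs.conf
options zfs zfs_prefetch_disable=1   # prefetch buys nothing for hundreds of millions of small files
```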
The above has appeared to help the stability of my system, though this is questionable and not proven.
This allows ZFS to take up about 70% of system memory for the ARC cache.
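The cap is zfs_arc_max, in bytes; the value below assumes 128GB of RAM purely for illustration:

```sh
# /etc/modprobe.d/zfs.conf
# ~70% of an assumed 128GB of RAM (90GiB); size this to the real machine
options zfs zfs_arc_max=96636764160
```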
One of the most important settings for a deduplication server -- as high an ARC metadata limit as you can get away with.
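That limit is zfs_arc_meta_limit, also in bytes; the value here is illustrative only:

```sh
# /etc/modprobe.d/zfs.conf
# Let metadata (including the DDT) consume most of the ARC
options zfs zfs_arc_meta_limit=85899345920
```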
This is helpful when you have hundreds of synchronous write threads.
I get 50% cache hit rate, and this helps with the metadata loads for deduplication.
This used to be set to 30 -- when combined with the ZeusRAM SLOG, I got really good write performance with my workload, but it was apparently impacting system stability.
I have enterprise SAS disks with dual-pathing, so this leverages the SAS bus to its fullest. Don't try this with SATA disks.
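On 0.6.x the per-vdev queue depth was zfs_vdev_max_pending (later replaced by the zfs_vdev_*_active parameters); both the parameter and the value here are assumptions about what this refers to:

```sh
# /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_max_pending=32   # deeper per-vdev queue for dual-ported SAS drives
```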
I want my L2ARC to be filled with metadata as fast as it can take it rather than take the risk of page eviction.
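The L2ARC feed rate is governed by these module parameters; the values are illustrative:

```sh
# /etc/modprobe.d/zfs.conf
options zfs l2arc_write_max=67108864    # 64MB/s steady-state feed rate (default is 8MB/s)
options zfs l2arc_write_boost=134217728 # 128MB/s while the ARC is still cold
options zfs l2arc_noprefetch=1          # don't pollute the L2ARC with prefetched buffers
```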
@wbrown Can you reproduce this in 0.6.3? |
Closing for now, we can reopen it if we get confirmation it's still an issue.
I've been having system hangs on my ZoL systems running Gluster -- this typically seems to happen under heavy I/O, and appears to be a race condition of some sort.
The zpool in question has 15 mirrors, an L2ARC SSD, and forced sync to a SAS ZIL.
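For context, a pool of that shape would be created along these lines (device and pool names are hypothetical, and only three of the fifteen mirror pairs are shown):

```sh
zpool create tank \
  mirror sda sdb \
  mirror sdc sdd \
  mirror sde sdf \
  log   slog0 \
  cache l2arc0
zfs set sync=always tank   # the "forced sync to a SAS ZIL" part
```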