Metadata Allocation Class #3779
I swear someone else was working on a metadata-specific vdev for much the same purpose.
@DeHackEd I have seen writings about the concept but FWIW, no recollection of anyone else working on it in an open source area. Found today: #1071 (comment) mentioning Tegile, and Google finds e.g.:
It might be nice to allow (optionally!) one (or two?) of the metadata ditto blocks to reside in the ordinary pool as well, making the metadata vdevs a kind of dedicated, write-through cache. (Different from L2ARC with secondarycache=metadata because they really would be holding authoritative copies of the metadata, but in dire straits could still be removed from the pool or used at lower fault tolerance -- one SSD instead of two in a mirror, etc.)
I just watched the livestream presentation on this. This is definitely a feature needed in ZFS. I've struggled to keep metadata cached. The only way I've been able to get a decent amount of metadata cached is to set the L2ARC to metadata only, but it takes many passes and still probably misses a significant amount. I would love to build pools with metadata on SSD tiers. Metadata typically accounts for half of my disk I/O load. There was a presentation from Nexenta a couple of years back about tiered storage pools with similar goals: http://www.open-zfs.org/w/images/7/71/ZFS_tiering.pdf I haven't heard anything of this effort since.
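For context, the L2ARC workaround mentioned above is just a per-dataset property; a minimal sketch, assuming a pool named tank with a dataset tank/data (names are placeholders):

```sh
# Existing workaround: restrict the L2ARC to metadata for this dataset.
# This only caches metadata; it does not hold authoritative copies the
# way a dedicated metadata vdev would.
zfs set secondarycache=metadata tank/data

# primarycache can stay at the default (all) so file data still uses ARC.
zfs get primarycache,secondarycache tank/data
```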
Hi @don-brady |
@tuxoko I'm hoping to post a public WIP branch soon. The creation/addition of VDEVs dedicated to specific metadata classes is functional. I'm currently working out accounting issues in the metaslab layer. We just started running ztest with metadata-only classes to help shake out any edge cases (found a few). Let me come up with a to-do list so others can help.
Sounds great!!
Hi @don-brady |
It seems that the WIP pull request #5182 was never referenced here, or vice versa... |
It's been a year since the last update and I was wondering how this is progressing. Specifically, I'm interested in DDT devices. Moving the DDT onto dedicated high-speed devices should allow dedupe to function nearly as fast as the current memory-only implementation while requiring much less memory. For storage of VMs, dedupe could easily save much more space than compression, but the current memory requirements usually make it too costly.
See #5182 for the WIP. It's gone through a few iterations but I'm running an (old) version here. Very satisfied thus far. (No dedup, just regular metadata)
@pashford So far as I know, DDT metadata can reside on L2ARC devices. The only thing this would change is to permit writebacks to go to faster media, rather than the primary (spinning rust) storage. That seems like it's unlikely to be a huge improvement vs. just having the DDT hang around in L2ARC.
@nwf, thanks for the information. If the writeback goes to faster media, then a future DDT miss in the ARC/L2ARC would also be served from that faster media, which would give a performance bump. As an example, if you have a 2PB pool of 7200 RPM storage and a few fast SSDs (SATA, SAS or NVMe) as DDT devices, DDT performance will be better, especially if only a portion of the DDT is kept in memory.
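For anyone reading this later: the allocation-class work in #5182 did end up exposing a dedicated dedup vdev class (OpenZFS 0.8 and later), which is essentially what is being asked for here. A hedged sketch, with the pool name and device paths as placeholders:

```sh
# Add a mirrored pair of fast SSDs to hold the deduplication table (DDT).
# Requires the allocation-classes feature that landed via #5182.
zpool add tank dedup mirror /dev/nvme0n1 /dev/nvme1n1

# The new vdev should appear under its own class in the pool layout.
zpool list -v tank
```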
Hi @don-brady, may I ask a silly question -- what are the redundancy requirements for DDT storage? I mean, would it be possible to reconstruct the deduplication table if the existing DDT data is lost?
( @don-brady ) To add, we are currently looking into enabling deduplication on our 0.5 PB research storage cluster, and I'd be very much interested in testing this feature. We are running ZFS on Linux 0.6.5 (Ubuntu 16.04 LTS), but if you could point me to the most recent update (#5182 (comment)?), I can start with build tests etc.
Is there anything left to do in this ticket, or should it be closed now that PR #5182 landed? |
Yup, we can close this. Thanks. |
Intel is working on ways to isolate large-block file data from metadata for ZFS on Linux. In addition to the size discrepancy with file data, metadata often has a more transient lifecycle and additional redundancy requirements (ditto blocks). Metadata is often a poor match for a RAIDZ tier since it cannot be dispersed and the relative parity overhead is high. Mirrored redundancy is a better choice for metadata.
A metadata-only allocation tier is being added to the existing storage pool allocation class mechanism and is used as the primary source for metadata allocations. File data remains in the normal class. Each top-level metadata VDEV is tagged as belonging to the metadata allocation class and at runtime becomes associated with the pool's metadata allocation class. The remaining (i.e. non-designated) top-level VDEVs default to the normal allocation class. In addition to generic metadata, the performance-sensitive deduplication table (DDT) data can also benefit from having its own separate allocation class.
More details to follow.
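As a forward-looking illustration only (this reflects how the feature eventually surfaced once #5182 merged, with the metadata class exposed as a special vdev; device names and the small-block threshold are placeholders):

```sh
# Pool with file data on RAIDZ and metadata on a mirrored SSD 'special' vdev.
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd \
    special mirror /dev/sde /dev/sdf

# Optionally route small file blocks (here <= 32K) to the metadata class too.
zfs set special_small_blocks=32K tank
```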