
L2ARC shall not lose valid pool metadata #10957

Open
zfsuser opened this issue Sep 20, 2020 · 17 comments
Labels
Type: Feature (Feature request or new feature)

Comments

@zfsuser

zfsuser commented Sep 20, 2020

Describe the feature you would like to see added to OpenZFS

Requirements:

  • ZFS shall keep all (cached & still valid) pool metadata in L2ARC if tunables and size of pool metadata, L2ARC and ARC allow
  • The code change shall be minimally invasive, without requiring a redesign of L2ARC or on-disk format changes
  • The feature shall be relevant for pools with L2ARC and secondarycache set to all or metadata, and have no impact on pools without L2ARC or with secondarycache set to none or e.g. data
  • The feature shall be compatible with pools with one or multiple top level L2ARC vdevs
  • The feature shall be enabled/disabled via a zfs tunable
  • The amount of metadata shall be limited to a percentage of the L2ARC size via a zfs tunable, to avoid a negative impact on pools with a very small blocksize and/or a small L2ARC relative to the pool size
  • The impact of the feature shall be visible via zfs observables
  • The feature shall not cause ARC buffers to be moved from MRU to MFU
  • The feature shall not distort zfs statistics / observables

Idea:

  • Before the L2ARC feed-thread deletes (overwrites or trims) the L2ARC area containing the oldest data, ARC_STATE_L2C_ONLY metadata in this area is read back into the ARC (see the sketch after this list)
  • This is only performed if conditions allow; otherwise behaviour remains as before this feature was added.
  • To ensure the metadata read from the end of the L2ARC is written back to the L2ARC before it can be evicted from ARC, it might make sense to internally set the metadata L2ARC_HEADROOM to 0 when the feature is enabled and the condition allows
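
A minimal sketch of how this rescue step could hook into the feed thread, purely to illustrate the idea above; every function name used here (l2arc_rescue_metadata, l2arc_rescue_condition_ok, l2arc_first_hdr_in_range, l2arc_next_hdr_in_range, hdr_is_l2c_only_metadata, l2arc_read_hdr_into_arc) and the ARC_FLAG_NO_MFU_PROMOTE flag are hypothetical, not existing OpenZFS interfaces:

```c
static void
l2arc_rescue_metadata(l2arc_dev_t *dev, uint64_t evict_start,
    uint64_t evict_end)
{
	arc_buf_hdr_t *hdr;

	/* Fall back to the current eviction behaviour if disabled. */
	if (!l2arc_keep_meta || !l2arc_rescue_condition_ok(dev))
		return;

	/*
	 * Walk the headers whose L2ARC payload lies in the region the
	 * feed thread is about to overwrite or trim.
	 */
	for (hdr = l2arc_first_hdr_in_range(dev, evict_start, evict_end);
	    hdr != NULL;
	    hdr = l2arc_next_hdr_in_range(dev, hdr, evict_end)) {
		if (!hdr_is_l2c_only_metadata(hdr))
			continue;
		/*
		 * Read the buffer back into ARC without promoting it to
		 * MFU and without touching hit counters, so statistics
		 * are not distorted (see the requirements above).
		 */
		l2arc_read_hdr_into_arc(hdr, ARC_FLAG_NO_MFU_PROMOTE);
	}
}
```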

How will this feature improve OpenZFS?

  • The L2ARC feed-thread will no longer delete still-valid metadata which is only cached in L2ARC.
  • With this, it could now make sense to warm up an L2ARC by reading the complete pool metadata (if zdb or zpool scrub had an option to use the ARC for metadata)
  • There should be no downside, as the feature only activates when circumstances allow and is user-configurable
  • See first section

Additional context

Condition:

  • The pool L2ARC has enough free space to store the cached metadata completely. The calculation could be similar to the following pseudo-code (assuming use of pool instead of vdev parameters; a C-style sketch of the same check follows below):
    (l2arc_dev->"meta_buf_to_be_evicted_asize") < (pool->l2arc_available_asize + MIN(pool->l2arc_data_buf_asize, (pool->l2arc_data_buf_asize + pool->l2arc_meta_buf_asize) * l2arc_meta_limit_percent / 100%) - pool->l2arc_dev_count * 2 * L2ARC_WRITE_MAX * L2ARC_TRIM_AHEAD/100%)
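
A C-style sketch of the same condition, assuming hypothetical pool-wide aggregates (pool_l2arc_stats_t with l2arc_available_asize, l2arc_data_buf_asize, l2arc_meta_buf_asize and l2arc_dev_count, plus an l2ad_evict_meta_asize field on the device); the existing l2arc_write_max and l2arc_trim_ahead tunables stand in for L2ARC_WRITE_MAX and L2ARC_TRIM_AHEAD:

```c
static boolean_t
l2arc_rescue_condition_ok(l2arc_dev_t *dev)
{
	pool_l2arc_stats_t *pool = dev->l2ad_pool;	/* hypothetical aggregate */
	uint64_t meta_budget, headroom, reserved;

	/* Metadata may occupy at most l2arc_meta_limit_percent of the L2ARC. */
	meta_budget = (pool->l2arc_data_buf_asize +
	    pool->l2arc_meta_buf_asize) * l2arc_meta_limit_percent / 100;
	headroom = MIN(pool->l2arc_data_buf_asize, meta_budget);

	/* Keep room for the next writes (plus trim-ahead) of every L2ARC device. */
	reserved = pool->l2arc_dev_count * 2 * l2arc_write_max *
	    l2arc_trim_ahead / 100;

	/* Rescue only if the metadata about to be evicted still fits. */
	return (dev->l2ad_evict_meta_asize <
	    pool->l2arc_available_asize + headroom - reserved);
}
```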

Remarks:

  • Reading back metadata in the ARC without impacting MRU/MFU assignment might require adding or updating functions in arc.c
  • Depending on the implementation of the condition, the asize of data and metadata buffers stored in the pool L2ARC needs to be available in the code.
  • Ensuring the metadata read from the end of the L2ARC is not immediately (in the same feed cycle) written back to (the same) L2ARC should allow for some load-balancing in case of multiple L2ARC top-level vdevs (e.g. when another L2ARC vdev was just added).

Tunables:

  • vfs.zfs.l2arc.keep_meta: (or a better name), 0=old behaviour, 1=new behaviour, default=0 (at the moment)
  • vfs.zfs.l2arc.meta_limit_percent: 0-100, default=100 (keeps the original behaviour if keep_meta=0); a declaration sketch follows after this list
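
If implemented as regular module parameters, the declarations could look roughly like the following sketch (the variable names follow the proposal above and are not part of OpenZFS; the ZFS_MODULE_PARAM lines assume the pattern used for the existing l2arc_* tunables, which surface as vfs.zfs.l2arc.* on FreeBSD):

```c
/* Hypothetical tunables; names follow the proposal above. */
static int l2arc_keep_meta = 0;			/* 0 = old behaviour, 1 = new behaviour */
static uint_t l2arc_meta_limit_percent = 100;	/* max % of L2ARC for rescued metadata */

ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, keep_meta, INT, ZMOD_RW,
	"Rescue L2ARC-only metadata before its region is overwritten");
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, meta_limit_percent, UINT, ZMOD_RW,
	"Max percent of L2ARC that rescued metadata may occupy");
```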

Observables:

  • kstat.zfs.misc.arcstats.l2_keep_meta_skip: (or a better name), a counter of the number of feed cycles in which the condition paused the feature. Alternatively, the number of feed cycles in which metadata was evicted (there might already be a kstat for this)
zfsuser added the Type: Feature label on Sep 20, 2020
@amotin
Member

amotin commented Sep 21, 2020

To me this sounds like additional complication with no obvious benefits. ZFS already has a small non-evictable metadata cache in RAM for the most important pool metadata. On top of that, normal ARC and L2ARC operation should ensure that (meta-)data accessed at least occasionally stays cached. If for some reason you need all of your metadata to reside on SSDs, just add a special metadata vdev to your pool; that will be much more efficient from all perspectives than use of L2ARC. L2ARC should be used for cases where you cannot predict the active data set in advance, and in that context making some (meta-)data more special than others even if accessed only rarely is a step in the wrong direction.

From a purely mechanical standpoint, I think there will be a problem with checksum verification. Since the L2ARC header in RAM does not store it, unless there is an actual read request with a full block pointer, the code reloading blocks from L2ARC into ARC won't be able to verify the checksum.

@zfsuser
Author

zfsuser commented Sep 21, 2020

The motivation is the wish to have an L2ARC which stores data and metadata, but prioritizes metadata. Basically behaving as with secondarycache=metadata, but in addition also storing data on an opportunistic basis. Have your cake and eat it too.

Without requiring a complete redesign of the L2ARC. Without requiring separate partitions for data and metadata, or a secondarycache property configurable per L2ARC top-level vdev instead of once per pool; such an approach would most likely result in ineffective use of the physical L2ARC vdev anyway.

In the end the idea is to keep the L2ARC as it is, and just prevent losing perfectly fine pool metadata when its storage area in the persistent L2ARC is being overwritten. The idea is not to store the complete pool metadata in the L2ARC, but yes, it could happen depending on L2ARC size, tunables and access patterns.

The special vdevs are very interesting but require interface ports and drive slots. And as their redundancy should be no less than that of the data disks of the pool, a raidz2 pool would require the ability to house and connect ~3 additional drives. While this is no issue for big iron, for SOHO it is quite often not possible.

Keeping rarely accessed metadata in the L2ARC should not be an issue. The L2ARC just has to be bigger than 0.1% (128kiB blocksize) to ~3% (4kiB blocksize) of the pool size, and/or a tunable like vfs.zfs.l2arc.meta_limit_percent has to be set to a value <100%. The tunable would ensure that enough of the L2ARC remains available for random-access (non-meta)data.

Regarding your point about zfs mechanics, do I understand your explanation correctly?

Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we tried to just read back L2ARC blocks, we would have no parent block and would therefore be missing the checksum needed to verify that the block was not corrupted?

Doesn't this problem also apply to reading back the persistent L2ARC? Was this solved with the log blocks? If so, couldn't we use those log blocks to check that the data is uncorrupted?

@richardelling
Contributor

FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?

@amotin
Member

amotin commented Sep 22, 2020

Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we tried to just read back L2ARC blocks, we would have no parent block and would therefore be missing the checksum needed to verify that the block was not corrupted?

Right. An L2ARC block's checksum is identical to the normal block checksum, since it uses the same compression/encryption and is just stored in a different place. It does not require separate storage.

Doesn't this problem also apply to reading back the persistent L2ARC? Was this solved with the log blocks? If so, couldn't we use those log blocks to check that the data is uncorrupted?

Persistent L2ARC does not reload the data into ARC, it only reconstructs the previous L2ARC headers on pool import. The log blocks have their own checksums, which don't cover the actual data blocks. Any possible corruption is detected later when a read is attempted by an application, in which case the read is just silently redirected to main storage.
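
To illustrate the point, a simplified sketch (not the actual on-disk log structures): a persistent-L2ARC log entry records enough to rebuild the in-RAM header, but the only checksum written alongside it covers the log block itself; the checksum of the cached data block stays in its parent block pointer on the main pool.

```c
/* Simplified and illustrative only; field names do not match the real structs. */
typedef struct l2arc_log_entry_sketch {
	dva_t		dva;		/* location of the block in the main pool */
	uint64_t	birth;		/* birth txg */
	uint64_t	l2daddr;	/* offset of the cached copy on the cache device */
	uint64_t	lsize;		/* logical size */
	uint64_t	psize;		/* physical (compressed) size */
	/* note: no checksum of the cached data block itself */
} l2arc_log_entry_sketch_t;
```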

@zfsuser
Author

zfsuser commented Sep 23, 2020

Due to the smaller size of metadata, the same amount of L2ARC space will contain more metadata than data, and will therefore have a higher hit probability. Also (if I have not misunderstood the discussion), having data in the (L2)ARC is not really helpful if the corresponding metadata is not also cached and would need to be read from spinning rust. Getting rid of the separation would result in simpler code, but metadata would lose its VIP handling, and users would lose mechanisms to adapt their pool to their needs. In my opinion, until somebody performs an in-depth analysis which indisputably shows that the pros of removing the separation outweigh the cons (including a rewrite of the zfs code, with the possibility of introducing errors), the implemented separation of metadata/data caching is clearly worth it.

Interesting, so the persistent L2ARC is only reading back and checking the ARC L2ARC headers, and the L2ARC blocks are only checked when accessed on a cache hit.

As we shall verify all data read from persistent media against its checksum, an implementation of this feature seems to require:

  • A mechanism to find the parent metadata of an L2ARC-only metadata block, to be able to verify the checksum of the block being read back.
  • Only metadata whose parent metadata is cached in ARC/L2ARC shall be "rescued" from L2ARC overwrite (see the sketch below).
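
A minimal sketch of what the second bullet implies, assuming hypothetical helpers find_cached_parent_bp() and arc_read_rescue_with_bp() (neither exists in OpenZFS): the checksum needed to validate a rescued block lives in its parent's block pointer, so a block is only rescued when that parent is itself cached.

```c
static int
l2arc_rescue_one(arc_buf_hdr_t *hdr)
{
	const blkptr_t *bp = find_cached_parent_bp(hdr);	/* hypothetical */

	/* Parent not cached: skip the rescue and let the region be overwritten. */
	if (bp == NULL)
		return (SET_ERROR(ENOENT));

	/*
	 * Issue a normal ARC read using the parent's block pointer; the
	 * regular read path verifies the checksum and, on a mismatch,
	 * falls back to the main pool just like an ordinary L2ARC hit.
	 */
	return (arc_read_rescue_with_bp(hdr, bp));	/* hypothetical */
}
```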

@shodanshok
Contributor

FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?

I think so: correct use of the secondarycache property can make a very big difference when traversing datasets with millions of files. For example, I have an rsnapshot machine where the ARC caches both data and metadata, while the L2ARC caches metadata only. The performance improvement when iterating over these files (i.e. by rsync), compared to a similarly configured XFS, really is massive. Using secondarycache=metadata was a significant improvement over the default secondarycache=all setting.

So I would really like to maintain the data/metadata separation we have now.

@devZer0

devZer0 commented May 7, 2021

yes, please !

I think removing the differentiation of L2ARC metadata/data would be damn stupid.

I have two systems where I'm under pressure now to address the "too much runtime spent in metadata access" problem.

The first system is a backup server where we use rsync + sanoid ZFS rotating snapshots, containing tens of millions of files which rarely change.

We could add a special vdev for metadata/small files, but I dislike the idea of buying enterprise-class, mirrored SSDs for nothing but speeding up metadata access, as we would distribute the backup data across HDDs and SSDs. I do not want to "stripe" our company's backup across different types of disk that depend on each other for proper function. I have been an admin for a long time, and adding a special vdev for backup causes a hunch of subliminal discomfort; I think it's the wrong way to go. And, even worse, we would need to rework the whole pool, as there is no method to push metadata to a special vdev afterwards. We would need to take the system out of production for several days for that...

The same goes for Proxmox Backup Server, which is similar to borgbackup regarding data storage. The Proxmox documentation even recommends using SSDs for the entire backup pool (doh!) or at least adding a special vdev for metadata acceleration. Pruning and garbage collection are metadata-intensive workloads, and some "tiny" backup datastore with about 1.5TB of data will not function properly without adding SSDs, as the runtime for prune and GC already goes through the roof with that "little" data...

I think it's absurd to use a special vdev for metadata (i.e. put the original data there) given that we have "primarycache=all" & "secondarycache=metadata", which is meant exactly for addressing these types of problems, i.e. speeding up metadata read access by adding el-cheapo consumer-grade SSDs as a read cache. So if they die, you lose nothing but performance... and they are trivial to replace (no resilver...) - besides the "cache device removal hangs zfs/zpool" bug.

I have seen this being discussed often and repeatedly, and I'm really curious why it's still not being addressed - see this old discussion for example:
https://illumos.topicbox.com/groups/zfs/T8729ed10fa3d42db-Mae35bc26ef8372ad4203ddaf

@malventano

This may all be tangentially related to an issue I've been tracking where, even when parameters are tuned to keep metadata in the ARC (not L2ARC), metadata continues to be prematurely purged when sufficient data passes through the ARC: #10508

@devZer0

devZer0 commented May 18, 2021

#12028

@devZer0

devZer0 commented Jun 16, 2021

To give another comment on this: I have added L2ARC to the two systems mentioned above, and with secondarycache=metadata the runtime for rsync and for Proxmox Backup Server garbage collection or verify has improved considerably since then.

I don't see why the ZFS cache keeps losing metadata over and over again instead. It's precious cached data and it should be preferred/preserved.

@grahamperrin
Contributor

@malventano

malventano commented Jul 20, 2021

To me this sounds like additional complication with no obvious benefits. ZFS already has a small non-evictable metadata cache in RAM for the most important pool metadata. On top of that, normal ARC and L2ARC operation should ensure that (meta-)data accessed at least occasionally stays cached.

The 'obvious benefits' are evident in that TrueNAS specifically sets arc_meta_min to a higher-than-default (multi-GB) value to try to prioritize metadata preservation in the ARC. This still fails in the face of high data throughput. I'm frankly surprised that you didn't see the benefit here, given how significantly it impacts TrueNAS use cases. If a user transfers a few hundred GB of files off of their NAS, they should not then expect a find operation that previously took seconds to now take tens of minutes to complete. Those few GB of metadata take far more time/IOPS to repopulate compared to a bulk sequential transfer - that data has no business displacing the metadata given the lopsided consequences of purging one over the other. So yes, the benefits are clear, now if only the implementation weren't broken as it currently is.

@Ryushin

Ryushin commented Jul 20, 2021

We have a very large 1.6PiB system using 232 hard drives in 21 raidz2 VDEVs with 512GiB of RAM. We have ten 15.4TB NVMe drives that are partitioned into five 20GiB mirrors for SLOG and ten 10TiB L2ARC cache partitions (the rest is left empty for garbage collection), giving us 100GiB for SLOG and 100TiB for L2ARC. We change on average about 10TiB of data each day, so the L2ARC can cache about ten days' worth of data. Our dataset uses a 1M recordsize. The system has two bonded 100Gb Ethernet connections serving dozens of users that are connected via 10Gb.

We have 4.7 million files on this system. With a cold L2ARC, file system traversal takes 65 minutes. After it's cached in ARC, it takes 21 seconds. On a nightly basis, I run a "find /storage > /dev/null" to traverse the entire dataset, which takes 54 seconds as it's pulled from L2ARC back into ARC. This is really a bandaid, as there should be an option to keep metadata from being evicted from L2ARC.

We went with L2ARC devices instead of special VDEVs as it is more flexible for our needs. First, it does not tie the pool to a specific piece of hardware, as we can just move the JBODs to a different server that does not have access to the NVMe in case there was some kind of system failure. Second, we do have prefetch turned on and have tuned the L2ARC for our needs, with amazing results. It is common for us to have 90+% L2ARC hits during the day.

I cannot see a single downside to having ZFS L2ARC prioritize keeping metadata over other data. In most cases, this eliminates the need for special VDEVs, with their additional potential point of failure. In our case, we were originally looking at using six 15.4TB NVMe SSDs in two 3-way mirrors as special VDEVs. What a waste of NVMe, and they would have become permanently tied to that pool once we did that. Instead, we decided to first test using those six NVMe drives for L2ARC, and the results were so dramatic that we upped it to ten drives.

@shodanshok
Contributor

@Ryushin can I suggest trying the new l2arc_mfuonly tunable? It should avoid polluting the L2ARC with read-once data.

@mqudsi

mqudsi commented Nov 19, 2021

@shodanshok does that also prevent one-time enumeration of metadata from being entered into L2ARC?

@shodanshok
Contributor

@mqudsi metadata caching in L2ARC is controlled by the secondarycache dataset property. l2arc_mfuonly affects both data and metadata, as it simply sets the L2ARC to accept evictions from the MFU only.

@mqudsi

mqudsi commented Nov 22, 2021

Right, so if you have both primarycache=all and secondarycache=all and want to prioritize keeping metadata over data in L2ARC (but still want to use the L2ARC for data, just with a lower priority), then with l2arc_mfuonly=1 there's a chance that a common workaround like the one @Ryushin posted (find /storage > /dev/null) will fail to guarantee that, after it has run and there is then contention for ARC space, the L2ARC contains the cached metadata for all files, right?
