
Filling special vdevs #15226

Open

Haravikk opened this issue Aug 31, 2023 · 19 comments

Labels: Type: Feature (Feature request or new feature)

Comments

@Haravikk

Describe the feature you would like to see added to OpenZFS

I would like to see a new command, or possibly an option added to zpool scrub, that allows records to be transferred/copied onto special devices if capacity allows.

Essentially this would scan the dataset(s) for records at or below the special_small_blocks threshold and copy/move them to the special device(s) so they can immediately benefit from the faster device.

How will this feature improve OpenZFS?

Currently when a special device is added to an existing pool, there is no immediate benefit, as any existing records that could be on the special device will remain where they are. This means special devices really need to be added when a pool is created, or dataset(s) need to be re-created in order to gain the full benefits.

Otherwise only newly written records will be stored on the special device and thereby benefit from its increased speed.
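
For context, the manual workflow today looks roughly like the sketch below (a hedged illustration: the pool name "tank", the device paths and the 32K threshold are placeholders, not values from this issue); only writes issued after these commands land on the special device:

    zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1   # add a special allocation class vdev to an existing pool
    zfs set special_small_blocks=32K tank/dataset             # route data blocks at or below 32K to the special class
    # Existing records stay on the original vdevs; only newly written blocks benefit.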

Additional context

I'm creating this request as an extension of issue #15118 ("write-through" special devices), which is why I mention transfer/copy: within the context of a "safe" special device (no redundancy required), this feature would involve copying rather than moving qualifying records.

However, the two should be considered independent: while this is more important for #15118 (as reloading would restore lost performance), it isn't critical that "reloading" be implemented first.

Haravikk added the Type: Feature label on Aug 31, 2023
@kiler129

kiler129 commented Sep 6, 2023

You can use https://github.com/markusressel/zfs-inplace-rebalancing as a stop-gap solution. Due to how it works, however, you should ensure that no application can modify the files while the operation is in progress. I'm currently working on a more robust version, but atomicity in userspace isn't 100% possible.
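
For anyone unfamiliar with it, the general idea behind that kind of in-place rebalancing is to rewrite each file so its blocks are reallocated under the current vdev layout. A simplified sketch of the approach (not the actual script; /tank/data is a placeholder mountpoint, and it assumes nothing else touches the files):

    # Rewrite each file in place so its blocks are reallocated (new special vdev included).
    find /tank/data -type f -print0 | while IFS= read -r -d '' f; do
        cp -a --reflink=never "$f" "$f.rebalance" || continue   # force a fresh copy with new block allocations
        if cmp -s "$f" "$f.rebalance"; then                     # verify the copy byte-for-byte
            mv "$f.rebalance" "$f"                              # rename() is atomic within the dataset
        else
            rm -f "$f.rebalance"                                # keep the original if verification fails
        fi
    done

Note that this preserves file contents but, as mentioned, is only safe if the files cannot change mid-copy, and it permanently inflates the space referenced by any existing snapshots.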

@GregorKopka
Contributor

This already exists, it's available via zfs send A | zfs recv A.temp; zfs destroy A; zfs rename A.temp A (additional flags may be required for your use case) as an offline means of doing what you ask for.
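
Spelled out a bit more fully, one possible variant of that offline approach (hedged: the dataset name tank/A, the snapshot name and the -R/-u flags are illustrative, not prescriptive) would be:

    zfs snapshot -r tank/A@migrate                          # snapshot the dataset (and children) to send from
    zfs send -R tank/A@migrate | zfs recv -u tank/A.temp    # rewrite everything into a new dataset, unmounted
    zfs destroy -r tank/A                                   # drop the original once the receive completes
    zfs rename tank/A.temp tank/A
    zfs mount tank/A                                        # mount the renamed dataset

The trade-off is that the pool needs enough free space to hold a second full copy of the dataset while both exist.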

@kiler129

This already exists, it's available via zfs send A | zfs recv A.temp; zfs destroy A; zfs rename A.temp A (...)

You're missing the point completely. Try that with a large dataset - it's unrealistic to have a spare 30TB to do that. The point of in-place rebalance is to do it without needing to have as much free space as your data takes.

@GregorKopka
Contributor

You're missing the point completely. Try that with a large dataset - it's unrealistic to have a spare 30TB to do that. The point of in-place rebalance is to do it without needing to have as much free space as your data takes.

Try your approach with any dataset that has a somewhat recent snapshot which (for reasons) can't be dropped instantly: it'll double in allocated size and stay that way. No way around that without block pointer rewrite.

@Haravikk
Author

Try your approach with any dataset that has a somewhat recent snapshot which (for reasons) can't be dropped instantly: it'll double in allocated size and stay that way. No way around that without block pointer rewrite.

That's only true if you do indeed copy every single file, but if the goal is preloading a special device then there are some tweaks that can be made.

But again you seem to be missing the point; the entire purpose of this issue is to avoid the need for either of these options (in-place copying or sending the entire dataset). ZFS already knows which records could be moved/copied onto the special device(s); we just need a way to trigger that separately from ordinary writing.

@GregorKopka
Contributor

What you are asking for requires block pointer rewrite.

Please read up on that topic before jumping to unfounded conclusions about anyone missing points.

@Haravikk
Author

Haravikk commented Sep 30, 2023

What you are asking for requires block pointer rewrite.

It's only required if the data is actually moved from the vdev to the special device; if it's only copied between them then this should be less challenging. Most users aren't going to care whether the data is moved rather than copied, as the goal is to improve performance; this issue was also posted following #15118 where copying is the goal anyway.

The fact that it's possible to migrate the data via "hacks" (zfs send/receive or "in place" copying) means ZFS can already do this without block pointer rewriting; it just needs to do it automatically and only for the required data, i.e. identify special records, "copy" them into new (relocated) records and then retire the old ones – no block pointer rewriting necessary. And since we're talking about a newly added device there shouldn't be any capacity concerns (the worst-case scenario is that you don't gain any additional capacity until you delete some snapshots).

@GregorKopka
Contributor

As the whole point of the exercise is to move existing data to a different vdev - which, given the design decision taken at the beginning of ZFS to never overwrite data on-disk and to maintain a strong checksum chain up to the uberblocks, is technically impossible without block pointer rewrite - you ask for something that (at this point in time and for the foreseeable future) is simply not possible.

With your argument that "Most users aren't going to care whether the data is moved rather than copied" you also invalidate any need for this feature, which then basically boils down to a glorified wrapper around a bash script that fits onto an 80s-era text display.

As for your argument that this would be just a follow-up to #15118: please see #15118 (comment).

@Haravikk
Author

Haravikk commented Oct 2, 2023

As the whole point of the exercise is to move existing data to a different vdev

That's not the point at all, as moving the data isn't required in the short term; the key goal is to "preload" a new special device so it's filled with data as quickly as possible. The primary purpose of a special device is to accelerate reads/writes to the pool; the added capacity is a secondary benefit (and with #15118, potentially unwanted).

While it would be nice if the data were moved, so that capacity could be freed up (as with recreating the pool/datasets), it's not a requirement for this feature to function. If we ever get proper block pointer rewriting it can be used to change this into a proper move, but the feature in no way requires that to be implemented first.

@GregorKopka
Contributor

Please read the comment I linked (and the link inside it).

That would give you what you desire, with a technically feasible approach that does not require changing fundamentals of ZFS nor having to ignore a significant part of the ZFS feature set.

@Haravikk
Author

Haravikk commented Oct 2, 2023

Please read the comment I linked (and the link inside it). That would give you what you desire

I did, and no, it wouldn't.

that does not require changing fundamentals of ZFS nor having to ignore a significant part of the ZFS feature set

Who said anything about either? ZFS already has and manages copies of data (it is one of the most fundamental parts of ZFS). At most, the only new capability that might be required is the ability to indicate that a "new" record is a redundant copy of another, so it can be skipped by zfs send; but that would be a generally useful feature to have, especially if block pointer rewriting is going to remain a blocker to useful features for the foreseeable future.

@GregorKopka
Contributor

Indicating that a "new" record is a copy of another requires rewriting the block pointer(s) that point toward the "old" record; otherwise pool free-space accounting will fail and lead to destruction of the pool.

Would you please notice the word "rewrite" there and from that figure out what consequences it has?

@Haravikk
Author

Haravikk commented Oct 6, 2023

Indicating that a "new" record is a copy of another requires rewriting the block pointer(s) that point toward the "old" record; otherwise pool free-space accounting will fail and lead to destruction of the pool.

No it doesn't; while block pointer rewriting would be beneficial, as it would avoid the need for "new" records to do this initially, it's also a capability that can be added later when (if) block pointer rewrite is implemented.

I'm getting really tired of having to point out to you that the features you keep shooting down as "impossible" without block pointer rewrite are simply intended to automate what can already be done via copying. If it can be done by wastefully copying everything, then it can be done by more precisely copying only what is needed.

If it can later be done by only rewriting existing blocks then great, that'll be a fantastic benefit of that feature should it ever see the light of day. But a potential future feature that may never get implemented can't keep being used as an excuse not to implement other useful features.

@bghira

bghira commented Oct 7, 2023

@GregorKopka that seems unnecessarily hostile

As it is, vdev removal is now a thing, and it too was once considered not to be technically possible without block pointer rewrite.

More constructive dialogue would be nice, such as discussing the limitations of the vdev removal approach and how it might be applicable here.

@GregorKopka
Contributor

GregorKopka commented Oct 8, 2023

that seems unnecessarily hostile

Hostilities started right here, after I pointed out technical problems with the suggestion.

the limitations of the vdev removal approach

Please see #9129 and related about the limitations of vdev removal and what could possibly be done about them.

All current (and, AFAIK, so far suggested) approaches to vdev removal work on a completely different layer of the ZFS stack and (while they could in theory be augmented to distribute blocks from the to-be-removed vdev onto different vdevs by some additional criteria) add at least one level of indirection. That indirection not only increases the required storage space (for mapping the old DVA to the new one) but also has to be traversed on every access, which impacts performance by increasing latency and reducing the available IO bandwidth of the bare-metal drives.

From this I do not see how taking that route could help with the stated goal of this issue: moving data onto other vdevs when there is not enough space in the pool to hold a temporary copy of the dataset.

@Haravikk
Author

Haravikk commented Oct 8, 2023

Hostilities started at #15226 (comment), after I pointed out technical problems with the suggestion.

That comment is in no way hostile whatsoever; even if it were (which it isn't) that wouldn't excuse your behaviour. You've been trolling multiple issues, arguing in bad faith, creating straw-man arguments, insulting people, and flat out just lying about what others have asked for, case in point:

From this I do not see how taking that route could help with the stated goal of this issue: moving data onto other vdevs when there is not enough space in the pool to hold a temporary copy of the dataset.

Once again, the priority with filling a special vdev is not moving the data, it's getting the data onto the special vdev; the actual mechanism doesn't matter. If a lack of block pointer rewriting means that the relevant data needs to be copied rather than moved, then it simply needs to be copied.

While the "vdev remap" layer could be usable, it's again not required, and may be overkill for the purpose. This feature request is asking for something we can already do ourselves (copy the files), except that having ZFS do it would allow it to be done more precisely and automatically (eliminating the need for users to somehow identify the files to copy themselves). While it would be great if it could be done by moving, if that's going to be too costly/difficult then it's not a requirement and never has been.

I have now tried multiple times to remind you of what the actual goal of this issue is; I have to assume you are aware of it by now, yet you are still trying to tell everyone (including the issue's author) that they want something else. Why are you here if you don't care what the issue even is? Feel free to ignore it and related issues instead.

@GregorKopka
Contributor

Why are you here if you don't care what the issue even is?

Instead of assuming (ass-u-me: making an ass out of you and me) that I'm a troll who just wants to fuck with you out of pure boredom... you could maybe think for a moment about the potential other reasons why someone who has been here 10+ years repeatedly tries to point out the technical and practical shortcomings they perceive in the feature(s) you suggest.

@Haravikk
Author

Haravikk commented Oct 8, 2023

Instead of assuming (ass-u-me: making an ass out of you and me) that I'm a troll who just wants to fuck with you out of pure boredom... you could maybe think for a moment about the potential other reasons why someone who has been here 10+ years repeatedly tries to point out the technical and practical shortcomings they perceive in the feature(s) you suggest.

And yet it holds true, as here you are with more insults.

As I've already pointed out multiple times, you haven't been raising "technical shortcomings" with this feature; you've been making false assumptions about what the feature's purpose is so you can shoot down a version of it that nobody actually asked for. If you don't want to give feedback on the feature that was actually requested, then all you're doing is making straw-man arguments in bad faith that we could all very much do without.

@kiler129

kiler129 commented Oct 8, 2023

I'm hesitant to respond here, as the atmosphere has definitely gone from productive to a personal sandbox*

On the technical side of things I have to agree with @Haravikk, especially on "This feature request is asking for something we can already do ourselves (copy the files), except that having ZFS do it would allow it to be done more precisely and automatically": this hits the nail on the head and more. It's impossible in userspace to ensure the safety of this operation: with Linux having removed mandatory locks, unless one can ensure nothing is using the dataset, it isn't safe to perform. Doing it in userspace is orders of magnitude more dangerous, not to mention less efficient.

On top of that, there are many scenarios in my line of work where moving the data isn't really the main objective; the goal is ensuring that the special vdev is filled in order to speed up the application. Currently my production system uses ~1.4% of its data for the special vdev. If I had to settle for "wasting" 1.5% of space on the main pool so that the data is copied to the special vdev before it rotates out, that would be perfectly acceptable. It would be great if the data were truly moved, but that's more of a cherry on top in my use case.
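
(For reference, the per-class allocation such a figure comes from can be read with zpool list -v, which lists the special vdev on its own row with SIZE/ALLOC/CAP columns; "tank" below is a placeholder pool name.)

    zpool list -v tank   # per-vdev capacity and allocation, including the special class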

Yes, doing this 100% perfectly is complex (or maybe even not feasible without large changes, as you guys pointed out). However, perfect is the enemy of good. There are tradeoffs everywhere, and as long as they are sane and documented most people are OK with that. ZFS itself has many of them: it will probably use more space, it will use more CPU, and it will cause higher I/O usage... but in return it provides a lot of benefits over FAT32 (yes, I'm being deliberately hyperbolic here). It's a tradeoff.


*Off-topic:
I am not a contributor to OpenZFS yet, but I am actively watching all the monthly meetings. I am also trying to dabble in the codebase to understand it better. [I think] I do not yet have sufficient knowledge of the project to contribute in terms of code. However, this issue honestly worries me. It is a discussion between two experienced community members who began by disagreeing on technical details but ended up in a hostile exchange... I will not point fingers, as that's not productive... but it makes me hesitant to try to help, especially in the beginning, when my work will be subpar and will probably miss a lot of details and nuances.
