Add support for user/group dnode accounting & quota #3983
Conversation
Sorry, I realized I kept creating new pull requests. Am I supposed to do this, or should I update the existing pull request with new patches? If the latter, how do I update a pull request? Sorry again if I did this wrong.
@jxiong you can force-push the same tree with the changes, that should do it :)
@jxiong go ahead and rebase your branch on the latest master, then just force-update your branch.
@behlendorf that's exactly what I did to push my local branch to GitHub. But my question was: once I have an up-to-date branch on GitHub, I create a pull request by clicking the 'Create pull request' button, but a new request number is generated, which loses track of the previous request number. I thought there was something like Gerrit that could group these requests together and assign a unique request number. Anyway, it seems I didn't do anything wrong and it's supposed to create a pull request with a new number every time.
@jxiong when you update your jxiong:dnode_quota branch at GitHub with a force push, the buildbot will be immediately notified and the change will be queued up for testing. There's no need to open a new pull request for the change through the GitHub web interface. If you prefer, you can open a new pull request with the refreshed patch instead and close out this one; that's usually the best solution if you want to preserve an older version of a patch. Unfortunately, GitHub doesn't support the notion of multiple versions of a patch in the same pull request like Gerrit does. If you get a chance to rebase this again that would be helpful.
@don-brady could you please review this patch so that it can be landed? Without this patch, Lustre dnode accounting is broken.
I would like to figure out the OpenZFS-sanctioned way to encapsulate new file system features. Will post back here once I figure out what that is.
```
@@ -51,7 +51,11 @@ const char *zfs_userquota_prop_prefixes[] = {
 	"userused@",
 	"userquota@",
 	"groupused@",
-	"groupquota@"
+	"groupquota@",
+	"userdnused@",
```
I think that "dn" or "dnode" shouldn't be exposed in the user interface. Instead, we should use user-visible concepts like files or directories. If we are to introduce a new concept to users/sysadmins, I think it should be "objects" rather than "dnodes". An object is the conceptual entity, which has some on-disk metadata, including the dnode.
Indeed, I will change it. How about userobjused and groupobjused?
Sounds good to me.
Do you plan to add documentation for this? e.g. manpages, help messages
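For reference, with the rename agreed above the prefix table would presumably end up looking something like this. This is a sketch only; the exact set and ordering of the new entries is an assumption, not the committed hunk:

```
/* Sketch: zfs_userquota_prop_prefixes[] after the dn -> obj rename. */
const char *zfs_userquota_prop_prefixes[] = {
	"userused@",
	"userquota@",
	"groupused@",
	"groupquota@",
	"userobjused@",		/* renamed from "userdnused@" */
	"groupobjused@",	/* assumed group counterpart */
};
```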
```
@@ -827,6 +828,9 @@ dmu_objset_create_impl(spa_t *spa, dsl_dataset_t *ds, blkptr_t *bp,
 	os->os_phys->os_type = type;
 	if (dmu_objset_userused_enabled(os)) {
 		os->os_phys->os_flags |= OBJSET_FLAG_USERACCOUNTING_COMPLETE;
+		if (dmu_objset_userdnused_enabled(os))
+			os->os_phys->os_flags |=
+			OBJSET_FLAG_USERDNACCOUNTING_COMPLETE;
```
FYI, cstyle here is wrong for illumos. Not sure if you care for ZoL. It should be:
<2 tabs>if (...) {
<3 tabs>os->os_phys->os_flags |=
<3 tabs + 4 space>OBJSET_...;
<2 tabs>}
Will fix it.
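For concreteness, applying the style described above to the hunk in question would give roughly the following (a sketch of the suggested formatting, not the final committed code):

```
	os->os_phys->os_type = type;
	if (dmu_objset_userused_enabled(os)) {
		os->os_phys->os_flags |= OBJSET_FLAG_USERACCOUNTING_COMPLETE;
		/* Continuation line indented 4 spaces past the statement. */
		if (dmu_objset_userdnused_enabled(os)) {
			os->os_phys->os_flags |=
			    OBJSET_FLAG_USERDNACCOUNTING_COMPLETE;
		}
	}
```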
I don't see where the new quotas are being enforced.
@ahrens - this patch only does object accounting; enforcement will be implemented in a separate patch. Thanks for the review, I will fix the issues you mentioned and push a new patch soon.
@ahrens could you please comment on what the right feature handling should be for this patch. Since this is a new ZPL feature, it isn't clear that it should have only a SPA-level feature flag (though I think this is also necessary, because it adds flags to the dnode). However, incrementing
Assuming that the plan is to enforce the quotas, then the ZPL change would be to store the quotas. Ideally, now would be the time to implement ZPL feature flags (like SPA feature flags), with this being the first one. If that's too much work (which would be understandable), bumping the ZPL version would be reasonable.

For better incompatibility detection with send streams to/from Solaris, we should probably reserve a range of version numbers for proprietary forks, e.g. versions 6 to 999, and then use version 1000 for your new feature.

Another option would be to not change the ZPL version number, allowing it to be received on systems that don't know about user/group file count quotas. AFAICT, the only ill effect would be that the space used by the user/group file count quota objects would be unused (i.e. temporarily leaked until the filesystem is deleted). That doesn't seem like a big deal, since these objects are likely to be very small. This would be the simplest approach. You'd want to test sending to older systems to verify that we haven't missed anything, but I'm pretty sure the older system would just ignore the quotas.
Is it correct that in-flight (not-yet-committed) changes aren't visible, so one can easily overcommit by a lot? Say the hardware is capable of doing 100K creates/sec and the commit timeout is set to 5s; then the user can exceed the quota by 500K objects?
@bzzz77 There is no object quota enforcement yet. This is indeed a problem because quota objects are updated at txg sync time.
Force-pushed from c724856 to 76004c7
There are a few test failures with the Debian build, and they all share the same error message:
Can anyone tell me how to look into this? Thanks
Usually the stdio log from the failing test is a good place to start. The test cases are also designed to be fairly easy to run locally.
runurl https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/bb-test-zconfig.sh
Force-pushed from a3feb1c to afaa962
Force-pushed from 36e9ca6 to 19e6089
@jxiong the zfs test suite failures can be resolved by adding this new feature flag to
It's not necessarily a dependency here - I would say this is an implementation problem of
I've done the upgrade test with 1M files and 100 clones, which makes for 100M objects to scan. The result is pretty positive: the upgrade process can be completed in a few minutes. These are the snapshots of CPU utilization and txg sync time during the upgrade.
I also included the txg time after the upgrade is done for comparison.
The CPU utilization went really high at the beginning of the upgrade and dropped to these numbers after a while. I would say the upgrade process is really smooth.
Of course I used a high-performance drive for testing. You would expect lower performance on sluggish HDDs.
How were you planning to trigger the upgrade for a mounted Lustre dataset? You might want to move that upgrade trigger out of the ZPL layer and into shared code in the DSL layer.
Things are definitely much smoother in the latest version. As a test I created a simple striped pool with 8 conventional HDDs and a dataset with 1.4M files. Creating 1000 clones gives me roughly 1.4B files to upgrade.
For this configuration I observed the following:
- Upgrade process took roughly 2 hours.
- I was able to safely unmount/mount a filesystem while upgrading. The upgrade was safely stopped and restarted when remounted.
- The ARC cache hit rate was excellent 99%+.
- The z_upgrade taskq threads often consumed 100% of their CPU. This isn't too surprising given the cache hit rate above.
- The longest txg_sync time I observed was 45 seconds. That's still longer than we'd like, but the average was much closer to 8 seconds, and the larger values were fairly rare.
- The upgrade wasn't 100% IO bound. There were idle periods where dbu_evict would take 100% of the CPU and evict half the ARC. The upgrade would then continue smoothly; this behavior is interesting but not caused by this patch.
- Total performance was somewhat degraded during the upgrade but the system was still very usable.
- Interactive performance remained good.
Given how stressful this test case was, it went remarkably smoothly. As soon as these last few issues get wrapped up and we get another review (or two), it should be ready to merge.
```
l l .
GUID	org.zfsonlinux:userobj_accounting
READ\-ONLY COMPATIBLE	yes
DEPENDENCIES	none
```
DEPENDENCIES extensible_dataset
```
	zfeature_register(SPA_FEATURE_USEROBJ_ACCOUNTING,
	    "org.zfsonlinux:userobj_accounting", "userobj_accounting",
	    "User/Group object accounting.",
	    ZFEATURE_FLAG_READONLY_COMPAT | ZFEATURE_FLAG_PER_DATASET, NULL);
```
This change resolved the VERIFY I hit. When userobj_accounting is enabled, extensible_dataset will also get enabled if it hasn't been already.
```
diff --git a/module/zfs/zfeature_common.c b/module/zfs/zfeature_common.c
index 9de24d6..9c129da 100644
--- a/module/zfs/zfeature_common.c
+++ b/module/zfs/zfeature_common.c
@@ -253,8 +253,15 @@ zpool_feature_init(void)
 	    "Variable on-disk size of dnodes.",
 	    ZFEATURE_FLAG_PER_DATASET, large_dnode_deps);
 	}
+	{
+	static const spa_feature_t userobj_accounting_deps[] = {
+		SPA_FEATURE_EXTENSIBLE_DATASET,
+		SPA_FEATURE_NONE
+	};
 	zfeature_register(SPA_FEATURE_USEROBJ_ACCOUNTING,
 	    "org.zfsonlinux:userobj_accounting", "userobj_accounting",
 	    "User/Group object accounting.",
-	    ZFEATURE_FLAG_READONLY_COMPAT | ZFEATURE_FLAG_PER_DATASET, NULL);
+	    ZFEATURE_FLAG_READONLY_COMPAT | ZFEATURE_FLAG_PER_DATASET,
+	    userobj_accounting_deps);
+	}
 }
```
Done
osd-zfs will be a user of the dataset just as the ZPL is. As long as I have the exported symbol
From a user's point of view, if all they need to do is enable the feature flag, that would be best. It's definitely what our admins would prefer when updating the clusters. And it means you wouldn't need to add an autoconf check for
Agreed, so I will make osd-zfs upgrade the targets without interaction from admins. If we're going to move the auto-upgrade into the DSL layer, where would you suggest adding the code without adding extra complexity? The ideal location would be somewhere that is not super hot and where it's known that the dataset is owned.
Sigh - we're going to need this anyway because osd-zfs has to know where to fetch the object accounting info.
My first inclination would be to place it in
Speaking of which, we should make sure whatever interfaces you're going to need for Lustre are added and those symbols get exported as part of this patch. The
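As an illustration of the kind of exported interface being discussed, a consumer such as osd-zfs might check for and kick off the upgrade roughly like this. This is a sketch only; the helper names dmu_objset_userobj_space_present() and dmu_objset_userobj_space_upgrade() are hypothetical placeholders, not necessarily the symbols this patch exports:

```
/*
 * Sketch of how an out-of-tree consumer (e.g. osd-zfs) might drive the
 * object-accounting upgrade through exported symbols.  The two helper
 * names below are hypothetical placeholders.
 */
static int
osd_check_userobj_accounting(objset_t *os)
{
	/* Nothing to do if the accounting has already been gathered. */
	if (dmu_objset_userobj_space_present(os))	/* hypothetical */
		return (0);

	/*
	 * Otherwise kick off the background upgrade; the per-user/group
	 * object counts are only trustworthy once it completes.
	 */
	dmu_objset_userobj_space_upgrade(os);		/* hypothetical */
	return (EINPROGRESS);
}
```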
Actually, the dataset may not be owned at that point. Let me reproduce it for you. I made this patch:
And this is what I got by running the zfs-tests.sh script:
And then after a while, the node crashed:
I don't even know how this could happen, though. I would appreciate your and @ahrens's input.
It turned out that the objset can only be the MOS for that case. I would say it's doable to launch the upgrade thread in
I think that when we run zfs userspace, we need to initiate gathering the new info, and also wait for that to complete. If we do not wait, then we are showing them inaccurate info.
@ahrens can you please point out the code in question?
@jxiong It looks like unfortunately this functionality was removed when it was rewritten from python to C. In the old python implementation, userland would check the "useraccounting" property to determine if the accounting had been gathered, and if not, it would do the ZFS_IOC_USERSPACE_UPGRADE ioctl. This is from
It does this before trying to get any of the accounting values from the kernel. If this code is not run (e.g. because it was removed when porting to C), and the accounting values have not yet been fully gathered, then the
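The referenced snippet isn't preserved above, but the flow being described was roughly the following, sketched here in C against the raw ioctl rather than the original python; the property lookup and error handling are assumptions:

```
#include <string.h>
#include <sys/ioctl.h>
#include <sys/zfs_ioctl.h>	/* zfs_cmd_t, ZFS_IOC_USERSPACE_UPGRADE */

/*
 * Sketch: if the accounting has not been gathered yet (the old tool
 * checked the hidden "useraccounting" property), issue
 * ZFS_IOC_USERSPACE_UPGRADE so the kernel gathers it and waits for
 * completion before any per-user values are read.
 */
static int
maybe_upgrade_userspace(int zfs_fd, const char *dsname, int accounting_done)
{
	zfs_cmd_t zc = {"\0"};

	if (accounting_done)
		return (0);	/* nothing to do */

	(void) strlcpy(zc.zc_name, dsname, sizeof (zc.zc_name));
	/* Blocks until the accounting objects are fully populated. */
	return (ioctl(zfs_fd, ZFS_IOC_USERSPACE_UPGRADE, &zc));
}
```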
This is something @ahrens and I talked about at the OpenZFS summit. The concern was that
Looking at the code it does look like you're already handling this in
@behlendorf @ahrens I think the current implementation is already handling this case well. Taking a look at the implementation of
It checks if the userobj accounting feature is present before going forward, and it can only be present after the upgrade process is complete:
I guess what you have seen might be the case where the objects had been created in memory but the corresponding txg hadn't been synced yet?
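For concreteness, the kind of guard being described is roughly the following sketch; the flag name is taken from the earlier hunk in this PR and may have been renamed along with the dn-to-obj rename:

```
/*
 * Sketch: only trust the per-user/group object counts once the objset
 * has been marked as fully upgraded.  The flag name follows the earlier
 * diff in this PR; the merged code may use a renamed constant.
 */
static boolean_t
userobj_accounting_complete(objset_t *os)
{
	return ((os->os_phys->os_flags &
	    OBJSET_FLAG_USERDNACCOUNTING_COMPLETE) != 0);
}
```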
I don't think so, because I wasn't creating any new files in the filesystem while doing the upgrade. Although that would be a good test case. And I agree the code does look like it should handle this case. The only thing I was doing concurrently was mounting/unmounting the filesystem, which may be related.
It turned out that the issue is due to the call sequence in the syncing context. However, I tend to think this issue doesn't have to be fixed at all, because this count is naturally inaccurate in ZFS.
That would make sense, and it would explain why it was fairly rare for me to observe this issue, and why, when I did observe it, the values were always fairly close to the expected values. This issue would then also exist in the existing quota upgrade logic. OK, then I think this should be ready to merge. But it would be great if we could get at least one other reviewer to approve this.
This patch tracks dnode usage for each user/group in the DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode accounting have the key prefixed with "obj-" followed by the UID/GID in string format (as done for the block accounting).

A new SPA feature has been added for dnode accounting as well as a new ZPL version. The SPA feature must be enabled in the pool before upgrading the zfs filesystem. During the zfs version upgrade, a "quotacheck" will be executed by marking all dnodes as dirty.

ZoL-bug-id: openzfs#3500
Signed-off-by: Jinshan Xiong <[email protected]>
Signed-off-by: Johann Lombardi <[email protected]>
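As an illustration of the key scheme described in that commit message, building such a ZAP key might look roughly like this; the "obj-" prefix and the hex formatting mirror the message and the existing block-accounting keys, but the exact macro and number format here are assumptions:

```
#include <stdio.h>
#include <stdint.h>

/*
 * Sketch: build a per-user object-accounting ZAP key, i.e. a fixed
 * prefix followed by the UID or GID rendered as a string.  Format
 * details are assumed, not taken from the committed code.
 */
static void
userobj_zap_key(char *buf, size_t buflen, uint64_t id)
{
	(void) snprintf(buf, buflen, "obj-%llx", (unsigned long long)id);
}
```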
…ta_updates

Using a benchmark which creates 2 million files in one TXG, I observe that the thread running spa_sync() is on CPU almost the entire time we are syncing, and therefore can be a performance bottleneck. About 50% of the time in spa_sync() is in dmu_objset_do_userquota_updates().

The problem is that dmu_objset_do_userquota_updates() calls zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was modified (or created). In this benchmark, all the files are owned by the same user/group, so all 2 million calls to zap_increment_int() are modifying the same entry in the zap. The same issue exists for the DMU_GROUPUSED_OBJECT.

We should keep an in-memory map from user to space delta while we are syncing, and when we finish, iterate over the in-memory map and modify the ZAP once per entry. This reduces the number of calls to zap_increment_int() from "number of objects modified" to "number of owners/groups of modified files". This reduced the time spent in spa_sync() in the file create benchmark by ~33%, from 11 seconds to 7 seconds.

Upstream bugs: DLPX-44799
Ported by: Ned Bass <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/6988
ZFSonLinux-issue: openzfs#4642
OpenZFS-commit: unmerged

Porting notes:
- Added curly braces around declaration of userquota_cache_t cache to quiet compiler warning;
- Handled the userobj accounting the same way as proposed in this patch.

Signed-off-by: Jinshan Xiong <[email protected]>
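A minimal sketch of the aggregation idea described in that commit message, assuming an AVL tree keyed by user id; the real port uses its own userquota_cache_t layout, so the type and function names here are illustrative:

```
#include <sys/avl.h>
#include <sys/kmem.h>
#include <sys/dmu.h>
#include <sys/zap.h>

/*
 * Sketch: accumulate per-user space deltas in memory during the sync,
 * then apply each total with a single zap_increment_int() call instead
 * of one call per modified object.  The tree is assumed to have been
 * set up with avl_create(tree, uq_compare, sizeof (uq_node_t),
 * offsetof(uq_node_t, uqn_link)).
 */
typedef struct uq_node {
	avl_node_t	uqn_link;
	uint64_t	uqn_id;		/* user (or group) id */
	int64_t		uqn_delta;	/* accumulated space delta */
} uq_node_t;

static int
uq_compare(const void *a, const void *b)
{
	const uq_node_t *na = a;
	const uq_node_t *nb = b;

	if (na->uqn_id < nb->uqn_id)
		return (-1);
	return (na->uqn_id > nb->uqn_id);
}

/* Record one object's delta in the in-memory map. */
static void
uq_add_delta(avl_tree_t *tree, uint64_t id, int64_t delta)
{
	uq_node_t search = { .uqn_id = id };
	avl_index_t where;
	uq_node_t *node = avl_find(tree, &search, &where);

	if (node == NULL) {
		node = kmem_zalloc(sizeof (*node), KM_SLEEP);
		node->uqn_id = id;
		avl_insert(tree, node, where);
	}
	node->uqn_delta += delta;
}

/* After processing all dirty objects, flush one ZAP update per user. */
static void
uq_flush(objset_t *os, avl_tree_t *tree, dmu_tx_t *tx)
{
	uq_node_t *node;
	void *cookie = NULL;

	while ((node = avl_destroy_nodes(tree, &cookie)) != NULL) {
		VERIFY0(zap_increment_int(os, DMU_USERUSED_OBJECT,
		    node->uqn_id, node->uqn_delta, tx));
		kmem_free(node, sizeof (*node));
	}
}
```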
@jxiong merged, thanks for all your hard work on this!
I have been trying to port this over to FreeBSD for my needs, but I've run into the taskq features specific to the SPL (taskq_wait_id() and taskq_cancel_id()).
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "dn-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnodes as dirty.
ZoL-bug-id: #3500
Signed-off-by: Johann Lombardi [email protected]
Signed-off-by: Jinshan Xiong [email protected]
Change-Id: I899ff446cbf2aa7e355e7d98a83d614f1cd4624b