
Add support for user/group dnode accounting & quota #3983

Merged: 2 commits into openzfs:master on Oct 7, 2016

Conversation

@jxiong (Contributor) commented Nov 4, 2015

This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "dn-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.

ZoL-bug-id: #3500

Signed-off-by: Johann Lombardi [email protected]
Signed-off-by: Jinshan Xiong [email protected]
Change-Id: I899ff446cbf2aa7e355e7d98a83d614f1cd4624b
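
A minimal sketch of the accounting-key layout described above, assuming the block-accounting convention of encoding the UID/GID as a hex string; the helper name is hypothetical, and note that the merged version of the patch renames the prefix from "dn-" to "obj-" (see the final commit message near the end of this thread):

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical helper: build the per-user dnode accounting key stored in
 * the DMU_USERUSED_OBJECT/DMU_GROUPUSED_OBJECT ZAPs, i.e. "dn-" followed
 * by the id rendered as a string (hex assumed, matching block accounting).
 */
static void
dnode_accounting_key(char *buf, size_t buflen, uint64_t id)
{
        (void) snprintf(buf, buflen, "dn-%llx", (unsigned long long)id);
}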

@jxiong (Contributor, Author) commented Nov 4, 2015

Sorry, I realized I kept creating new pull requests. Am I supposed to do this, or should I update the existing pull request with new patches? And if that's the case, how do I update a pull request?

Sorry again if I did this wrong.

@kernelOfTruth (Contributor) commented:
@jxiong you can force-push the same tree with the changes; that should do it :)

@behlendorf (Contributor) commented:
@jxiong go ahead and rebase your branch on the latest master, then just force-update your branch:

git push --force jxiong dnode_quota

@behlendorf added this to the 0.7.0 milestone Nov 11, 2015
@jxiong (Contributor, Author) commented Nov 12, 2015

@behlendorf that's exactly what I did to push my local branch to GitHub. But my question was: once I have an up-to-date branch on GitHub, I create a pull request by clicking the 'Create pull request' button, and a new request number is generated, which loses track of the previous request number. I thought there might be something like Gerrit that could group these requests together and assign a unique request number. Anyway, it seems I didn't do anything wrong; it's supposed to create a pull request with a new number every time.

@behlendorf (Contributor) commented:
@jxiong when you update your jxiong:dnode_quota branch at GitHub with a force push, the buildbot will be notified immediately and the change will be queued up for testing. There's no need to open a new pull request for the change through the GitHub web interface.

If you prefer you can open a new pull request with the refreshed patch instead and close out this one. That's usually the best solution if you want to preserve an older version of a patch.

Unfortunately, GitHub doesn't support the notion of multiple versions of a patch in the same pull request like Gerrit does.

If you get a chance to rebase this again that would be helpful.

@adilger (Contributor) commented Jan 7, 2016

@don-brady could you please review this patch so that it can be landed? Without this patch, Lustre dnode accounting is broken.

@don-brady (Contributor) commented:
I'd like to figure out the OpenZFS-sanctioned way to encapsulate new filesystem features. I will post back here once I figure out what that is.
We may also want to add Intel copyrights.

@@ -51,7 +51,11 @@ const char *zfs_userquota_prop_prefixes[] = {
"userused@",
"userquota@",
"groupused@",
"groupquota@"
"groupquota@",
"userdnused@",
Member commented:

I think that "dn" or "dnode" shouldn't be exposed in the user interface. Instead, we should use user-visible concepts like files or directories. If we are to introduce a new concept to users/sysadmins, I think it should be "objects" rather than "dnodes". An object being the conceptual entity, which has some on-disk metadata including the dnode.

Contributor Author commented:

Indeed, I will change it. How about userobjused and groupobjused?

Member commented:

Sounds good to me.
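
For reference, a sketch of what the prefix table from the hunk above might look like after the rename agreed on here; the exact entries and ordering in the merged code may differ:

const char *zfs_userquota_prop_prefixes[] = {
        "userused@",
        "userquota@",
        "groupused@",
        "groupquota@",
        "userobjused@",
        "userobjquota@",
        "groupobjused@",
        "groupobjquota@"
};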

@ahrens (Member) commented Jan 8, 2016

Do you plan to add documentation for this? e.g. manpages, help messages

@@ -827,6 +828,9 @@ dmu_objset_create_impl(spa_t *spa, dsl_dataset_t *ds, blkptr_t *bp,
os->os_phys->os_type = type;
if (dmu_objset_userused_enabled(os)) {
os->os_phys->os_flags |= OBJSET_FLAG_USERACCOUNTING_COMPLETE;
if (dmu_objset_userdnused_enabled(os))
os->os_phys->os_flags |=
OBJSET_FLAG_USERDNACCOUNTING_COMPLETE;
Member commented:

FYI, the cstyle here is wrong for illumos. Not sure if you care for ZoL. It should be:

<2 tabs>if (...) {
<3 tabs>os->os_phys->os_flags |=
<3 tabs + 4 space>OBJSET_...;
<2 tabs>}
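
Applied to the hunk above, that style would look roughly as follows (indentation levels are tabs in the real file, rendered here as 8 spaces, with a four-space continuation indent; this only illustrates the style comment, not the merged code):

        if (dmu_objset_userused_enabled(os)) {
                os->os_phys->os_flags |= OBJSET_FLAG_USERACCOUNTING_COMPLETE;
                if (dmu_objset_userdnused_enabled(os)) {
                        os->os_phys->os_flags |=
                            OBJSET_FLAG_USERDNACCOUNTING_COMPLETE;
                }
        }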

Contributor Author commented:

Will fix it.

@ahrens (Member) commented Jan 8, 2016

I don't see where the new quotas are being enforced.

@jxiong (Contributor, Author) commented Jan 8, 2016

@ahrens - this patch only accounts objects and enforcement will be implemented in a separate patch.

Thanks for the inspection; I will fix the issues you mentioned and push a new patch soon.

@adilger (Contributor) commented Jan 11, 2016

@ahrens could you please comment on what the right feature handling should be for this patch? Since this is a new ZPL feature, it isn't clear that it should have only a SPA-level feature flag (though I think that is also necessary because it adds flags to the dnode).

However, incrementing ZPL_VERSION is AFAIK not the right thing to do, so what is the right mechanism for handling new ZPL features?

@ahrens (Member) commented Jan 11, 2016

Assuming that the plan is to enforce the quotas, then the ZPL change would be to store the quotas.

Ideally, now would be the time to implement ZPL feature flags (like SPA feature flags), with this being the first one. If that’s too much work (which would be understandable), bumping the ZPL version would be reasonable. For better incompatibility detection with send streams to/from Solaris, we should probably reserve a range of version numbers for proprietary forks, e.g. versions 6 to 999, and then use version 1000 for your new feature.

Another option would be to not change the ZPL version number, allowing it to be received on systems that don’t know about user/group file count quotas. AFAICT, the only ill effect would be that the space used by the user/group file count quota objects would be unused (i.e. temporarily leaked until the filesystem is deleted). That doesn’t seem like a big deal, since these objects are likely to be very small. This would be the simplest approach. You'd want to test sending to older systems to verify that we haven't missed anything, but I'm pretty sure the older system would just ignore the quotas.

@bzzz77 (Contributor) commented Feb 11, 2016

Is it correct that in-flight (not-yet-committed) changes aren't visible, so one can easily overcommit a lot? Say the hardware is capable of doing 100K creates/sec and the commit timeout is set to 5s; then a user can exceed their quota by 500K objects?

@jxiong (Contributor, Author) commented Feb 12, 2016

@bzzz77 There is no object quota enforcement yet. This is indeed a problem because quota objects are updated at txg sync time.

@jxiong force-pushed the dnode_quota branch 2 times, most recently from c724856 to 76004c7 on February 13, 2016
@jxiong (Contributor, Author) commented Feb 18, 2016

There are a few test failures with the Debian build, and they share the same error message:

+ sudo -E zfs.sh
zfs.sh: Unload these modules with 'zfs.sh -u':
zfs zcommon zunicode znvpair zavl spl

Can anyone tell me how to look into this? Thanks

@behlendorf (Contributor) commented:
Usually the stdio log from the failing test is a good place to start. The test cases are also designed to be fairly easy to run locally.

runurl https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/bb-test-zconfig.sh

@jxiong force-pushed the dnode_quota branch 2 times, most recently from a3feb1c to afaa962 on February 24, 2016
@jxiong force-pushed the dnode_quota branch 2 times, most recently from 36e9ca6 to 19e6089 on March 21, 2016
@behlendorf (Contributor) commented:
@jxiong the zfs test suite failures can be resolved by adding this new feature flag to tests/zfs-tests/tests/functional/cli_root/zpool_get/zpool_get.cfg.

@jxiong (Contributor, Author) commented Sep 22, 2016

Yes, that makes sense. So the provided way to handle this is to list any dependencies as the last argument to zfeature_register() so they can be enabled. Take a look at what was done for large dnodes here. This patch needs the same fix and a minor update to man/man5/zpool-features.5 showing this dependency.

It's not necessarily a dependency here; I would say this is an implementation problem of SPA_FEATURE_EXTENSIBLE_DATASET, which doesn't check whether the feature is enabled before trying to operate on it.

@jxiong (Contributor, Author) commented Sep 22, 2016

I've done the upgrade test with 1M files and 100 clones, which makes for 100M objects to scan. The result is pretty positive.

The upgrade process completed in a few minutes; below are snapshots of the CPU utilization and txg sync times during the upgrade.

1732     812900658876     C     1425408      184320       1898496      44       146      1238087      1929021      1002310139   11897188    
1733     812901896963     C     383074304    24354816     195020800    5752     4749     1016140567   1793804      8944         1795831952  
1734     813918037530     C     1671168      315392       1964032      65       176      1804680      20742        1795812867   19102802    
1735     813919842210     C     686555136    37662720     348612608    8997     3073     1814940464   1462         4174         3215000903  
1736     815734782674     C     1219821568   39792640     618479616    9103     5379     3215013818   5140         34567        4562966300  
1737     818949796492     C     344064       0            835584       0        165      41520        18492376     4544475226   1459279     
1738     818949838012     C     1428996096   0            724422656    0        5941     4564432551   1380         3849         3244181970  
1739     823514270563     C     0            0            0            0        0        3244192255   1992         3623         70552       
1740     826758462818     C     0            0            0            0        0        78990        1065         4432         72006       
1741     826758541808     C     67125248     3451392      35303936     834      788      160899492    1926         12527        197324502   
1742     826919441300     C     73072640     6443008      38508544     1437     549      197342718    1768         4608         270667923   
1743     827116784018     C     688128       86016        1202176      21       140      8342         29496        270639887    4234264     
1744     827116792360     C     108150784    6569984      55859200     1496     582      274908185    1480         5747         407144305   
1745     827391700545     C     155795456    11030528     79820800     2637     1089     407156314    1290         5439         683679852   
1746     827798856859     C     240435200    16297984     122775552    3853     1373     683690648    1335         6027         1035756194  
1747     828482547507     C     368803840    24141824     187699200    5658     2082     1035768010   1614         4296         1440510843  
1748     829518315517     C     688128       24576        1163264      6        170      8601         31241        1440480845   2659305     
1749     829518324118     C     514818048    32927744     261695488    7755     2271     1443175252   1718         4934         2170657628  
1750     830961499370     C     688128       53248        1323008      13       183      8949         61562        2170596968   3435422     
1751     830961508319     C     789790720    27688960     400734208    6421     4182     2174098158   5472         4720         2761343932  
1752     833135606477     C     770048       0            1380352      0        117      317049       1054         2761039197   2446882     
1753     833135923526     C     784154624    0            397838336    0        3186     2763504927   1597         3928         1724829115  
1754     835899428453     C     0            0            0            0        0        1724840969   1609         3920         76067       
1755     837624269422     C     0            0            0            0        0        102035       1542         4954         55525       
1757     842623848959     C     0            0            0            0        0        4999998370   3433         43247        57187       
1758     847623847329     C     0            0            0            0        0        4999999086   2867         43037        51012       
1759     852623846415     C     0            0            0            0        0        4999996550   2847         43512        51852       
1760     857623842965     C     0            0            0            0        0        4999998055   2787         38308        50927       
1761     862623841020     C     0            0            0            0        0        4999999623   3082         43777        51965  

I also included the txg times from after the upgrade completed, for comparison.

 3478 root      20   0       0      0      0 R  94.6  0.0   2:28.68 txg_sync                                                         
 4003 root      20   0       0      0      0 D  36.8  0.0   1:35.78 z_upgrade                                                        
 3987 root      20   0       0      0      0 D  32.9  0.0   1:35.75 z_upgrade                                                        
  990 root      20   0       0      0      0 S  28.5  0.0   1:14.00 dbu_evict                                                        
 3999 root      20   0       0      0      0 D  22.2  0.0   1:35.85 z_upgrade                                                        
 3414 root      20   0       0      0      0 D  21.6  0.0   1:36.56 z_upgrade                                                        
 3991 root      20   0       0      0      0 D  21.6  0.0   1:35.64 z_upgrade                                                        
 3931 root      20   0       0      0      0 D  16.3  0.0   1:36.30 z_upgrade                                                        
 3995 root      20   0       0      0      0 D  11.6  0.0   1:35.91 z_upgrade                                                        
 3962 root      20   0       0      0      0 D   8.6  0.0   1:36.08 z_upgrade    

The CPU utilization went really high at the beginning of the upgrade and dropped to these numbers after a while.

I would say the upgrade process is really smooth.

@jxiong (Contributor, Author) commented Sep 22, 2016

Of course, I used a high-performance drive for testing; you would expect lower performance on sluggish HDDs.

@behlendorf (Contributor) left a comment:

How were you planning to trigger the upgrade for a mounted Lustre dataset? You might want to move that upgrade trigger out of the zpl layer and into shared code in the dsl layer.

Things are definitely much smoother in the latest version. As a test I created a simple striped pool with 8 conventional HDDs and a dataset with 1.4M files. I created 1000 clones, giving me roughly 1.4B files to upgrade.

For this configuration I observed the following:

  • Upgrade process took roughly 2 hours.
  • I was able to safely unmount/mount a filesystem while upgrading. The upgrade was safely stopped and restarted when remounted.
  • The ARC cache hit rate was excellent, 99%+.
  • The z_upgrade taskq threads often consumed 100% of their CPU. This isn't too surprising given the cache hit rate above.
  • The longest txg_sync time I observed was 45 seconds. That's still longer than we'd like, but the average was much closer to 8 seconds, and the larger values were fairly rare.
  • The upgrade wasn't 100% IO bound. There were idle periods where dbu_evict would take 100% of the CPU and evict half the ARC. The upgrade would then continue smoothly; this behavior is interesting but not caused by this patch.
  • Total performance was somewhat degraded during the upgrade but the system was still very usable.
  • Interactive performance remained good.

Given how stressful this test case was, it went remarkably smoothly. As soon as these last few issues get wrapped up and we get another review (or two), it should be ready to merge.

l l .
GUID org.zfsonlinux:userobj_accounting
READ\-ONLY COMPATIBLE yes
DEPENDENCIES none
Contributor commented:

DEPENDENCIES extensible_dataset

zfeature_register(SPA_FEATURE_USEROBJ_ACCOUNTING,
"org.zfsonlinux:userobj_accounting", "userobj_accounting",
"User/Group object accounting.",
ZFEATURE_FLAG_READONLY_COMPAT | ZFEATURE_FLAG_PER_DATASET, NULL);
@behlendorf (Contributor) commented Sep 22, 2016:

This change resolved the VERIFY I hit. When userobj_accounting is enabled, extensible_dataset will also get enabled if it hasn't been already.

diff --git a/module/zfs/zfeature_common.c b/module/zfs/zfeature_common.c
index 9de24d6..9c129da 100644
--- a/module/zfs/zfeature_common.c
+++ b/module/zfs/zfeature_common.c
@@ -253,8 +253,15 @@ zpool_feature_init(void)
            "Variable on-disk size of dnodes.",
            ZFEATURE_FLAG_PER_DATASET, large_dnode_deps);
        }
+       {
+       static const spa_feature_t userobj_accounting_deps[] = {
+               SPA_FEATURE_EXTENSIBLE_DATASET,
+               SPA_FEATURE_NONE
+       };
        zfeature_register(SPA_FEATURE_USEROBJ_ACCOUNTING,
            "org.zfsonlinux:userobj_accounting", "userobj_accounting",
            "User/Group object accounting.",
-           ZFEATURE_FLAG_READONLY_COMPAT | ZFEATURE_FLAG_PER_DATASET, NULL);
+           ZFEATURE_FLAG_READONLY_COMPAT | ZFEATURE_FLAG_PER_DATASET,
+           userobj_accounting_deps);
+       }
 }

Contributor Author commented:

Done

@jxiong (Contributor, Author) commented Sep 23, 2016

How were you planning to trigger the upgrade for a mounted Lustre dataset? You might want to move that upgrade trigger out of the zpl layer and into shared code in the dsl layer.

osd-zfs will be a user of the dataset just as the ZPL is. As long as the dmu_objset_userobjspace_upgrade symbol is exported, I can invoke it somewhere in the code, or it can be triggered from procfs, since it only needs to happen once in the lifetime of the filesystem.

@behlendorf (Contributor) commented:
I can invoke it somewhere in the code, or it can be triggered from procfs, since it only needs to happen once in the lifetime of the filesystem.

From a user's point of view, if all they need to do is enable the feature flag, that would be best. It's definitely what our admin would prefer when updating the clusters. And it means you wouldn't need to add an autoconf check for dmu_objset_userobjspace_upgrade() to Lustre.

@jxiong (Contributor, Author) commented Sep 23, 2016

From a user's point of view, if all they need to do is enable the feature flag, that would be best. It's definitely what our admin would prefer when updating the clusters.

Agreed, so I will make osd-zfs upgrade the targets without interaction from admins. If we're going to move the auto upgrade into the dsl layer, where would you suggest adding the code without adding extra complexity? The ideal location would be somewhere that is not super hot and where the dataset is known to be owned.

And it means you wouldn't need to add an autoconf check for dmu_objset_userobjspace_upgrade() to Lustre.

Sigh - we're going to need this anyway because osd-zfs has to know where to fetch the object accounting info.

@behlendorf (Contributor) commented:
where would you suggest adding the code without adding extra complexity?

My first inclination would be to place it in dmu_objset_sync(), right after the existing dmu_objset_userused_enabled() check. This isn't a super hot code path; we can easily afford a single extra conditional per objset per txg_sync. Since the objset is actively being synced, we know it must still be owned. And any action which dirties the objset in any way will trigger the upgrade.

we're going to need this anyway because osd-zfs has to know where to fetch the object accounting info.

Speaking of which, we should make sure whatever interfaces you're going to need for Lustre are added and those symbols get exported as part of this patch. The zfs_userspace_many() and zfs_userspace_one() functions look almost exactly like what you need; unfortunately they take a zfs_sb_t.
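
A minimal sketch of the placement being suggested here, using function names mentioned elsewhere in this thread (dmu_objset_userobjspace_present(), dmu_objset_userobjspace_upgrade()); the merged code may differ, and the ownership caveats discussed below complicate this:

        /* In dmu_objset_sync(), after the existing userused check (sketch). */
        if (dmu_objset_userused_enabled(os)) {
                /* ... existing user/group used accounting setup ... */

                /*
                 * One-time upgrade: populate userobj accounting if it is not
                 * present yet (a feature-enabled check would also be needed).
                 */
                if (!dmu_objset_userobjspace_present(os))
                        dmu_objset_userobjspace_upgrade(os);
        }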

@jxiong (Contributor, Author) commented Sep 23, 2016

Since the objset is actively being synced, we know it must still be owned. And any action which dirties the objset in any way will trigger the upgrade.

Actually, the dataset may not be owned at the time of dmu_objset_sync(). I ran into this issue before, and this is why I took the code out of it.

Let me reproduce it for you. I made this patch:

diff --git a/module/zfs/dmu_objset.c b/module/zfs/dmu_objset.c
index 7256870..e421392 100644
--- a/module/zfs/dmu_objset.c
+++ b/module/zfs/dmu_objset.c
@@ -1289,6 +1289,13 @@ dmu_objset_sync(objset_t *os, zio_t *pio, dmu_tx_t *tx)
                    offsetof(dnode_t, dn_dirty_link[txgoff]));
        }

+#if defined(_KERNEL)
+       if (os->os_dsl_dataset->ds_owner == NULL) {
+               cmn_err(CE_WARN, "objset is not owned\n");
+               dump_stack();
+       }
+#endif
+
        dmu_objset_sync_dnodes(&os->os_free_dnodes[txgoff], newlist, tx);
        dmu_objset_sync_dnodes(&os->os_dirty_dnodes[txgoff], newlist, tx);

And this is what I got by running the zfs-tests.sh script:

[100547.981733] WARNING: objset is not owned

[100547.981780] CPU: 0 PID: 11493 Comm: txg_sync Tainted: P           OE  ------------   3.10.0-327.10.1.el7_lustre.x86_64 #1
[100547.981782] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[100547.981783]  ffff8800b8252800 0000000016837dd0 ffff8800b88ebb70 ffffffff816354d4
[100547.981786]  ffff8800b88ebc48 ffffffffa06085f6 ffff880000000001 ffff8800b88ebbdc
[100547.981787]  ffffffffa06075a0 0000000000000000 0000000000000000 ffffffffa0608610
[100547.981790] Call Trace:
[100547.981945]  [<ffffffff816354d4>] dump_stack+0x19/0x1b
[100547.981970]  [<ffffffffa06085f6>] dmu_objset_sync+0x346/0x360 [zfs]
[100547.981981]  [<ffffffffa06075a0>] ? recordsize_changed_cb+0x20/0x20 [zfs]
[100547.981995]  [<ffffffffa0608610>] ? dmu_objset_sync+0x360/0x360 [zfs]
[100547.982010]  [<ffffffffa061c281>] dsl_dataset_sync+0x71/0x300 [zfs]
[100547.982024]  [<ffffffffa061c215>] ? dsl_dataset_sync+0x5/0x300 [zfs]
[100547.982041]  [<ffffffffa062db43>] dsl_pool_sync+0xa3/0x430 [zfs]
[100547.982059]  [<ffffffffa064882f>] spa_sync+0x2df/0xac0 [zfs]
[100547.982063]  [<ffffffff81647831>] ? ftrace_call+0x5/0x2f
[100547.982067]  [<ffffffff8163cd25>] ? _raw_spin_unlock_irqrestore+0x5/0x40
[100547.982087]  [<ffffffffa065a6d5>] txg_sync_thread+0x3c5/0x620 [zfs]
[100547.982092]  [<ffffffffa0491dba>] ? spl_kmem_free+0x2a/0x40 [spl]
[100547.982111]  [<ffffffffa065a310>] ? txg_init+0x280/0x280 [zfs]
[100547.982114]  [<ffffffffa0493f01>] thread_generic_wrapper+0x71/0x80 [spl]
[100547.982118]  [<ffffffffa0493e90>] ? __thread_exit+0x20/0x20 [spl]
[100547.982195]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
[100547.982198]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[100547.982202]  [<ffffffff81645bd8>] ret_from_fork+0x58/0x90
[100547.982204]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140

And then after a while, the node crashed:

[100547.983598] BUG: unable to handle kernel NULL pointer dereference at 00000000000002a0
[100547.985715] IP: [<ffffffffa0608464>] dmu_objset_sync+0x1b4/0x360 [zfs]
[100547.987258] PGD 11dc1d067 PUD 1070d4067 PMD 0
[100547.988455] Oops: 0000 [#1] SMP
[100547.989493] Modules linked in: zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate dm_mod loop ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter snd_seq_midi snd_seq_midi_event coretemp crc32_pclmul snd_ens1371 snd_rawmidi snd_ac97_codec ppdev ac97_bus ghash_clmulni_intel snd_seq snd_seq_device snd_pcm aesni_intel lrw gf128mul glue_helper ablk_helper cryptd vmw_balloon sg pcspkr snd_timer i2c_piix4 snd soundcore vmw_vmci shpchp parport_pc parport
[100548.006171]  nfsd binfmt_misc auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi sd_mod crc_t10dif crct10dif_generic vmwgfx crct10dif_pclmul crct10dif_common drm_kms_helper ttm crc32c_intel drm ata_piix mptspi serio_raw scsi_transport_spi libata mptscsih mptbase e1000 i2c_core vmhgfs(OE) [last unloaded: zunicode]

Even os->os_dsl_dataset could be NULL.

I don't know how this could happen, though. I would appreciate your and @ahrens's input.

@jxiong (Contributor, Author) commented Sep 25, 2016

It turned out that the objset can only be the MOS when os->os_dsl_dataset is NULL in dmu_objset_sync(), and only a newly created dataset can be in a disowned state in that function. Please let me know if there are other cases I have missed here.

I would say it's doable to launch the upgrade thread in dmu_objset_sync(), but we need to be careful about when to stop the upgrade thread in dmu_objset_disown(), because txg writeback can happen after the objset is disowned, so there is a race condition that has to be handled carefully.

@ahrens (Member) left a comment:

I think that when we run zfs userspace, we need to initiate gathering the new info, and also wait for that to complete. If we do not wait, then we are showing them inaccurate info.

@jxiong (Contributor, Author) commented Sep 29, 2016

I think that when we run zfs userspace, we need to initiate gathering the new info, and also wait for that to complete. If we do not wait, then we are showing them inaccurate info.

@ahrens can you please point out the code in question?

@ahrens (Member) commented Sep 29, 2016

@jxiong It looks like this functionality was unfortunately removed when the tool was rewritten from Python to C. In the old Python implementation, userland would check the "useraccounting" property to determine whether the accounting had been gathered, and if not, it would issue the ZFS_IOC_USERSPACE_UPGRADE ioctl. This is from do_userspace() in userspace.py:

    if not ds.getprop("useraccounting"):
        print(_("Initializing accounting information on old filesystem, please wait..."))
        ds.userspace_upgrade()

It does this before trying to get any of the accounting values from the kernel. If this code is not run (e.g. because it was removed when porting to C), and the accounting values have not yet been fully gathered, then the ZFS_IOC_USERSPACE_MANY ioctl will fail with ENOTSUP. This happens in zfs_userspace_many().
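
A hedged sketch of what restoring that behavior might look like in the C implementation of zfs userspace: check the hidden "useraccounting" property and trigger the upgrade before querying. The zfs_userspace_upgrade() wrapper is hypothetical (standing in for the ZFS_IOC_USERSPACE_UPGRADE ioctl); only the property check mirrors the Python code above.

        /* Sketch only: mirror the old do_userspace() behavior in C. */
        if (zfs_prop_get_int(zhp, ZFS_PROP_USERACCOUNTING) == 0) {
                (void) printf("Initializing accounting information on "
                    "old filesystem, please wait...\n");
                /* Hypothetical helper wrapping ZFS_IOC_USERSPACE_UPGRADE. */
                (void) zfs_userspace_upgrade(zhp);
        }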

@behlendorf (Contributor) commented:
This is something @ahrens and I talked about at the OpenZFS summit. The concern was that zfs userspace would report partial object quotas to the user while the update was still in progress. We want to make sure this interface never returns the wrong value, so the suggestion was to either:

  • Update zfs userspace so it blocks until the update is complete. This could be done in zfs_userspace_many() and zfs_ioc_userspace_one(), or
  • Update the zfs userspace status to indicate the upgrade is still in progress and return nothing.

Looking at the code, it does look like you're already handling this in zfs_userspace_one() and zfs_userspace_many() by returning ENOTSUP like the original code. But this must not be working, because I was definitely able to see partial objquota values when running zfs userspace in my manual testing.

@jxiong (Contributor, Author) commented Sep 30, 2016

@behlendorf @ahrens I think the current implementation already handles this case well. Take a look at the implementation of zfs_userspace_one():

        if ((type == ZFS_PROP_USEROBJUSED || type == ZFS_PROP_GROUPOBJUSED ||
            type == ZFS_PROP_USEROBJQUOTA || type == ZFS_PROP_GROUPOBJQUOTA) &&
            !dmu_objset_userobjspace_present(zsb->z_os))
                return (SET_ERROR(ENOTSUP));

It checks whether the userobj accounting feature is present before going forward, and it can only be present after the upgrade process is complete:

dmu_objset_userobjspace_upgrade_cb() {
        ...
        os->os_flags |= OBJSET_FLAG_USEROBJACCOUNTING_COMPLETE;
        txg_wait_synced(dmu_objset_pool(os), 0);
        return (0);
}

But this must not be working, because I was definitely able to see partial objquota values when running zfs userspace in my manual testing.

I guess what you have seen might be a case where the objects had been created in memory but the corresponding txg hadn't been synced yet?

@behlendorf (Contributor) commented:
I guess what you have seen might be a case where the objects had been created in memory but the corresponding txg hadn't been synced yet?

I don't think so, because I wasn't creating any new files in the filesystem while doing the upgrade, although that would be a good test case. And I agree the code looks like it should handle this case. The only thing I was doing concurrently was mounting/unmounting the filesystem, which may be related.

@jxiong (Contributor, Author) commented Sep 30, 2016

I don't think so, because I wasn't creating any new files in the filesystem while doing the upgrade, although that would be a good test case. And I agree the code looks like it should handle this case. The only thing I was doing concurrently was mounting/unmounting the filesystem, which may be related.

It turned out that the issue is due to the ordering in the syncing context, which calls dmu_objset_sync() prior to dmu_objset_do_userquota_updates(), so there is a small window where the present flag is visible while there are still some pending updates to the userobj accounting object. This issue is easier to hit when #4642 is not in place, which is probably when you saw it.

However, I tend to think this issue doesn't have to be fixed at all, because this count is naturally inaccurate in ZFS.

@behlendorf (Contributor) commented:
It turned out that the issue is due to the ordering in the syncing context, which calls dmu_objset_sync() prior to dmu_objset_do_userquota_updates(), so there is a small window where the present flag is visible while there are still some pending updates to the userobj accounting object.

That would make sense, and it would explain why it was fairly rare for me to observe this issue, and why, when I did observe it, the values were always fairly close to the expected values. This issue would then also exist in the existing quota upgrade logic.

OK, then this should be ready to merge. But it would be great if we could get at least one other reviewer to approve it.

@behlendorf (Contributor) commented:
@jxiong can you rebase this patch one last time to resolve the minor conflicts with 5cc78dc, which was just merged?

Jinshan Xiong added 2 commits October 4, 2016 11:46
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.

ZoL-bug-id: openzfs#3500

Signed-off-by: Jinshan Xiong <[email protected]>
Signed-off-by: Johann Lombardi <[email protected]>
…ta_updates

Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().

The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.

We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".

This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.

Upstream bugs: DLPX-44799
Ported by: Ned Bass <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6988
ZFSonLinux-issue: openzfs#4642
OpenZFS-commit: unmerged

Porting notes:
- Added curly braces around declaration of userquota_cache_t cache to
  quiet compiler warning;
- Handled the userobj accounting the same way as proposed in this patch.

Signed-off-by: Jinshan Xiong <[email protected]>
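
The aggregation approach described in the second commit message can be illustrated with a small standalone sketch (toy code, not ZFS internals; the real patch uses the userquota_cache_t mentioned in the porting notes above): deltas are accumulated in memory per owner, then applied once per owner instead of once per modified object.

#include <stdio.h>
#include <stdint.h>

#define CACHE_SLOTS 1024        /* toy open-addressing hash table */

typedef struct {
        uint64_t id;            /* user or group id */
        int64_t delta;          /* accumulated space/object delta */
        int used;
} uq_entry_t;

static uq_entry_t cache[CACHE_SLOTS];

/* Accumulate a delta for one owner; called once per modified object. */
static void
uq_cache_add(uint64_t id, int64_t delta)
{
        uint64_t slot = id % CACHE_SLOTS;

        while (cache[slot].used && cache[slot].id != id)
                slot = (slot + 1) % CACHE_SLOTS;        /* linear probing */
        cache[slot].id = id;
        cache[slot].delta += delta;
        cache[slot].used = 1;
}

/* Apply each owner's total once, standing in for zap_increment_int(). */
static void
uq_cache_flush(void)
{
        for (int i = 0; i < CACHE_SLOTS; i++) {
                if (cache[i].used) {
                        printf("owner %llu: apply delta %lld in one update\n",
                            (unsigned long long)cache[i].id,
                            (long long)cache[i].delta);
                        cache[i].used = 0;
                        cache[i].delta = 0;
                }
        }
}

int
main(void)
{
        /* Two million objects owned by one user collapse to a single update. */
        for (int i = 0; i < 2000000; i++)
                uq_cache_add(1000, 512);
        uq_cache_flush();
        return (0);
}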
@behlendorf merged commit 9b7a83c into openzfs:master Oct 7, 2016
@behlendorf (Contributor) commented:
@jxiong merged, thanks for all your hard work on this!

@jxiong deleted the dnode_quota branch October 7, 2016 20:26
@tomoyat1 commented Feb 3, 2017

I have been trying to port this over to FreeBSD for my needs, but I've run into the taskq features specific to the SPL (taskq_wait_id() and taskq_cancel_id()).
Are there currently any ideas or plans for how the parts using those features will be ported to other platforms, or perhaps any plans to get this PR upstreamed to illumos / ported to other OpenZFS implementations? I am willing to help out.

Labels: Type: Feature (feature request or new feature)
8 participants