Commit

Add zfs_sb_prune_compat() function
For kernels which do not implement a per-superblock shrinker, the
shrink_dcache_parent() function was used to attempt to reclaim
dentries.  This was found not to be entirely reliable, which could
lead to performance issues on older kernels running metadata-heavy
workloads.

To address this issue, a zfs_sb_prune_compat() function has been
added to implement this functionality.  It works by traversing
the list of znodes for a filesystem and adding them to a private
list with a reference held.  The private list can then be safely
walked outside the z_znodes_lock to prune dentries and drop the
last reference so the inode can be freed.

This provides the same synchronous behavior as the per-filesystem
shrinker and has the advantage of depending only on long-standing
interfaces.

The number of threads in the iput taskq has also been increased
to speed up the handling of asynchronous iputs.  This improves
the rate of metadata reclaim regardless of the kernel version.

Signed-off-by: Brian Behlendorf <[email protected]>
behlendorf committed Jun 17, 2015
1 parent 036391c commit 78aa2b1
Showing 4 changed files with 75 additions and 13 deletions.
1 change: 1 addition & 0 deletions include/sys/zfs_znode.h
@@ -211,6 +211,7 @@ typedef struct znode {
nvlist_t *z_xattr_cached; /* cached xattrs */
struct znode *z_xattr_parent; /* xattr parent znode */
list_node_t z_link_node; /* all znodes in fs link */
list_node_t z_prune_node; /* znodes being pruned link */

@dweeezil Jun 18, 2015

Could we conditionally not compile this? It would be nice to save the space on systems which don't need it. Ditto for its other uses outside the prune path.

@behlendorf (Author) Jun 18, 2015

Absolutely, good thought. I originally wanted to avoid adding this field entirely by using zfs_iput_async(), but I ran into lock contention problems when pushing everything through the taskq. Wrapping it with a conditional is almost as good.
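
A minimal sketch of that conditional, reusing the HAVE_SHRINK / HAVE_SPLIT_SHRINKER_CALLBACK guards from zfs_vfsops.c below (how the field and its users would be wrapped is an assumption, not part of this commit):

/*
 * Hypothetical: only compile z_prune_node on kernels which lack a
 * per-filesystem shrinker and therefore need the compat prune path.
 * The list_link_init()/ASSERT uses in zfs_znode.c would need the
 * same guard.
 */
#if !defined(HAVE_SHRINK) && !defined(HAVE_SPLIT_SHRINKER_CALLBACK)
	list_node_t	z_prune_node;	/* znodes being pruned link */
#endif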

sa_handle_t *z_sa_hdl; /* handle to sa data */
boolean_t z_is_sa; /* are we native sa? */
boolean_t z_is_zvol; /* are we used by the zvol */
4 changes: 2 additions & 2 deletions module/zfs/dsl_pool.c
@@ -170,8 +170,8 @@ dsl_pool_open_impl(spa_t *spa, uint64_t txg)
mutex_init(&dp->dp_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&dp->dp_spaceavail_cv, NULL, CV_DEFAULT, NULL);

dp->dp_iput_taskq = taskq_create("zfs_iput_taskq", 1, minclsyspri,
1, 4, 0);
dp->dp_iput_taskq = taskq_create("z_iput", max_ncpus, minclsyspri,
max_ncpus * 8, INT_MAX, TASKQ_PREPOPULATE);

return (dp);
}
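For context, the "asynchronous iputs" from the commit message are tasks dispatched to this taskq. A rough sketch of such a dispatch helper, using the zfs_iput_async() name mentioned in the discussion above (the exact body here is an assumption for illustration):

/*
 * Sketch: drop an inode reference asynchronously via the pool's iput
 * taskq so the caller never blocks on inode eviction.  Widening the
 * taskq from 1 to max_ncpus threads lets these tasks drain in parallel.
 */
void
zfs_iput_async(struct inode *ip)
{
	objset_t *os = ITOZSB(ip)->z_os;

	/* Defer only when this is likely the final reference. */
	if (atomic_read(&ip->i_count) == 1)
		taskq_dispatch(dsl_pool_iput_taskq(dmu_objset_pool(os)),
		    (task_func_t *)iput, ip, TQ_SLEEP);
	else
		iput(ip);
}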
81 changes: 70 additions & 11 deletions module/zfs/zfs_vfsops.c
@@ -1072,6 +1072,75 @@ zfs_root(zfs_sb_t *zsb, struct inode **ipp)
}
EXPORT_SYMBOL(zfs_root);

#if !defined(HAVE_SHRINK) && !defined(HAVE_SPLIT_SHRINKER_CALLBACK)
/*
* Linux kernels older than 3.1 do not support a per-filesystem shrinker.
* To accomodate this we must improvise and manually walk the list of znodes

@dweeezil Jun 18, 2015

s/accomodate/accommodate/

* attempting to prune dentries in order to be able to drop the inodes.
*
* To avoid scanning the same znodes multiple times they are always rotated
* to the end of the z_all_znodes list. New znodes are inserted at the
* end of the list so we're always scanning the oldest znodes first.
*/
static int
zfs_sb_prune_compat(zfs_sb_t *zsb, unsigned long nr_to_scan)
{
list_t prune_list;
znode_t *zp;
int objects = 0;
int i = 0;

list_create(&prune_list, sizeof (znode_t),
offsetof(znode_t, z_prune_node));

mutex_enter(&zsb->z_znodes_lock);

@dweeezil Jun 18, 2015

I'm wondering how hotly contended this lock might become. Among other things, it's needed along the .destroy_inode callback path.

@behlendorf (Author) Jun 18, 2015

My guess is not that hot, but I've never measured it. It's briefly taken during znode creation, destruction, and fs teardown. It's always been something of a sore point as well, since it's technically entirely wasted overhead under Linux; the super block already maintains a similar list.

This is one of those things it would be nice to remove someday when refactoring the VFS code. It would let us shrink the znode by a list_node_t (16 bytes). Similarly, the znode_t itself is highly redundant with a Linux inode; merging the two would save us lots of space per file. This has been on the cleanup list forever, see openzfs#227.

while ((zp = list_head(&zsb->z_all_znodes)) != NULL) {

if (i++ > nr_to_scan)
break;

ASSERT(list_link_active(&zp->z_link_node));
list_remove(&zsb->z_all_znodes, zp);
list_insert_tail(&zsb->z_all_znodes, zp);

@dweeezil Jun 18, 2015

I like this. I've wondered, but never looked into, whether other shrinkers in the kernel do the same thing, which would lessen the need to call them with ever-higher nr_to_scan values.

@behlendorf (Author) Jun 18, 2015

It does; an active inode may be rotated to the end of the per-superblock lists. But the specifics vary significantly by kernel version.

if (!mutex_tryenter(&zp->z_lock))
continue;

if (list_link_active(&zp->z_prune_node)) {
mutex_exit(&zp->z_lock);
continue;
}

if (igrab(ZTOI(zp)) == NULL) {
mutex_exit(&zp->z_lock);
continue;
}

list_insert_tail(&prune_list, zp);
mutex_exit(&zp->z_lock);
}
mutex_exit(&zsb->z_znodes_lock);

while ((zp = list_head(&prune_list)) != NULL) {

d_prune_aliases(ZTOI(zp));

if (atomic_read(&ZTOI(zp)->i_count) == 1)
objects++;

mutex_enter(&zp->z_lock);
list_remove(&prune_list, zp);
mutex_exit(&zp->z_lock);

iput(ZTOI(zp));
}

list_destroy(&prune_list);

return (objects);
}
#endif

/*
* The ARC has requested that the filesystem drop entries from the dentry
* and inode caches. This can occur when the ARC needs to free meta data
@@ -1107,17 +1176,7 @@ zfs_sb_prune(struct super_block *sb, unsigned long nr_to_scan, int *objects)
#elif defined(HAVE_SHRINK)
*objects = (*shrinker->shrink)(shrinker, &sc);
#else
/*
* Linux kernels older than 3.1 do not support a per-filesystem
* shrinker. Therefore, we must fall back to the only available
* interface which is to discard all unused dentries and inodes.
* This behavior clearly isn't ideal but it's required so the ARC
* may free memory. The performance impact is mitigated by the
* fact that the frequently accessed dentry and inode buffers will
* still be in the ARC making them relatively cheap to recreate.
*/
*objects = 0;
shrink_dcache_parent(sb->s_root);
*objects = zfs_sb_prune_compat(zsb, nr_to_scan);
#endif
ZFS_EXIT(zsb);

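For reference, zfs_sb_prune() is invoked from the ARC's prune callback path; a caller might look roughly like this (the zpl_prune_sb() name and signature are assumptions for illustration):

/*
 * Sketch: per-superblock prune callback.  The ARC passes the number of
 * objects it wants dropped; zfs_sb_prune() reports back how many
 * dentries/inodes were actually freed.
 */
static void
zpl_prune_sb(struct super_block *sb, void *arg)
{
	unsigned long nr_to_scan = *(unsigned long *)arg;
	int objects = 0;

	(void) zfs_sb_prune(sb, nr_to_scan, &objects);
}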
2 changes: 2 additions & 0 deletions module/zfs/zfs_znode.c
@@ -104,6 +104,7 @@ zfs_znode_cache_constructor(void *buf, void *arg, int kmflags)

inode_init_once(ZTOI(zp));
list_link_init(&zp->z_link_node);
list_link_init(&zp->z_prune_node);

mutex_init(&zp->z_lock, NULL, MUTEX_DEFAULT, NULL);
rw_init(&zp->z_parent_lock, NULL, RW_DEFAULT, NULL);
@@ -130,6 +131,7 @@ zfs_znode_cache_destructor(void *buf, void *arg)
znode_t *zp = buf;

ASSERT(!list_link_active(&zp->z_link_node));
ASSERT(!list_link_active(&zp->z_prune_node));
mutex_destroy(&zp->z_lock);
rw_destroy(&zp->z_parent_lock);
rw_destroy(&zp->z_name_lock);

1 comment on commit 78aa2b1

@behlendorf (Owner, Author)

Thanks for the speedy review. I'll push a reworked version of this patch shortly after a little additional stress testing.
