Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in 3ec34e55: ARC-related tunable module parameter values are no longer applied at runtime #8405

Closed
jgottula opened this issue Feb 14, 2019 · 10 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang) Type: Regression Indicates a functional regression
Milestone

Comments

@jgottula
Copy link
Contributor

jgottula commented Feb 14, 2019

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version n/a
Linux Kernel 4.20.7
Architecture x86_64
ZFS Version 0.8.0-rc3_45_gd8d418ff0 d8d418f

Quick summary before the long version

Commit 3ec34e5 (Mar 15 2017) reworked arc_reclaim_thread, and in doing so, seems to have unintentionally removed a call to arc_tuning_update that was necessary for several ARC-tuning-related zfs module parameters to be able to have their changes applied properly at runtime.

This issue has likely affected anyone using the master branch since Mar 15 2017 Dec 26 2018; as well as release 0.8.0-rc3 from Jan 14 2019. (I checked the other release tags, and 0.8.0-rc3 appears to be the first one to have included the arc_reclaim_thread rework from 3ec34e5.)

The problem I encountered

(Edit: Please note that this issue report is not meant to concern this particular case of ARC behavior not working as expected or whatever; the concern is with the fact that a bunch of module parameters do literally nothing when their value is changed, for completely explainable reasons elaborated on elsewhere in this post.)

Today, I noticed that my machine was swapping due to the ARC taking up all 8 GiB of memory I had allotted to it while I was running some high-memory-usage compression tasks. So I decided to decrease zfs_arc_max down from 8 GiB to 6 GiB.

I went to a terminal and did echo $((6*(1<<30))) >/sys/module/zfs/parameters/zfs_arc_max as root. Nothing actually happened. I could read the value back, i.e. cat /sys/module/zfs/parameters/zfs_arc_max now showed 6442450944 instead of 8589934592.

But watching /proc/spl/kstat/zfs/arcstats indicated no actual change to anything.

c        8589934592
c_max    8589934592
size     8589934592 (approx.)

And arc_summary --section arc likewise indicated no actual change to anything.

ARC size (current):         99.9 %    8.0 GiB
Target size (adaptive):    100.0 %    8.0 GiB
Max size (high water):        16:1    8.0 GiB

My investigation into the cause of the problem

I could swear that it used to be possible to change these things at runtime. So I rummaged through the zfs source code to see how the module parameters are applied.

The values of many of the ARC tunable kernel module parameters are not used directly, but are instead used in an indirect manner: arc_tuning_update takes the raw mod param values and assigns them to the variables that the ARC actually looks at directly, clamping the raw values as needed.

Here are all of the module parameters that work in this way:

zfs_arc_max
zfs_arc_min
zfs_arc_meta_min
zfs_arc_meta_limit
zfs_arc_meta_limit_percent
zfs_arc_dnode_limit
zfs_arc_dnode_limit_percent
zfs_arc_grow_retry
zfs_arc_shrink_shift
zfs_arc_p_min_shift
zfs_arc_min_prefetch_ms
zfs_arc_min_prescient_prefetch_ms
zfs_arc_lotsfree_percent
zfs_arc_sys_free

So, a change to /sys/module/zfs/parameters/zfs_arc_max will change the value of zfs_arc_max in module/zfs/arc.c. But zfs_arc_max isn't used by anything; arc_c_max is what actually matters. And zfs_arc_max's value is only actually assigned to arc_c_max in arc_tuning_update.

Now, the documentation (function comment) for arc_tuning_update purports:

/*
 * Called during module initialization and periodically thereafter to
 * apply reasonable changes to the exposed performance tunings.  Non-zero
 * zfs_* values which differ from the currently set values will be applied.
 */

But I could only locate one call to arc_tuning_update in the entire repository, which was in arc_init:

	/* Apply user specified tunings */
	arc_tuning_update();

The "called during module initialization and periodically thereafter" documentation didn't match up to the reality. And it made more sense to me that this was a case of the code accidentally being broken at some time, than a case of the documentation being wrong (given that I remembered it working in the past).

So I looked for any commits in the history that had in any way referenced the word arc_tuning_update, via git log -Garc_tuning_update. This only turned up two commits:

  • The first, ca67b33, was from Jun 26 2015 and featured the creation of the arc_tuning_update function in the first place.
  • The second, 3ec34e5, was the Mar 15 2017 rework of arc_reclaim_thread; and in checking the commit diff, a call to arc_tuning_update was removed from arc_reclaim_thread, but no new call to the function was added in replacement.

Conclusions and what to do about this

So it appears that the arc_reclaim_thread rework accidentally removed the not-at-init-time calls to arc_tuning_update.

Probably one of the two replacement thread functions (arc_adjust_cb or arc_reap_cb) should be making that call. I don't know which should be calling it; or if there's some different way it ought to be done now. But hopefully just raising awareness of the problem is sufficient for you guys to know how to fix it up properly.

How to reproduce the problem

  1. Write a new value into any of the affected zfs module parameters listed earlier in this report, via /sys/module/zfs/parameters/$PARAM.
  2. Observe that the newly written value is reported if you read /sys/module/zfs/parameters/$PARAM back.
  3. But also observe (via /proc/spl/kstat/zfs/arcstats, or arc_summary --section arc, or whatever), that the new value doesn't actually affect the functioning of the ARC itself.

Other issues that could possibly have been caused by this

#6540 #6852

Both of these are from the affected date range. The latter issue seems much more plausibly affected by this than the former does, as it mentioned using the git master code, whereas the other one mentioned using zfs version 0.6.5.6 (albeit with Ubuntu patches, so who the hell knows).

@behlendorf behlendorf added this to the 0.8.0 milestone Feb 14, 2019
@richardelling
Copy link
Contributor

The behaviour of zfs_arc_max is described correctly and behaves as you witnessed according to the doc at
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_max

@richardelling
Copy link
Contributor

If the goal is to cause the ARC to shrink, then there is a process that can be made using zfs_arc_sys_free There isn't a simple generic way to describe this because the amount of memory varies dramatically, but the process can include the following steps:

  • note the current arc_c and zfs_arc_sys_free values
  • if zfs_arc_sys_free = 0 then calculate the value as documented
  • increase zfs_arc_sys_free to a value that forces the ARC to shrink to the desired level
  • set zfs_arc_max to desired value
  • optionally reset zfs_arc_sys_free to previous value

@jgottula
Copy link
Contributor Author

The behaviour of zfs_arc_max is described correctly and behaves as you witnessed according to the doc at
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_max

Perhaps the particular example I gave (of decreasing the value of zfs_arc_max) wasn't the best and may have been misleading.

The problem I'm describing in this issue isn't "help, my ARC won't shrink when I reduce zfs_arc_max". The problem I'm describing is "help, a bunch of the ARC-related kernel module parameters have a problem where attempting to change them after module init time does literally nothing, even in cases where changing them should do something".

@jgottula
Copy link
Contributor Author

  • note the current arc_c and zfs_arc_sys_free values
  • if zfs_arc_sys_free = 0 then calculate the value as documented
  • increase zfs_arc_sys_free to a value that forces the ARC to shrink to the desired level
  • set zfs_arc_max to desired value
  • optionally reset zfs_arc_sys_free to previous value

Yeah so, the problem with any of this would be that writing literally any value into /sys/module/zfs/parameters/zfs_arc_sys_free results in literally no change in the actual value of arc_sys_free as reported by /proc/spl/kstat/zfs/arcstats.

Sure, the module parameter value itself changes. But the actual value of the actual variable arc_sys_free in module/zfs/arc.c (well, really, the variable that the macro arc_sys_free refers to, i.e. arc_stats.arc_sys_free.value.ui64) does not change.

Read the issue report again...

@richardelling
Copy link
Contributor

I'm trying to get to a procedural documented expectation of behaviour to help with your bug report. It would help to have use cases that make sense for dynamic tuning and I think you've provided one. If you'd rather not have help, then feel free to submit a PR.

@jgottula
Copy link
Contributor Author

Please try to understand that my bug report really has nothing to do with any particular one dynamic tuning use case not working. My concern was not "yikes, why is the ARC usage not decreasing like I thought it would?"; but rather, "yikes, why is it that when I change (for example) the ARC limit to some value, the ARC limit doesn't actually change to that value, or at all for that matter?".

I'm trying to get to a procedural documented expectation of behaviour to help with your bug report.

To be really pedantic... the expected behavior is that if I modify zfs_arc_max, or zfs_arc_min, or zfs_arc_meta_min, or zfs_arc_meta_limit, or zfs_arc_meta_limit_percent, or zfs_arc_dnode_limit, or zfs_arc_dnode_limit_percent, or zfs_arc_grow_retry, or zfs_arc_shrink_shift, or zfs_arc_p_min_shift, or zfs_arc_min_prefetch_ms, or zfs_arc_min_prescient_prefetch_ms, or zfs_arc_lotsfree_percent, or zfs_arc_sys_free, at any point after zfs module initialization time, that the new value(s) will even make it to the ARC code in the first place to have a chance to affect literally anything.

Here is a comprehensive list of things that should happen, which don't happen:

  • If I write to /sys/module/zfs/parameters/zfs_arc_max, the value of arc_c_max will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_min, the value of arc_c_min will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_meta_min, the value of arc_meta_min will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_meta_limit or /sys/module/zfs/parameters/zfs_arc_meta_limit_percent, the value of arc_meta_limit will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_dnode_limit or /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent, the value of arc_dnode_limit will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_grow_retry, the value of arc_grow_retry will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_shrink_shift, the value of arc_shrink_shift will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_p_min_shift, the value of arc_p_min_shift will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_min_prefetch_ms, the value of arc_min_prefetch_ms will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_min_prescient_prefetch_ms, the value of arc_min_prescient_prefetch_ms will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_lotsfree_percent, the value of arc_lotsfree_percent will actually change accordingly*
  • If I write to /sys/module/zfs/parameters/zfs_arc_sys_free, the value of arc_sys_free will actually change accordingly*
    ( * subject to clamping and limits and so forth, of course)

The only way any of those module parameters are actually read in and applied to the ARC at all is via the function arc_tuning_update; which in the past was called both at arc_init time and periodically thereafter by the arc_reclaim_thread kthread; but 3ec34e5 seems to have unintentionally removed the kernel thread calls which used to happen after module initialization time.

Hence, all of the module parameters that I listed in the issue report can have their apparent values changed as much as you want via sysfs (at post-module-load-time), but it will have zero effect on the ARC because arc_init has already called the arc_tuning_update function the one time that it will ever be called while the module is loaded.


My own naive attempt at a fix would be to insert a call to arc_tuning_update at the beginning of arc_adjust_cb_check.

But I'm not familiar enough with the specifics here to know for sure whether that's actually the equivalent place to old call site; nor whether this placement has 100% correct behavior WRT locking and so forth (which is complicated stuff that's easy to screw up).

--- a/module/zfs/arc.c
+++ b/module/zfs/arc.c
@@ -5075,37 +5075,39 @@ arc_kmem_reap_soon(void)
 /* ARGSUSED */
 static boolean_t
 arc_adjust_cb_check(void *arg, zthr_t *zthr)
 {
+	arc_tuning_update();
+
 	/*
 	 * This is necessary in order to keep the kstat information
 	 * up to date for tools that display kstat data such as the
 	 * mdb ::arc dcmd and the Linux crash utility.  These tools
 	 * typically do not call kstat's update function, but simply
 	 * dump out stats from the most recent update.  Without
 	 * this call, these commands may show stale stats for the
 	 * anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even
 	 * with this change, the data might be up to 1 second
 	 * out of date(the arc_adjust_zthr has a maximum sleep
 	 * time of 1 second); but that should suffice.  The
 	 * arc_state_t structures can be queried directly if more
 	 * accurate information is needed.
 	 */
 	if (arc_ksp != NULL)
 		arc_ksp->ks_update(arc_ksp, KSTAT_READ);
 
 	/*
 	 * We have to rely on arc_get_data_impl() to tell us when to adjust,
 	 * rather than checking if we are overflowing here, so that we are
 	 * sure to not leave arc_get_data_impl() waiting on
 	 * arc_adjust_waiters_cv.  If we have become "not overflowing" since
 	 * arc_get_data_impl() checked, we need to wake it up.  We could
 	 * broadcast the CV here, but arc_get_data_impl() may have not yet
 	 * gone to sleep.  We would need to use a mutex to ensure that this
 	 * function doesn't broadcast until arc_get_data_impl() has gone to
 	 * sleep (e.g. the arc_adjust_lock).  However, the lock ordering of
 	 * such a lock would necessarily be incorrect with respect to the
 	 * zthr_lock, which is held before this function is called, and is
 	 * held by arc_get_data_impl() when it calls zthr_wakeup().
 	 */
 	return (arc_adjust_needed);
 }

@jwittlincohen
Copy link
Contributor

jwittlincohen commented Feb 21, 2019

I'm seeing the same thing:

After setting the following settings, arc_summary still shows the prior zfs_arc_min and zfs_arc_max settings.

echo "25769803776" > /sys/module/zfs/parameters/zfs_arc_max
echo "17179869184" > /sys/module/zfs/parameters/zfs_arc_min
Target size (adaptive):                       100.0 %    5.0 GiB
        Min size (hard limit):                        19.6 %  1004.7 MiB
        Max size (high water):                            5:1    5.0 GiB

I also tried to set the settings at boot time via /etc/modprobe.d/zfs.conf:

options zfs zfs_arc_max=25769803776
options zfs zfs_arc_min=17179869184

This too failed to change the zfs_arc_min or zfs_arc_max settings.

The only remaining option was to set zfs_arc_max and zfs_arc_min through kernel settings via /etc/default/grub.

GRUB_CMDLINE_LINUX="zfs.zfs_arc_max=25769803776 zfs.zfs_arc_min=17179869184"
To update grub.cfg, run grub-mkconfig /boot/grub/grub.cfg

P.S.: Thanks kpande for the workaround!

@behlendorf
Copy link
Contributor

@jgottula thanks for catching this and isolating the exact commit where it was introduced. You're proposed fix above is the right one, would you mind opening a PR with the change. I'd only suggest that you add a comment above the call to arc_tuning_update(). Something like:

        /*
         * This is necessary in order to periodically propagate changes
         * made to the zfs_arc* module options to the internal counterparts.
         */
        arc_tuning_update();

This issue has likely affected anyone using the master branch since Mar 15 2017

This was actually merged on Dec 26, 2018. This is a change which was originally authored a while ago but was only finalized and committed within the last few months.

commit 3ec34e55271d433e3c2dbb861a886361e006ca0a
Author:     Brad Lewis <[email protected]>
AuthorDate: Wed Mar 15 16:41:52 2017 -0700
Commit:     Brian Behlendorf <[email protected]>
CommitDate: Wed Dec 26 13:22:28 2018 -0800

@behlendorf behlendorf added Type: Regression Indicates a functional regression Type: Defect Incorrect behavior (e.g. crash, hang) labels Feb 21, 2019
@jgottula
Copy link
Contributor Author

jgottula commented Feb 27, 2019

You're proposed fix above is the right one, would you mind opening a PR with the change. I'd only suggest that you add a comment above the call to arc_tuning_update(). Something like:

        /*
         * This is necessary in order to periodically propagate changes
         * made to the zfs_arc* module options to the internal counterparts.
         */
        arc_tuning_update();

Done! #8463

This was actually merged on Dec 26, 2018. This is a change which was originally authored a while ago but was only finalized and committed within the last few months.

commit 3ec34e55271d433e3c2dbb861a886361e006ca0a
Author:     Brad Lewis <[email protected]>
AuthorDate: Wed Mar 15 16:41:52 2017 -0700
Commit:     Brian Behlendorf <[email protected]>
CommitDate: Wed Dec 26 13:22:28 2018 -0800

Ah, good catch. Git date confusion strikes again...

And that makes more sense anyway. I've been using the master branch for quite a while, and it didn't quite seem like the breakage had been affecting me for two entire years...

@dewi-ny-je
Copy link

dewi-ny-je commented Sep 1, 2019

I am running zfs 0.8.0-208_g142f84dd1 and the issue is still present:

# echo 30064771072 >> /sys/module/zfs/parameters/zfs_arc_max
# arc_summary -s arc
------------------------------------------------------------------------
ZFS Subsystem Report                            Sun Sep 01 21:36:14 2019
Linux 4.15.0-51-generic                             0.8.0-208_g142f84dd1
Machine: ml110g7 (x86_64)                           0.8.0-208_g142f84dd1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    89.9 %   14.1 GiB
        Target size (adaptive):                       100.0 %   15.7 GiB
        Min size (hard limit):                         6.2 %  1003.8 MiB
        Max size (high water):                           16:1   15.7 GiB
        Most Frequently Used (MFU) cache size:         28.4 %    3.7 GiB
        Most Recently Used (MRU) cache size:           71.6 %    9.3 GiB
        Metadata cache size (hard limit):              75.0 %   11.8 GiB
        Metadata cache size (current):                 18.2 %    2.1 GiB
        Dnode cache size (hard limit):                 10.0 %    1.2 GiB
        Dnode cache size (current):                    36.6 %  440.6 MiB
# cat /proc/spl/kstat/zfs/arcstats
[...]
p                               4    10253679616
c                               4    16841076736
c_min                           4    1052567296
c_max                           4    16841076736
size                            4    15137031288
[...]

I confirmed that the sources I used to compile zfs comprise the patch approved on 12 March 2019.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang) Type: Regression Indicates a functional regression
Projects
None yet
Development

No branches or pull requests

5 participants