Git tip can lock arc_no_grow to B_TRUE, resulting in a total ARC size collapse #3637

Closed
siebenmann opened this issue Jul 27, 2015 · 6 comments
Labels
Component: Memory Management kernel memory management

@siebenmann
Contributor

I've observed a situation where the latest git tip experiences an ARC size collapse despite plentiful free system memory; arc_c flatlined at 32 MB (c_min), arc_no_grow reported at 1, and of course the system performed terribly because nothing was cached. On inspection, I believe that there is an oversight in current git tip (after the ARC sync-up landed) that can result in this.
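The symptoms above (target size pinned at c_min with arc_no_grow at 1) can be checked mechanically. The following is a minimal sketch, assuming the counters are exposed in /proc/spl/kstat/zfs/arcstats as `name type data` rows under the names `c`, `c_min`, and `arc_no_grow`; the helper names and the sample text are hypothetical and for illustration only, not output from a real system.

```python
def parse_arcstats(text):
    """Parse kstat-style 'name type data' rows into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns with an integer value;
        # header lines do not match this shape and are skipped.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def looks_collapsed(stats):
    """True when the ARC target sits at its floor and growth is disabled."""
    return stats["arc_no_grow"] == 1 and stats["c"] <= stats["c_min"]


# Illustrative sample only (made-up values matching the 32 MB c_min report).
SAMPLE = """\
name                            type data
c                               4    33554432
c_min                           4    33554432
arc_no_grow                     4    1
"""

if __name__ == "__main__":
    print(looks_collapsed(parse_arcstats(SAMPLE)))
```

On a live system the same functions could be fed the contents of /proc/spl/kstat/zfs/arcstats instead of `SAMPLE`.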

If I'm reading the code right, the primary point where arc_no_grow is set to B_FALSE is in arc_reclaim_thread(). Tracing through the logic, this happens if free_memory is > (arc_c >> arc_no_grow_shift) and we've waited for growtime. On my 32 GB machine with a c_min of 32 MB, this requires free_memory to be above 1 MB. However, free_memory comes from arc_available_memory(), which on Linux returns at most PAGE_SIZE, i.e. 4 KB. As a result, this condition can never be true and arc_no_grow will be permanently locked at B_TRUE.

The core problem here is that on Linux, arc_available_memory() is not a value, it is a signal (and it doesn't look like a particularly good one at that), but arc_reclaim_thread() wants to use it as a value. This fails badly.
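The arithmetic behind this can be reproduced with a small model. This is a sketch of the condition as described above, not the kernel code; the shift value of 5 is an assumption inferred from the 32 MB to 1 MB figures in this report, and the function names are hypothetical.

```python
PAGE_SIZE = 4096        # 4 KB pages, as in the report
ARC_NO_GROW_SHIFT = 5   # assumption: inferred from 32 MB >> shift == 1 MB


def grow_threshold(arc_c, shift=ARC_NO_GROW_SHIFT):
    """Free memory must exceed this before arc_no_grow can be cleared."""
    return arc_c >> shift


def can_clear_no_grow(free_memory, arc_c, growtime_elapsed=True):
    """Model of the check: enough free memory and the grow time has passed."""
    return growtime_elapsed and free_memory > grow_threshold(arc_c)


if __name__ == "__main__":
    arc_c = 32 * 1024 * 1024  # ARC target pinned at c_min (32 MB)
    print(grow_threshold(arc_c))                # threshold in bytes
    # A "signal" capped at PAGE_SIZE can never exceed the threshold.
    print(can_clear_no_grow(PAGE_SIZE, arc_c))
```

Because arc_c cannot drop below c_min, the threshold never falls below 1 MB in this configuration, so a return value capped at PAGE_SIZE fails the check no matter how much memory is actually free.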

@behlendorf
Contributor

@siebenmann thank you for catching this! Yes, this was definitely an oversight and you've already identified the core issue. Under Linux the VM was designed such that we should never need to know how much free memory is available on the system. However, the ARC design from illumos assumes this information is available.

To handle this and not diverge too much from illumos, the code was originally modified to use a slightly different mechanism to manage arc_no_grow. The idea is that we should rely on __arc_shrinker_func() to determine if memory is low on the system. That function will be called by the Linux VM regularly if memory is low. It can be used to reset arc_grow_time, and that can be checked in arc_available_memory() to make a reasonable determination about the memory state of the system. If a sufficient amount of time has passed without the shrinker being called, we can safely assume there is sufficient available memory on the system. The exact value shouldn't have been too important since the kernel will notify us when it's low. However, as you discovered, it is, and I overlooked this case when doing the merge.
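The time-based mechanism described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual implementation: the class and method names are hypothetical, the retry interval is an arbitrary illustrative value, and the current time is passed in explicitly to keep the example deterministic.

```python
class GrowGate:
    """Treat memory as low for a fixed interval after each shrinker call."""

    def __init__(self, grow_retry_seconds):
        self.grow_retry = grow_retry_seconds
        self.grow_time = 0.0  # earliest time at which growth is allowed again

    def shrinker_called(self, now):
        # The VM asked the cache to shrink, so push the deadline out.
        self.grow_time = now + self.grow_retry

    def memory_looks_available(self, now):
        # No shrinker call for a full interval: assume memory is plentiful.
        return now >= self.grow_time


if __name__ == "__main__":
    gate = GrowGate(grow_retry_seconds=5)  # illustrative interval
    gate.shrinker_called(now=10.0)
    print(gate.memory_looks_available(now=12.0))  # still inside the interval
    print(gate.memory_looks_available(now=15.0))  # interval has passed
```

Note that this yields only a yes/no answer, which is why a caller that compares the result against a byte threshold is not served well by it.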

That said, we now do have a mechanism for checking the number of free pages on the system which is available for all the supported kernels. Let me propose a patch which takes advantage of that interface in a way that's appropriate for Linux.

@behlendorf behlendorf added this to the 0.6.5 milestone Jul 27, 2015
@DeHackEd
Contributor

It occurs to me this might be related to the issues in #3616. My workaround was to raise zfs_arc_min, but I didn't realize the two might be related. Stuck at low values, the ARC kept trying to shrink itself and stalling.

@snajpa
Contributor

snajpa commented Jul 27, 2015

I have hit this one several times already, I've seen ARC fall from 128G to 6G, arc_no_grow=1 and the IOPS performance of ARC dropped from ~450k down to 40k (mostly thanks to L2ARC).

@behlendorf I'm waiting for the promised patch, I'll have a patch party on Friday, so it'd be cool to have a fix for this included, or at least I can test whatever you come up with :-)

behlendorf added a commit to behlendorf/zfs that referenced this issue Jul 27, 2015
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages.  This information should
be incorporated into the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim.

It is also desirable that the amount of reclaim be tunable on a
target system.  While the default values are expected to work well
for most workloads there may be cases where custom values are needed.

zfs_arc_lotsfree - Threshold in bytes for what the ARC should consider
                   to be a lot of free memory on the system.

zfs_arc_desfree  - Threshold in bytes for what the ARC should consider
                   to be the desired available free memory on the system.

Note that zfs_arc_lotsfree and zfs_arc_desfree are defined in terms
of bytes unlike the illumos globals lotsfree and desfree.  This was
done to make reading and setting the values easier.  The current values
are available in /proc/spl/kstat/zfs/arcstats.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3637
@behlendorf
Contributor

Pull request #3639 updated with the promised patch. Effectively it updates the ARC to consult the number of free pages on the system in arc_available_memory(), more like the illumos code. However, because we don't have the same global tuning exposed as illumos, I've added the zfs_arc_lotsfree and zfs_arc_desfree tunables, which are analogous to their illumos counterparts (except in bytes, not pages). They default to the same values and are visible in /proc/spl/kstat/zfs/arcstats.

This patch follows in the same spirit as the previous ARC changes by functionally bringing the ZoL ARC back in sync with upstream as much as possible.

@nedbass @ryao @siebenmann @DeHackEd @snajpa I've only had a chance to lightly test this change, so any feedback, review, and testing would be highly appreciated.
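To illustrate the difference between a signal and a value, here is a rough model of an arc_available_memory() that reports real headroom. It is a sketch only: subtracting both thresholds from the free byte count is an assumption loosely modeled on the illumos approach, not a copy of the patch, and the threshold sizes used below are made-up illustrative numbers rather than the actual defaults.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def arc_available_memory(free_bytes, lotsfree_bytes, desfree_bytes):
    """Signed headroom in bytes: positive means spare memory, negative a deficit."""
    return free_bytes - lotsfree_bytes - desfree_bytes


if __name__ == "__main__":
    lotsfree = 512 * MIB  # illustrative, not the real default
    desfree = 256 * MIB   # illustrative, not the real default
    # Plenty of free memory: the result is far above a 1 MB grow threshold.
    print(arc_available_memory(16 * GIB, lotsfree, desfree))
    # Under pressure: the result goes negative, signalling a deficit.
    print(arc_available_memory(256 * MIB, lotsfree, desfree))
```

With a return value of this kind, a comparison such as free_memory > (arc_c >> arc_no_grow_shift) becomes satisfiable again whenever the system really does have memory to spare.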

behlendorf added a commit to behlendorf/zfs that referenced this issue Jul 28, 2015
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages.  This information should
be incorporated into the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim.

It is also desirable that the target amount of free memory be tunable
on a system.  While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.

zfs_arc_sys_free - The target number of bytes the ARC should leave
                   as free memory on the system.  This value can be
                   checked in /proc/spl/kstat/zfs/arcstats and
                   setting this module option will override the
                   default value.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3637
behlendorf added a commit to behlendorf/zfs that referenced this issue Jul 28, 2015
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages.  This information should
be incorporated into the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim.  Conceptually
this brings arc_available_memory() back in sync with illumos.

It is also desirable that the target amount of free memory be tunable
on a system.  While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.

zfs_arc_sys_free - The target number of bytes the ARC should leave
                   as free memory on the system.  This value can be
                   checked in /proc/spl/kstat/zfs/arcstats and
                   setting this module option will override the
                   default value.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3637
behlendorf added a commit to behlendorf/zfs that referenced this issue Jul 28, 2015
This brings the behavior of arc_memory_throttle() back in sync with
illumos.  The updated memory throttling policy, as used by illumos,
roughly goes like this:

* Never throttle if more than 10% of memory is free.  This threshold
  is configurable with the zfs_arc_lotsfree_percent module option.

* Minimize any throttling of kswapd even when free memory is below
  the set threshold.  Allow it to write out pages as quickly as
  possible to help alleviate the memory pressure.

* Delay all other threads when free memory is below the set threshold
  in order to avoid compounding the memory pressure.  Buffers will be
  evicted from the ARC to reduce the issue.

The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3637
snajpa pushed a commit to vpsfreecz/zfs that referenced this issue Jul 29, 2015
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages.  This information should
be incorporated into the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim.  Conceptually
this brings arc_available_memory() back in sync with illumos.

It is also desirable that the target amount of free memory be tunable
on a system.  While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.

zfs_arc_sys_free - The target number of bytes the ARC should leave
                   as free memory on the system.  This value can be
                   checked in /proc/spl/kstat/zfs/arcstats and
                   setting this module option will override the
                   default value.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3637
Signed-off-by: Pavel Snajdr <[email protected]>
snajpa pushed a commit to vpsfreecz/zfs that referenced this issue Jul 29, 2015
This brings the behavior of arc_memory_throttle() back in sync with
illumos.  The updated memory throttling policy, as used by illumos,
roughly goes like this:

* Never throttle if more than 10% of memory is free.  This threshold
  is configurable with the zfs_arc_lotsfree_percent module option.

* Minimize any throttling of kswapd even when free memory is below
  the set threshold.  Allow it to write out pages as quickly as
  possible to help alleviate the memory pressure.

* Delay all other threads when free memory is below the set threshold
  in order to avoid compounding the memory pressure.  Buffers will be
  evicted from the ARC to reduce the issue.

The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3637
Signed-off-by: Pavel Snajdr <[email protected]>
behlendorf added a commit to openzfs/spl that referenced this issue Jul 30, 2015
This patch reverts 77ab5dd.  This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places.  Those places have subsequently been updated to use
the equivalent Linux functionality.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs/zfs#3637
behlendorf added a commit to behlendorf/zfs that referenced this issue Jul 30, 2015
This brings the behavior of arc_memory_throttle() back in sync with
illumos.  The updated memory throttling policy roughly goes like this:

* Never throttle if more than 10% of memory is free.  This threshold
  is configurable with the zfs_arc_lotsfree_percent module option.

* Minimize any throttling of kswapd even when free memory is below
  the set threshold.  Allow it to write out pages as quickly as
  possible to help alleviate the memory pressure.

* Delay all other threads when free memory is below the set threshold
  in order to avoid compounding the memory pressure.  Buffers will be
  evicted from the ARC to reduce the issue.

The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#3637
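
The three-part throttling policy in the commit message above can be modeled as a small decision function. This is a simplified sketch, not the arc_memory_throttle() implementation: the function name, the string return values, and the way kswapd is identified by a boolean flag are all assumptions made for illustration.

```python
GIB = 1024 ** 3


def throttle_decision(free_bytes, total_bytes, is_kswapd, lotsfree_percent=10):
    """Return 'none', 'minimal', or 'delay' following the described policy."""
    # Never throttle while more than lotsfree_percent of memory is free.
    if free_bytes > total_bytes * lotsfree_percent // 100:
        return "none"
    # Below the threshold, kswapd is throttled as little as possible so it
    # can keep writing out pages and relieve the pressure.
    if is_kswapd:
        return "minimal"
    # Every other thread is delayed to avoid compounding the pressure.
    return "delay"


if __name__ == "__main__":
    total = 32 * GIB
    print(throttle_decision(4 * GIB, total, is_kswapd=False))
    print(throttle_decision(1 * GIB, total, is_kswapd=True))
    print(throttle_decision(1 * GIB, total, is_kswapd=False))
```

In this model, passing lotsfree_percent=0 means any nonzero amount of free memory clears the first check, which mirrors the commit's point that the separate disable option had become redundant.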
@DeHackEd
Contributor

Someone was just in IRC with the same symptoms.
http://paste.ubuntu.com/12052761/
He appeared to be up to date as of right now, including kernel driver version. Forcibly raising zfs_arc_max unwedged him.

@siebenmann
Contributor Author

Might it be something similar to or related to #3680, where the page cache hammers the ARC into the ground? The symptoms are sort of similar; arc_no_grow goes to 1, the ARC size is hammered into the ground (especially data_size), and then it doesn't grow afterwards even if arc_no_grow becomes 0 again.

It's also possible that there's a general issue here where once the ARC has been hammered into the ground by something, it grows only very slowly even if there's lots of free memory. Forcing the ARC target size up with a zfs_arc_max reset then allows the ARC to start growing aggressively. If this is the case then I'd expect it to happen for any surge of memory demand that shoves the ARC down, whether that is from page cache growth, sudden user memory demand, or some other kernel memory usage.
