Git tip can lock arc_no_grow to B_TRUE, resulting in a total ARC size collapse #3637
Comments
@siebenmann thank you for catching this! Yes, this was definitely an oversight and you've already identified the core issue. Under Linux the VM was designed such that we should never need to know how much free memory is available on the system. However, the ARC design from illumos assumes this information is available. To handle this and not diverge too much from illumos, the code was originally modified to use a slightly different mechanism to manage That said, we now do have a mechanism for checking the number of free pages on the system which is available for all the supported kernels. Let me propose a patch which takes advantage of that interface in a way that's appropriate for Linux.
It occurs to me this might be related to the issues with #3616 - my solution was to raise
I have hit this one several times already; I've seen the ARC fall from 128G to 6G. @behlendorf I'm waiting for the promised patch. I'll have a patch party on Friday, so it'd be cool to have a fix for this included, or at least I can test whatever you come up with :-)
While Linux doesn't provide detailed information about the state of the VM, it does provide us total free pages. This information should be incorporated into the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. It is also desirable that the amount of reclaim be tunable on a target system. While the default values are expected to work well for most workloads, there may be cases where custom values are needed. zfs_arc_lotsfree - Threshold in bytes for what the ARC should consider to be a lot of free memory on the system. zfs_arc_desfree - Threshold in bytes for what the ARC should consider to be the desired available free memory on the system. Note that zfs_arc_lotsfree and zfs_arc_desfree are defined in terms of bytes, unlike the illumos globals lotsfree and desfree. This was done to make reading and setting the values easier. The current values are available in /proc/spl/kstat/zfs/arcstats. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3637
Pull request #3639 updated with the promised patch. Effectively it updates the ARC to consult the number of free pages on the system in This patch follows in the same spirit as the previous ARC changes by functionally bringing the ZoL ARC back in sync with upstream as much as possible. @nedbass @ryao @siebenmann @DeHackEd @snajpa I've only had a chance to lightly test this change, so any feedback, review and testing would be highly appreciated.
While Linux doesn't provide detailed information about the state of the VM, it does provide us total free pages. This information should be incorporated into the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads, there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can be checked in /proc/spl/kstat/zfs/arcstats, and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3637
While Linux doesn't provide detailed information about the state of the VM, it does provide us total free pages. This information should be incorporated into the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. Conceptually this brings arc_available_memory() back in sync with illumos. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads, there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can be checked in /proc/spl/kstat/zfs/arcstats, and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3637
This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy, as used by illumos, roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3637
While Linux doesn't provide detailed information about the state of the VM, it does provide us total free pages. This information should be incorporated into the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. Conceptually this brings arc_available_memory() back in sync with illumos. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads, there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can be checked in /proc/spl/kstat/zfs/arcstats, and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3637 Signed-off-by: Pavel Snajdr <[email protected]>
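The calculation described in this commit message can be illustrated with a rough sketch. This is not the code from the patch: the function name and its parameters are hypothetical, and it assumes only that the result is the system's free bytes minus the zfs_arc_sys_free target, with a negative result meaning the ARC should give memory back.

```c
#include <stdint.h>

/*
 * Hypothetical sketch, not the actual arc_available_memory().
 * free_pages:     total free pages reported by the kernel
 * page_size:      size of one page in bytes
 * sys_free_bytes: target free memory, standing in for zfs_arc_sys_free
 *
 * Returns a value in bytes rather than a bare signal: positive when
 * there is headroom above the target, negative when the system is
 * below the target by that many bytes.
 */
static int64_t
sketch_available_memory(uint64_t free_pages, uint64_t page_size,
    uint64_t sys_free_bytes)
{
	int64_t free_bytes = (int64_t)(free_pages * page_size);

	return (free_bytes - (int64_t)sys_free_bytes);
}
```

Because the result scales with actual free memory, a caller that compares it against a threshold in bytes gets a meaningful answer instead of a value capped at one page.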
This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy, as used by illumos, roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3637 Signed-off-by: Pavel Snajdr <[email protected]>
This patch reverts 77ab5dd. This is now possible because upstream has refactored the ARC in such a way that these values are only used in a few key places. Those places have subsequently been updated to use the equivalent Linux functionality. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs/zfs#3637
This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#3637
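The three-part throttling policy listed in this commit message can be sketched as a small decision function. This is an illustration only, not the real arc_memory_throttle(): the enum, the function name, and the parameters are hypothetical, and the threshold is assumed to be a simple percentage of total memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome labels for the three cases in the policy. */
typedef enum {
	SKETCH_NO_THROTTLE,
	SKETCH_MINIMAL_THROTTLE,
	SKETCH_DELAY
} sketch_throttle_t;

/*
 * lotsfree_percent stands in for zfs_arc_lotsfree_percent (default 10
 * per the commit message); is_kswapd says whether the caller is kswapd.
 */
static sketch_throttle_t
sketch_memory_throttle(uint64_t free_bytes, uint64_t total_bytes,
    unsigned lotsfree_percent, bool is_kswapd)
{
	uint64_t threshold;

	/* A setting of 0 disables throttling, as the message states. */
	if (lotsfree_percent == 0)
		return (SKETCH_NO_THROTTLE);

	threshold = total_bytes / 100 * lotsfree_percent;

	/* Never throttle while more than the threshold is free. */
	if (free_bytes > threshold)
		return (SKETCH_NO_THROTTLE);

	/* Let kswapd write pages out with as little delay as possible. */
	if (is_kswapd)
		return (SKETCH_MINIMAL_THROTTLE);

	/* Everyone else is delayed to avoid compounding the pressure. */
	return (SKETCH_DELAY);
}
```

The explicit check for a zero percentage mirrors the statement that zfs_arc_lotsfree_percent=0 has the same effect as the removed zfs_arc_memory_throttle_disable option.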
Someone was just in IRC with the same symptoms.
Might it be something similar to or related to #3680, where the page cache hammers the ARC into the ground? The symptoms are sort of similar; arc_no_grow goes to 1, the ARC size is hammered into the ground (especially data_size), and then it doesn't grow afterwards even if arc_no_grow becomes 0 again. It's also possible that there's a general issue here where once the ARC has been hammered into the ground by something, it grows only very slowly even if there's lots of free memory. Forcing the ARC target size up with a zfs_arc_max reset then allows the ARC to start growing aggressively. If this is the case then I'd expect it to happen for any surge of memory demand that shoves the ARC down, whether that is from page cache growth, sudden user memory demand, or some other kernel memory usage.
I've observed a situation where the latest git tip experiences an ARC size collapse despite plentiful free system memory; arc_c flatlined at 32 MB (c_min), arc_no_grow reported at 1, and of course the system performed terribly because nothing was cached. On inspection, I believe that there is an oversight in current git tip (after the ARC sync-up landed) that can result in this.
If I'm reading the code right, the primary point where arc_no_grow is set to B_FALSE is in arc_reclaim_thread(). Tracing through the logic, this happens if free_memory is > (arc_c >> arc_no_grow_shift) and we've waited for growtime. On my 32 GB machine with a c_min of 32 MB, this requires free_memory to be above 1 MB. However, free_memory comes from arc_available_memory(), which on Linux returns at most PAGE_SIZE, i.e. 4K. As a result, this condition can never be true and arc_no_grow will be permanently locked at B_TRUE.
The core problem here is that on Linux, arc_available_memory() is not a value, it is a signal (and it doesn't look like a particularly good one at that), but arc_reclaim_thread() wants to use it as a value. This fails badly.
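The arithmetic above can be checked with a minimal sketch. It is not the ZFS source: the function and macro names are made up for illustration, the shift of 5 is the value implied by the 32 MB to 1 MB figures, and the page size is taken as the 4K mentioned.

```c
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_PAGE_SIZE	4096LL	/* 4K pages, as in the report */
#define	SKETCH_NO_GROW_SHIFT	5	/* assumed: 32 MB >> 5 == 1 MB */

/*
 * Hypothetical stand-in for the test in arc_reclaim_thread():
 * arc_no_grow may be cleared only when free_memory exceeds
 * arc_c >> shift and growtime has elapsed.
 */
static bool
sketch_can_clear_no_grow(int64_t free_memory, uint64_t arc_c,
    bool growtime_elapsed)
{
	int64_t threshold = (int64_t)(arc_c >> SKETCH_NO_GROW_SHIFT);

	return (growtime_elapsed && free_memory > threshold);
}

/*
 * Hypothetical stand-in for a signal-style arc_available_memory():
 * the best case it can report is one page.
 */
static int64_t
sketch_signal_best_case(void)
{
	return (SKETCH_PAGE_SIZE);
}
```

With arc_c at 32 MB the threshold is 1 MB, so a best-case return of 4096 bytes can never satisfy the comparison, no matter how much memory is actually free.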