This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

Emergency slab objects #155

Closed
wants to merge 1 commit into from

Conversation

behlendorf
Contributor

This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs. The issue is that the Linux kernel does
not honor the flags passed to __vmalloc(). This makes it unsafe
to use in a writeback context. Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accommodate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code. This work has been planned
for the 0.7.0 release but in the meantime we want to improve the
SPL slab implementation to accommodate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue. This doesn't prevent the possibility
of the worker thread deadlocking. However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available. The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will time out. When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flag.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the
worker thread is deadlocked the txg_sync thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.

These emergency allocations will likely be slow since they require
contiguous pages. However, their use should be rare so the impact
is expected to be minimal. If that turns out not to be the case in
practice further optimizations are possible.

One additional concern is whether these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed. If they accumulate on a system and the list
grows, freeing objects will become more expensive. This could be
handled relatively easily by using a hash instead of a list, but that
optimization is left for a follow up patch.

Additionally, these emergency objects could be repacked into existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented. See issue #26
for full details. This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end. The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case. These values should give us a good idea
of how often these objects are needed. Based on these values under
real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock. This should manifest itself as reduced cpu usage for
the system.

Signed-off-by: Brian Behlendorf [email protected]
@behlendorf
Contributor Author

I'm hopeful that with this patch applied to the SPL and the patches from openzfs/zfs#726 applied to ZFS we'll get the needed improvements in stability under memory pressure. If anyone has time to perform some additional testing with these patches I'd appreciate it.

@ryao
Contributor

ryao commented Aug 17, 2012

This is a noticeable improvement over the current code. I ran four instances of python -c "print 2**10**10" on my quadcore system and it was able to reach approximately 1.1GB in swap before it deadlocked. I am going to guess that memory fragmentation prevented the emergency slabs from being allocated under stress, but this is noticeably better than the current code, which deadlocks around 100MB. The kernel also appeared to begin writing to the swap zvol earlier than it does with the current code, which is another good thing.

@behlendorf
Contributor Author

That's encouraging. I'll try your python -c "print 2**10**10" test case in a VM and see if I can recreate the issue.

As for it being caused by fragmentation I'm not so sure. The emergency objects are all allocated GFP_NOIO so we should keep reclaiming all memory which doesn't require an I/O until we can satisfy the allocation. I'd hope we'd be able to eventually find the 128k we need... although it could take a while.

I think it's more likely that there are still some kmem_alloc(KM_SLEEP) allocations in certain unlikely code paths which need to be changed to kmem_alloc(KM_PUSHPAGE). That is unless you were running with your previous patch which changed a lot of these call sites. Anyway, we'll need to get some backtraces to be sure.

With this change applied I've so far also had good luck running with the ZFS PF_MEMALLOC patches reverted, as expected.

@ryao
Contributor

ryao commented Aug 18, 2012

I think it's more likely that there are still some kmem_alloc(KM_SLEEP) allocations in certain unlikely code paths which need to be changed to kmem_alloc(KM_PUSHPAGE). That is unless you were running with your previous patch which changed a lot of these call sites. Anyway, we'll need to get some backtraces to be sure.

I was using this with that patch.

@pyavdr

pyavdr commented Aug 19, 2012

I am not sure if it matters, but there seems to be an improvement to the SLAB called SLUB: http://lwn.net/Articles/229984/ ... Maybe some ideas of SLUB can help to improve the current situation?

@behlendorf
Contributor Author

@ryao It looks like you're right about fragmentation still being an issue. I was able to reproduce the same issue with your python test case in my VM when using zfs for the swap device. The following order 5 allocation was for an emergency object during swap. I'm going to investigate if there is a safe way to allocate individual pages for these objects and then safely vmap() them. This is something you looked into in the past, but I'm hopeful we can just target this sort of behavior for the rarely used emergency objects.

[  273.786719] __alloc_pages_slowpath: 15 callbacks suppressed
[  273.787075] z_wr_int/11: page allocation failure. order:5, mode:0x4030
[  273.787510] Pid: 2066, comm: z_wr_int/11 Tainted: P            2.6.38-15-generic #65-Ubuntu
[  273.788059] Call Trace:
[  273.788247]  [] ? __alloc_pages_nodemask+0x604/0x840
[  273.788695]  [] ? alloc_pages_current+0xa5/0x110
[  273.789094]  [] ? __get_free_pages+0xe/0x50
[  273.789501]  [] ? kmalloc_order_trace+0x3f/0xb0
[  273.789906]  [] ? __kmalloc+0x13a/0x160
[  273.797911]  [] ? spl_kmem_cache_alloc+0x9d8/0x12b0 [spl]
[  273.798385]  [] ? __vdev_disk_physio+0x3ae/0x4d0 [zfs]
[  273.799881]  [] ? autoremove_wake_function+0x0/0x40
[  273.800305]  [] ? ftrace_call+0x5/0x2b
[  273.801749]  [] ? zio_buf_alloc+0x31/0x80 [zfs]
[  273.802157]  [] ? vdev_queue_io_to_issue+0x3f6/0x7e0 [zfs]
[  273.802597]  [] ? vdev_queue_io_done+0x82/0xd0 [zfs]
[  273.805707]  [] ? zio_vdev_io_done+0x98/0x1f0 [zfs]
[  273.806137]  [] ? zio_execute+0xfa/0x310 [zfs]
[  273.806561]  [] ? ftrace_call+0x5/0x2b
[  273.806913]  [] ? taskq_thread+0x263/0x860 [spl]
[  273.807332]  [] ? finish_task_switch+0x41/0xe0
[  273.807751]  [] ? default_wake_function+0x0/0x20
[  273.808150]  [] ? taskq_thread+0x0/0x860 [spl]
[  273.808575]  [] ? kthread+0x96/0xa0
[  273.808913]  [] ? kernel_thread_helper+0x4/0x10
[  273.809325]  [] ? kthread+0x0/0xa0
[  273.809672]  [] ? kernel_thread_helper+0x0/0x10

@behlendorf
Contributor Author

Closing. See issue #161 and openzfs/zfs#883 which take this work further and allow for zvol based swap devices.

@behlendorf behlendorf closed this Aug 23, 2012