Simplify reclaim #1613

Closed
wants to merge 2 commits into from

ryao commented Jul 26, 2013

I have been testing a revival of arc_reclaim where I make direct reclaim kick the arc_reclaim thread and block until either 1 second passes or it completes. It seems to work well and brings us more in line with other implementations.

I am opening this pull request for review. The manner in which the commits are split needs to be revisited before it is merged, but the end result seems sane to me.

This reverts nearly all of the changes, with some adjustments to
arc_memory_throttle() because some of the original code did not make
sense and changes had been made to it since this patch was written.

Signed-off-by: Richard Yao <[email protected]>

Conflicts:
	module/zfs/arc.c

ryao commented Jul 27, 2013

This should close issue #802. It implements what I sketched out there.

ryao commented Jul 27, 2013

I seem to have caught a regression:

[ 2824.861365] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 2824.862001] IP: [<ffffffff812886cb>] __list_add+0x1b/0xc0
[ 2824.862001] PGD 14184b067 PUD 1c943d067 PMD 0 
[ 2824.877181] Oops: 0000 [#1] 
[ 2824.877181] PREEMPT 
[ 2824.877181] SMP 

[ 2824.877181] Modules linked in: bridge stp ipv6 llc snd_hda_codec_analog arc4 rtl8187 eeprom_93cx6 mac80211 coretemp kvm_intel kvm cfg80211 iTCO_wdt snd_hda_intel microcode firewire_ohci snd_hda_codec lpc_ich mfd_core i2c_i801 snd_pcm snd_page_alloc snd_timer rtc_cmos floppy snd acpi_cpufreq asus_atk0110 mperf freq_table evdev soundcore processor unix zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) xts gf128mul aes_x86_64 cbc sha256_generic iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse znvpair(PO) spl(O) xfs nfs lockd sunrpc reiserfs ext4 crc16 mbcache jbd2 firewire_core crc_itu_t hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd sr_mod cdrom sg pata_jmicron
[ 2824.877181] CPU: 3 PID: 51173 Comm: javac Tainted: P           O 3.10.2 #1
[ 2824.877181] Hardware name: System manufacturer P5K Deluxe/P5K Deluxe, BIOS 1005    12/16/2008
[ 2824.877181] task: ffff88007f7c6180 ti: ffff880178948000 task.ti: ffff880178948000
[ 2824.877181] RIP: 0010:[<ffffffff812886cb>]  [<ffffffff812886cb>] __list_add+0x1b/0xc0
[ 2824.877181] RSP: 0000:ffff8801789499d8  EFLAGS: 00010046
[ 2824.877181] RAX: 0000000000000202 RBX: ffff880178949a50 RCX: 0000000000000100
[ 2824.877181] RDX: ffffffffa0434620 RSI: 0000000000000000 RDI: ffff880178949a50
[ 2824.877181] RBP: ffff8801789499f0 R08: 0000000000000000 R09: 0000000000000000
[ 2824.877181] R10: 0000000000015ab9 R11: 0000000000000000 R12: ffffffffa0434620
[ 2824.877181] R13: 0000000000000000 R14: ffffffffa0434608 R15: 00000000000f1806
[ 2824.877181] FS:  00007fef1c64e700(0000) GS:ffff88022fd80000(0000) knlGS:0000000000000000
[ 2824.877181] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2824.877181] CR2: 0000000000000000 CR3: 00000001ee5b0000 CR4: 00000000000407e0
[ 2824.877181] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2824.877181] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2824.877181] Stack:
[ 2824.877181]  ffff880178949a50 ffffffffa0434608 0000000000000001 ffff880178949a28
[ 2824.877181]  ffffffff8105fce3 0000000000000202 00000000a04346e8 ffffffffa0434600
[ 2824.877181]  ffffffffa0434680 00000000000003e8 ffff880178949a80 ffffffffa01299dd
[ 2824.877181] Call Trace:
[ 2824.877181]  [<ffffffff8105fce3>] prepare_to_wait_exclusive+0x73/0x80
[ 2824.877181]  [<ffffffffa01299dd>] __cv_wait_io+0x9d/0x150 [spl]
[ 2824.877181]  [<ffffffff8105fea0>] ? wake_up_bit+0x30/0x30
[ 2824.877181]  [<ffffffffa0129aae>] __cv_timedwait_interruptible+0xe/0x10 [spl]
[ 2824.877181]  [<ffffffffa0373642>] arc_add_prune_callback+0x5e2/0x620 [zfs]
[ 2824.877181]  [<ffffffff810f566d>] shrink_slab+0x17d/0x3a0
[ 2824.877181]  [<ffffffff810f7c69>] do_try_to_free_pages+0x1d9/0x470
[ 2824.877181]  [<ffffffff810f7fc9>] try_to_free_pages+0xc9/0x1e0
[ 2824.877181]  [<ffffffff810ee5a5>] __alloc_pages_nodemask+0x4f5/0x880
[ 2824.877181]  [<ffffffff8112f3cb>] do_huge_pmd_anonymous_page+0x17b/0x440
[ 2824.877181]  [<ffffffff8110b59b>] handle_mm_fault+0x24b/0x2e0
[ 2824.877181]  [<ffffffff8156b504>] __do_page_fault+0x154/0x5b0
[ 2824.877181]  [<ffffffff8106fe1d>] ? sched_clock_local+0x1d/0x80
[ 2824.877181]  [<ffffffff810c2c27>] ? acct_account_cputime+0x17/0x20
[ 2824.877181]  [<ffffffff810703f5>] ? account_user_time+0x85/0x90
[ 2824.877181]  [<ffffffff815681a3>] ? _raw_spin_unlock+0x13/0x40
[ 2824.877181]  [<ffffffff81070881>] ? vtime_account_user+0x61/0x70
[ 2824.877181]  [<ffffffff8156b987>] do_page_fault+0x27/0x50
[ 2824.877181]  [<ffffffff81568ae2>] page_fault+0x22/0x30
[ 2824.877181] Code: e9 3b ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 48 89 fb 4c 8b 42 08 49 39 f0 75 2a <4d> 8b 45 00 4d 39 c4 75 68 4c 39 e3 74 3e 4c 39 eb 74 39 49 89 
[ 2824.877181] RIP  [<ffffffff812886cb>] __list_add+0x1b/0xc0
[ 2824.877181]  RSP <ffff8801789499d8>
[ 2824.877181] CR2: 0000000000000000
[ 2824.877181] ---[ end trace a39eaa29a5fbfbd0 ]---
[ 2824.877181] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
[ 2824.877181] in_atomic(): 1, irqs_disabled(): 1, pid: 51173, name: javac
[ 2824.877181] CPU: 3 PID: 51173 Comm: javac Tainted: P      D    O 3.10.2 #1
[ 2824.877181] Hardware name: System manufacturer P5K Deluxe/P5K Deluxe, BIOS 1005    12/16/2008
[ 2824.877181]  ffff880178949928 ffff880178949660 ffffffff81562f07 ffff880178949670
[ 2824.877181]  ffffffff81068eb1 ffff880178949688 ffffffff81565b5b ffff88007f7c6180
[ 2824.877181]  ffff8801789496a8 ffffffff8104fcef ffff880178949928 0000000000000009
[ 2824.877181] Call Trace:
[ 2824.877181]  [<ffffffff81562f07>] dump_stack+0x19/0x1b
[ 2824.877181]  [<ffffffff81068eb1>] __might_sleep+0xe1/0x100
[ 2824.877181]  [<ffffffff81565b5b>] down_read+0x1b/0x30
[ 2824.877181]  [<ffffffff8104fcef>] exit_signals+0x1f/0x120
[ 2824.877181]  [<ffffffff8103f2f7>] do_exit+0xa7/0x9f0
[ 2824.877181]  [<ffffffff8155d491>] ? printk+0x4f/0x51
[ 2824.877181]  [<ffffffff8103e1e9>] ? kmsg_dump+0xb9/0xd0
[ 2824.877181]  [<ffffffff815695eb>] oops_end+0x8b/0xd0
[ 2824.877181]  [<ffffffff8155ce28>] no_context+0x25e/0x26b
[ 2824.877181]  [<ffffffff81009894>] ? native_sched_clock+0x24/0x80
[ 2824.877181]  [<ffffffff8155ce9d>] __bad_area_nosemaphore+0x68/0x1c1
[ 2824.877181]  [<ffffffff8106ffa8>] ? sched_clock_cpu+0xa8/0x100
[ 2824.877181]  [<ffffffff8155d004>] bad_area_nosemaphore+0xe/0x10
[ 2824.877181]  [<ffffffff8156b70e>] __do_page_fault+0x35e/0x5b0
[ 2824.877181]  [<ffffffff81071df1>] ? select_task_rq_fair+0x201/0x6e0
[ 2824.877181]  [<ffffffff81009894>] ? native_sched_clock+0x24/0x80
[ 2824.877181]  [<ffffffff8106854a>] ? ttwu_stat+0x9a/0x110
[ 2824.877181]  [<ffffffff8156b987>] do_page_fault+0x27/0x50
[ 2824.877181]  [<ffffffff81568ae2>] page_fault+0x22/0x30
[ 2824.877181]  [<ffffffff812886cb>] ? __list_add+0x1b/0xc0
[ 2824.877181]  [<ffffffff8105fce3>] prepare_to_wait_exclusive+0x73/0x80
[ 2824.877181]  [<ffffffffa01299dd>] __cv_wait_io+0x9d/0x150 [spl]
[ 2824.877181]  [<ffffffff8105fea0>] ? wake_up_bit+0x30/0x30
[ 2824.877181]  [<ffffffffa0129aae>] __cv_timedwait_interruptible+0xe/0x10 [spl]
[ 2824.877181]  [<ffffffffa0373642>] arc_add_prune_callback+0x5e2/0x620 [zfs]
[ 2824.877181]  [<ffffffff810f566d>] shrink_slab+0x17d/0x3a0
[ 2824.877181]  [<ffffffff810f7c69>] do_try_to_free_pages+0x1d9/0x470
[ 2824.877181]  [<ffffffff810f7fc9>] try_to_free_pages+0xc9/0x1e0
[ 2824.877181]  [<ffffffff810ee5a5>] __alloc_pages_nodemask+0x4f5/0x880
[ 2824.877181]  [<ffffffff8112f3cb>] do_huge_pmd_anonymous_page+0x17b/0x440
[ 2824.877181]  [<ffffffff8110b59b>] handle_mm_fault+0x24b/0x2e0
[ 2824.877181]  [<ffffffff8156b504>] __do_page_fault+0x154/0x5b0
[ 2824.877181]  [<ffffffff8106fe1d>] ? sched_clock_local+0x1d/0x80
[ 2824.877181]  [<ffffffff810c2c27>] ? acct_account_cputime+0x17/0x20
[ 2824.877181]  [<ffffffff810703f5>] ? account_user_time+0x85/0x90
[ 2824.877181]  [<ffffffff815681a3>] ? _raw_spin_unlock+0x13/0x40
[ 2824.877181]  [<ffffffff81070881>] ? vtime_account_user+0x61/0x70
[ 2824.877181]  [<ffffffff8156b987>] do_page_fault+0x27/0x50
[ 2824.877181]  [<ffffffff81568ae2>] page_fault+0x22/0x30
[ 2824.882542] note: javac[51173] exited with preempt_count 1

I am somewhat confused as to how arc_add_prune_callback is being listed in the backtrace. I believe that the problem is that the arc_shrinker is being invoked from the page fault handler. I am testing a revised patch that should address it. If all goes well, I will push it tomorrow.

On Illumos, kswapd and arc_reclaim_thread share responsibility for
memory reclamation. arc_reclaim_thread will attempt to keep ahead of
system memory needs and kswapd will signal it as needed. On Linux, a
thread that needs memory when there is none available will immediately
begin "direct reclaim", which means that it will try freeing memory
itself.

This poses a few problems. First, it is possible for hundreds of threads
to enter direct reclaim. Unfortunately, only one can reclaim from ARC at
a time. This means that one will reap while the others will be told to
do other things. If we enter a state where ARC is the only source of
memory available on the system, then we effectively have a fork bomb
where all of the other reclaim threads will starve reclamation from ARC
of CPU time.

Additionally, it is possible for ARC and Linux to have different ideas
of when a system is low on memory. If Linux thinks it is out of memory
and ARC disagrees, we will have a situation where arc_reclaim_thread
spins.

This patch attempts to address the first issue by signalling
arc_reclaim_thread to enter reclaim and blocking until either it is
signalled by arc_reclaim_thread or 1 second passes. Resuming after
1 second is necessary to resolve a race between arc_reclaim_thread doing
its broadcast and sleeping. A previous patch had empirical success in
resolving the second issue by moving primary responsibility for ARC
memory reclamation to the ARC shrinker callback. This patch moves all
ARC memory reclamation duties to arc_reclaim_thread, which should have a
similar effect.

Signed-off-by: Richard Yao <[email protected]>

ryao commented Jul 27, 2013

I have pushed a revised commit that will return immediately when in an interrupt context to address the backtrace above. It will also return -1 instead of the amount reclaimed to ensure that shrink_slab() does not continuously retry the shrinker. This is consistent with openzfs/spl#268.
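
As a rough illustration of that return convention, here is a small userspace model of a single-callback shrinker of the kind used by kernels of that era (such as the 3.10 kernel in the log above). It is a sketch under stated assumptions, not the revised commit: shrink_request, cached_objects, reclaim_kicks and shrinker_sketch are hypothetical names, the atomic_context flag stands in for whatever context check the patch performs, and returning -1 on the early-return path is an assumption. The size query (nr_to_scan of zero) still reports a count, since the legacy interface does not allow -1 for that case.

```c
#include <stdbool.h>

/* Hypothetical model of a legacy shrinker request; not the kernel struct. */
struct shrink_request {
	unsigned long nr_to_scan;	/* 0 means "only report the cache size" */
};

static unsigned long cached_objects = 1000;	/* stand-in for reclaimable buffers */
static unsigned int reclaim_kicks;		/* times the reclaim thread was kicked */

static int shrinker_sketch(const struct shrink_request *req, bool atomic_context)
{
	/* Size query: report the object count without doing any work. */
	if (req->nr_to_scan == 0)
		return (int)cached_objects;

	/* Assumption: bail out before anything that could sleep. */
	if (atomic_context)
		return -1;

	/* Hand the work to the reclaim thread (modelled as a counter here). */
	reclaim_kicks++;

	/* -1 tells the caller not to keep retrying this shrinker. */
	return -1;
}
```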

ryao commented Jul 29, 2013

This needs some more work. I am closing this until I have perfected it.

ryao closed this Jul 29, 2013