[GIT PULL] RCU changes for v5.17 #14

ammarfaizi2 · 2022-01-10T19:02:10Z

Hello, Linus,

Please pull the latest RCU git tree from:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git tags/rcu.2022.01.09a
  # HEAD: f80fe66c38d561a006fb4f514b0ee5d11cbe2673 Merge branches 'doc.2021.11.30c', 'exp.2021.12.07a', 'fastnohz.2021.11.30c', 'fixes.2021.11.30c', 'nocb.2021.12.09a', 'nolibc.2021.11.30c', 'tasks.2021.12.09a', 'torture.2021.12.07a' and 'torturescript.2021.11.30c' into HEAD

RCU changes for this cycle were:

doc.2021.11.30c: Documentation updates, perhaps most notably Neil Brown's
	writeup of the reference-counting analogy to RCU.

exp.2021.12.07a: Expedited grace-period cleanups.

fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ due to lack of valid
	users. I have asked around, posted a blog entry, and sent this
	series to LKML without result.

fixes.2021.11.30c: Miscellaneous fixes.

nocb.2021.12.09a: RCU callback offloading updates, perhaps most notably
	Frederic Weisbecker's updates allowing CPUs booted in the
	de-offloaded state to be offloaded at runtime.

nolibc.2021.11.30c: nolibc fixes from Willy Tarreau and Anmar Faizi, but
	also including Mark Brown's addition of gettid().

tasks.2021.12.09a: RCU Tasks Trace fixes, including changes that increase
	the scalability of call_rcu_tasks_trace() for the BPF folks
	(Martin Lau and KP Singh).

torture.2021.12.07a: Various fixes including those from Wander Lairson
	Costa and Li Zhijian.

torturescript.2021.11.30c: Fixes plus addition of tests for the increased
	call_rcu_tasks_trace() scalability.

----------------------------------------------------------------
Akira Yokosawa (1):
      doc: RCU: Avoid 'Symbol' font-family in SVG figures

Ammar Faizi (3):
      tools/nolibc: x86-64: Fix startup code bug
      tools/nolibc: x86: Remove `r8`, `r9` and `r10` from the clobber list
      tools/nolibc: x86-64: Use `mov $60,%eax` instead of `mov $60,%rax`

Changbin Du (1):
      rcu: in_irq() cleanup

Chun-Hung Tseng (1):
      rcu: Replace ________p1 and _________p1 with __UNIQUE_ID(rcu)

Frederic Weisbecker (20):
      rcu: Ignore rdp.cpu_no_qs.b.exp on preemptible RCU's rcu_qs()
      rcu: Move rcu_data.cpu_no_qs.b.exp reset to rcu_export_exp_rdp()
      rcu: Remove rcu_data.exp_deferred_qs and convert to rcu_data.cpu no_qs.b.exp
      rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
      rcu/nocb: Make local rcu_nocb_lock_irqsave() safe against concurrent deoffloading
      rcu/nocb: Prepare state machine for a new step
      rcu/nocb: Invoke rcu_core() at the start of deoffloading
      rcu/nocb: Make rcu_core() callbacks acceleration (de-)offloading safe
      rcu/nocb: Check a stable offloaded state to manipulate qlen_last_fqs_check
      rcu/nocb: Use appropriate rcu_nocb_lock_irqsave()
      rcu/nocb: Limit number of softirq callbacks only on softirq
      rcu: Fix callbacks processing time limit retaining cond_resched()
      rcu: Apply callbacks processing time limit only on softirq
      rcu/nocb: Don't invoke local rcu core on callback overload from nocb kthread
      rcu/nocb: Remove rcu_node structure from nocb list when de-offloaded
      rcu/nocb: Prepare nocb_cb_wait() to start with a non-offloaded rdp
      rcu/nocb: Optimize kthreads and rdp initialization
      rcu/nocb: Create kthreads on all CPUs if "rcu_nocbs=" or "nohz_full=" are passed
      rcu/nocb: Allow empty "rcu_nocbs" kernel parameter
      rcu/nocb: Merge rcu_spawn_cpu_nocb_kthread() and rcu_spawn_one_nocb_kthread()

Jun Miao (1):
      rcu: Avoid alloc_pages() when recording stack

Li Zhijian (9):
      refscale: Simplify the errexit checkpoint
      refscale: Prevent buffer to pr_alert() being too long
      refscale: Always log the error message
      refscale: Add missing '\n' to flush message
      scftorture: Add missing '\n' to flush message
      scftorture: Remove unused SCFTORTOUT
      rcuscale: Always log error message
      scftorture: Always log error message
      locktorture,rcutorture,torture: Always log error message

Mark Brown (1):
      tools/nolibc: Implement gettid()

Neeraj Upadhyay (1):
      rcu-tasks: Inspect stalled task's trc state in locked state

NeilBrown (1):
      doc: Add refcount analogy to What is RCU

Paul E. McKenney (43):
      rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios
      doc: Remove obsolete kernel-per-CPU-kthreads RCU_FAST_NO_HZ advice
      torture: Remove RCU_FAST_NO_HZ from rcuscale and refscale scenarios
      torture: Remove RCU_FAST_NO_HZ from rcu scenarios
      rcu: Remove the RCU_FAST_NO_HZ Kconfig option
      rcu: Move rcu_needs_cpu() to tree.c
      srcu: Prevent redundant __srcu_read_unlock() wakeup
      rcu-tasks: Don't remove tasks with pending IPIs from holdout list
      rcutorture: Sanitize RCUTORTURE_RDR_MASK
      rcutorture: More thoroughly test nested readers
      rcutorture: Suppress pi-lock-across read-unlock testing for Tiny SRCU
      torture: Catch kvm.sh help text up with actual options
      torture: Make kvm-find-errors.sh report link-time undefined symbols
      torture: Retry download once before giving up
      rcutorture: Cause TREE02 and TREE10 scenarios to do more callback flooding
      rcutorture: Test RCU Tasks lock-contention detection
      torture: Fix incorrectly redirected "exit" in kvm-remote.sh
      torture: Properly redirect kvm-remote.sh "echo" commands
      rcu: Mark sync_sched_exp_online_cleanup() ->cpu_no_qs.b.exp load
      rcu: Prevent expedited GP from enabling tick on offline CPU
      rcu: Make idle entry report expedited quiescent states
      rcu: Tighten rcu_advance_cbs_nowake() checks
      rcu-tasks: Create per-CPU callback lists
      rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection
      rcu-tasks: Convert grace-period counter to grace-period sequence number
      rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
      rcutorture: Test RCU-tasks multiqueue callback queueing
      rcutorture: Enable multiple concurrent callback-flood kthreads
      rcutorture: Add ability to limit callback-flood intensity
      rcutorture: Combine n_max_cbs from all kthreads in a callback flood
      rcu-tasks: Use spin_lock_rcu_node() and friends
      rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
      rcu-tasks: Abstract checking of callback lists
      rcu-tasks: Abstract invocations of callbacks
      rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations
      rcu-tasks: Make rcu_barrier_tasks*() handle multiple callback queues
      rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set initial queueing
      rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
      rcu-tasks: Avoid raw-spinlocked wakeups from call_rcu_tasks_generic()
      rcu-tasks: Use more callback queues if contention encountered
      rcu-tasks: Use separate ->percpu_dequeue_lim for callback dequeueing
      rcu-tasks: Use fewer callbacks queues if callback flood ends
      Merge branches 'doc.2021.11.30c', 'exp.2021.12.07a', 'fastnohz.2021.11.30c', 'fixes.2021.11.30c', 'nocb.2021.12.09a', 'nolibc.2021.11.30c', 'tasks.2021.12.09a', 'torture.2021.12.07a' and 'torturescript.2021.11.30c' into HEAD

Thomas Gleixner (1):
      rcu/nocb: Make rcu_core() callbacks acceleration preempt-safe

Wander Lairson Costa (1):
      rcutorture: Avoid soft lockup during cpu stall

Willy Tarreau (2):
      tools/nolibc: i386: fix initial stack alignment
      tools/nolibc: fix incorrect truncation of exit code

Zhouyi Zhou (1):
      rcu: Improve tree_plugin.h comments and add code cleanups

Zqiang (1):
      rcu: Avoid running boost kthreads on isolated CPUs

 .../RCU/Design/Expedited-Grace-Periods/Funnel0.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel1.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel2.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel3.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel4.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel5.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel6.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel7.svg |   4 +-
 .../RCU/Design/Expedited-Grace-Periods/Funnel8.svg |   4 +-
 .../Design/Requirements/GPpartitionReaders1.svg    |  36 +-
 .../Design/Requirements/ReadersPartitionGP1.svg    |  62 +--
 Documentation/RCU/stallwarn.rst                    |  11 -
 Documentation/RCU/whatisRCU.rst                    |  90 +++-
 Documentation/admin-guide/kernel-parameters.txt    |  70 ++-
 .../admin-guide/kernel-per-CPU-kthreads.rst        |   2 +-
 Documentation/timers/no_hz.rst                     |  10 +-
 include/linux/rcu_segcblist.h                      |  51 ++-
 include/linux/rcupdate.h                           |  50 ++-
 include/linux/rcutiny.h                            |   2 +-
 include/linux/srcu.h                               |   3 +-
 include/linux/torture.h                            |   9 +-
 kernel/locking/locktorture.c                       |   4 +-
 kernel/rcu/Kconfig                                 |  20 +-
 kernel/rcu/rcu_segcblist.c                         |  10 +-
 kernel/rcu/rcu_segcblist.h                         |  12 +-
 kernel/rcu/rcuscale.c                              |  14 +-
 kernel/rcu/rcutorture.c                            | 234 +++++++---
 kernel/rcu/refscale.c                              |  50 ++-
 kernel/rcu/srcutiny.c                              |   2 +-
 kernel/rcu/tasks.h                                 | 476 +++++++++++++++++----
 kernel/rcu/tree.c                                  | 131 ++++--
 kernel/rcu/tree.h                                  |  31 +-
 kernel/rcu/tree_exp.h                              |  14 +-
 kernel/rcu/tree_nocb.h                             | 160 ++++---
 kernel/rcu/tree_plugin.h                           | 250 ++---------
 kernel/rcu/tree_stall.h                            |  27 +-
 kernel/scftorture.c                                |  16 +-
 kernel/torture.c                                   |   4 +-
 tools/include/nolibc/nolibc.h                      |  86 ++--
 .../selftests/rcutorture/bin/kvm-find-errors.sh    |   4 +-
 .../selftests/rcutorture/bin/kvm-recheck-rcu.sh    |   2 +-
 .../testing/selftests/rcutorture/bin/kvm-remote.sh |  23 +-
 tools/testing/selftests/rcutorture/bin/kvm.sh      |   9 +-
 .../selftests/rcutorture/bin/parse-build.sh        |   3 +-
 .../selftests/rcutorture/configs/rcu/SRCU-T        |   1 +
 .../selftests/rcutorture/configs/rcu/SRCU-U        |   1 +
 .../selftests/rcutorture/configs/rcu/TASKS01.boot  |   1 +
 .../selftests/rcutorture/configs/rcu/TINY01        |   1 +
 .../selftests/rcutorture/configs/rcu/TINY02        |   1 +
 .../selftests/rcutorture/configs/rcu/TRACE01.boot  |   1 +
 .../selftests/rcutorture/configs/rcu/TRACE02.boot  |   1 +
 .../selftests/rcutorture/configs/rcu/TREE01        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE02        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE02.boot   |   1 +
 .../selftests/rcutorture/configs/rcu/TREE04        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE05        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE06        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE07        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE08        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE10        |   1 -
 .../selftests/rcutorture/configs/rcu/TREE10.boot   |   1 +
 .../selftests/rcutorture/configs/rcuscale/TINY     |   2 +-
 .../selftests/rcutorture/configs/rcuscale/TRACE01  |   1 -
 .../selftests/rcutorture/configs/rcuscale/TREE     |   1 -
 .../selftests/rcutorture/configs/rcuscale/TREE54   |   1 -
 .../rcutorture/configs/refscale/NOPREEMPT          |   1 -
 .../selftests/rcutorture/configs/refscale/PREEMPT  |   1 -
 .../selftests/rcutorture/doc/TREE_RCU-kconfig.txt  |   1 -
 68 files changed, 1244 insertions(+), 795 deletions(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE02.boot
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE10.boot

With CONFIG_PREEMPT_DYNAMIC=y, the kernel builds with CONFIG_PREEMPTION=y because preemption can be enabled at runtime. This prevents any tests of Tiny RCU or Tiny SRCU from running correctly. This commit therefore explicitly sets CONFIG_PREEMPT_DYNAMIC=n for those scenarios. Signed-off-by: Paul E. McKenney <[email protected]>

This document advises building with both CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y. However, CONFIG_NO_HZ=y offloads callbacks from all nohz_full CPUs, and CPUs with offloaded callbacks do not benefit from CONFIG_RCU_FAST_NO_HZ=y. Quite the opposite: CONFIG_RCU_FAST_NO_HZ=y simply adds a bit of idle entry/exit overhead. This commit therefore changes that advice to only CONFIG_NO_HZ=y. Signed-off-by: Paul E. McKenney <[email protected]>

The reader-writer-lock analogy is a useful way to think about RCU, but it is not always applicable. It is useful to have other analogies to work with, and particularly to emphasise that no single analogy is perfect. This patch add a "RCU as reference count" to the "what is RCU" document. See https://lwn.net/Articles/872559/ [ paulmck: Apply Akira Yokosawa feedback. ] Reviewed-by: Akira Yokosawa <[email protected]> Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

On Ubuntu Focal, strings in some of SVG files under Documentation/RCU/Design can not be rendered properly when converted to PDF. Ubuntu releases since Focal and Debian bullseye have trouble with "Symbol" font-family in SVG files. As those strings are mostly API names such as "READ_ONCE()", "WRITE_ONCE(), "rcu_read_lock()", and so on, using a generic monospace font-family should be a good alternative. Substitute the font-family name by a simple sed pattern: 's/Symbol/monospace/g' Signed-off-by: Akira Yokosawa <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

All of the rcuscale and refscale scenarios that mention the Kconfig option CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <[email protected]>

All of the rcu scenarios that mentioning CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <[email protected]>

All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve systems with RCU callbacks offloaded. In this situation, all that this Kconfig option does is slow down idle entry/exit with an additional allways-taken early exit. If this is the only use case, then this Kconfig option nothing but an attractive nuisance that needs to go away. This commit therefore removes the RCU_FAST_NO_HZ Kconfig option. Signed-off-by: Paul E. McKenney <[email protected]>

Now that RCU_FAST_NO_HZ is no more, there is but one implementation of the rcu_needs_cpu() function. This commit therefore moves this function from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c. Signed-off-by: Paul E. McKenney <[email protected]>

This commit replaces both ________p1 and _________p1 with __UNIQUE_ID(rcu), and also adjusts the callers of the affected macros. __UNIQUE_ID(rcu) will generate unique variable names during compilation, which eliminates the need of ________p1 and _________p1 (both having 4 occurrences prior to the code change). This also avoids the variable name shadowing issue, or at least makes those wishing to cause shadowing problems work much harder to do so. The same idea is used for the min/max macros (commit 589a978 and commit e9092d0). Signed-off-by: Jim Huang <[email protected]> Signed-off-by: Chun-Hung Tseng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

This commit replaces the obsolete and ambiguous macro in_irq() with its shiny new in_hardirq() equivalent. Signed-off-by: Changbin Du <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

This commit cleans up some comments and code in kernel/rcu/tree_plugin.h. Signed-off-by: Zhouyi Zhou <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

When the boost kthreads are created on systems with nohz_full CPUs, the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD). However, when the rcu_boost_kthread_setaffinity() is called, the original affinity will be changed and these kthreads can subsequently run on nohz_full CPUs. This commit makes rcu_boost_kthread_setaffinity() restrict these boost kthreads to housekeeping CPUs. Signed-off-by: Zqiang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however, it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain non-preemptive contexts/RT kernel including raw_spin_locks (see gfp.h and ab00db2). Fix it by instructing stackdepot to not expand stack storage via alloc_pages() in case it runs out by using kasan_record_aux_stack_noalloc(). Jianwei Hu reported: BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969 in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3 INFO: lockdep is turned off. irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590 softirqs last enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590 softirqs last disabled at (0): [<0000000000000000>] 0x0 CPU: 6 PID: 15319 Comm: python3 Tainted: G W O 5.15-rc7-preempt-rt #1 Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018 Call Trace: show_stack+0x52/0x58 dump_stack+0xa1/0xd6 ___might_sleep.cold+0x11c/0x12d rt_spin_lock+0x3f/0xc0 rmqueue+0x100/0x1460 rmqueue+0x100/0x1460 mark_usage+0x1a0/0x1a0 ftrace_graph_ret_addr+0x2a/0xb0 rmqueue_pcplist.constprop.0+0x6a0/0x6a0 __kasan_check_read+0x11/0x20 __zone_watermark_ok+0x114/0x270 get_page_from_freelist+0x148/0x630 is_module_text_address+0x32/0xa0 __alloc_pages_nodemask+0x2f6/0x790 __alloc_pages_slowpath.constprop.0+0x12d0/0x12d0 create_prof_cpu_mask+0x30/0x30 alloc_pages_current+0xb1/0x150 stack_depot_save+0x39f/0x490 kasan_save_stack+0x42/0x50 kasan_save_stack+0x23/0x50 kasan_record_aux_stack+0xa9/0xc0 __call_rcu+0xff/0x9c0 call_rcu+0xe/0x10 put_object+0x53/0x70 __delete_object+0x7b/0x90 kmemleak_free+0x46/0x70 slab_free_freelist_hook+0xb4/0x160 kfree+0xe5/0x420 kfree_const+0x17/0x30 kobject_cleanup+0xaa/0x230 kobject_put+0x76/0x90 netdev_queue_update_kobjects+0x17d/0x1f0 ... ... ksys_write+0xd9/0x180 __x64_sys_write+0x42/0x50 do_syscall_64+0x38/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642 Fixes: 84109ab ("rcu: Record kvfree_call_rcu() call stack for KASAN") Fixes: 26e760c ("rcu: kasan: record and print call_rcu() call stack") Reported-by: Jianwei Hu <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Marco Elver <[email protected]> Tested-by: Juri Lelli <[email protected]> Signed-off-by: Jun Miao <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

Before this patch, the `_start` function looks like this: ``` 0000000000001170 <_start>: 1170: pop %rdi 1171: mov %rsp,%rsi 1174: lea 0x8(%rsi,%rdi,8),%rdx 1179: and $0xfffffffffffffff0,%rsp 117d: sub $0x8,%rsp 1181: call 1000 <main> 1186: movzbq %al,%rdi 118a: mov $0x3c,%rax 1191: syscall 1193: hlt 1194: data16 cs nopw 0x0(%rax,%rax,1) 119f: nop ``` Note the "and" to %rsp with $-16, it makes the %rsp be 16-byte aligned, but then there is a "sub" with $0x8 which makes the %rsp no longer 16-byte aligned, then it calls main. That's the bug! What actually the x86-64 System V ABI mandates is that right before the "call", the %rsp must be 16-byte aligned, not after the "call". So the "sub" with $0x8 here breaks the alignment. Remove it. An example where this rule matters is when the callee needs to align its stack at 16-byte for aligned move instruction, like `movdqa` and `movaps`. If the callee can't align its stack properly, it will result in segmentation fault. x86-64 System V ABI also mandates the deepest stack frame should be zero. Just to be safe, let's zero the %rbp on startup as the content of %rbp may be unspecified when the program starts. Now it looks like this: ``` 0000000000001170 <_start>: 1170: pop %rdi 1171: mov %rsp,%rsi 1174: lea 0x8(%rsi,%rdi,8),%rdx 1179: xor %ebp,%ebp # zero the %rbp 117b: and $0xfffffffffffffff0,%rsp # align the %rsp 117f: call 1000 <main> 1184: movzbq %al,%rdi 1188: mov $0x3c,%rax 118f: syscall 1191: hlt 1192: data16 cs nopw 0x0(%rax,%rax,1) 119d: nopl (%rax) ``` Cc: Bedirhan KURT <[email protected]> Cc: Louvian Lyndal <[email protected]> Reported-by: Peter Cordes <[email protected]> Signed-off-by: Ammar Faizi <[email protected]> [wt: I did this on purpose due to a misunderstanding of the spec, other archs will thus have to be rechecked, particularly i386] Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

After re-checking in the spec and comparing stack offsets with glibc, The last pushed argument must be 16-byte aligned (i.e. aligned before the call) so that in the callee esp+4 is multiple of 16, so the principle is the 32-bit equivalent to what Ammar fixed for x86_64. It's possible that 32-bit code using SSE2 or MMX could have been affected. In addition the frame pointer ought to be zero at the deepest level. Link: https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI Cc: Ammar Faizi <[email protected]> Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

Ammar Faizi reported that our exit code handling is wrong. We truncate it to the lowest 8 bits but the syscall itself is expected to take a regular 32-bit signed integer, not an unsigned char. It's the kernel that later truncates it to the lowest 8 bits. The difference is visible in strace, where the program below used to show exit(255) instead of exit(-1): int main(void) { return -1; } This patch applies the fix to all archs. x86_64, i386, arm64, armv7 and mips were all tested and confirmed to work fine now. Risc-v was not tested but the change is trivial and exactly the same as for other archs. Reported-by: Ammar Faizi <[email protected]> Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

Linux x86-64 syscall only clobbers rax, rcx and r11 (and "memory"). - rax for the return value. - rcx to save the return address. - r11 to save the rflags. Other registers are preserved. Having r8, r9 and r10 in the syscall clobber list is harmless, but this results in a missed-optimization. As the syscall doesn't clobber r8-r10, GCC should be allowed to reuse their value after the syscall returns to userspace. But since they are in the clobber list, GCC will always miss this opportunity. Remove them from the x86-64 syscall clobber list to help GCC generate better code and fix the comment. See also the x86-64 ABI, section A.2 AMD64 Linux Kernel Conventions, A.2.1 Calling Conventions [1]. Extra note: Some people may think it does not really give a benefit to remove r8, r9 and r10 from the syscall clobber list because the impression of syscall is a C function call, and function call always clobbers those 3. However, that is not the case for nolibc.h, because we have a potential to inline the "syscall" instruction (which its opcode is "0f 05") to the user functions. All syscalls in the nolibc.h are written as a static function with inline ASM and are likely always inline if we use optimization flag, so this is a profit not to have r8, r9 and r10 in the clobber list. Here is the example where this matters. Consider the following C code: ``` #include "tools/include/nolibc/nolibc.h" #define read_abc(a, b, c) __asm__ volatile("nop"::"r"(a),"r"(b),"r"(c)) int main(void) { int a = 0xaa; int b = 0xbb; int c = 0xcc; read_abc(a, b, c); write(1, "test\n", 5); read_abc(a, b, c); return 0; } ``` Compile with: gcc -Os test.c -o test -nostdlib With r8, r9, r10 in the clobber list, GCC generates this: 0000000000001000 <main>: 1000: f3 0f 1e fa endbr64 1004: 41 54 push %r12 1006: 41 bc cc 00 00 00 mov $0xcc,%r12d 100c: 55 push %rbp 100d: bd bb 00 00 00 mov $0xbb,%ebp 1012: 53 push %rbx 1013: bb aa 00 00 00 mov $0xaa,%ebx 1018: 90 nop 1019: b8 01 00 00 00 mov $0x1,%eax 101e: bf 01 00 00 00 mov $0x1,%edi 1023: ba 05 00 00 00 mov $0x5,%edx 1028: 48 8d 35 d1 0f 00 00 lea 0xfd1(%rip),%rsi 102f: 0f 05 syscall 1031: 90 nop 1032: 31 c0 xor %eax,%eax 1034: 5b pop %rbx 1035: 5d pop %rbp 1036: 41 5c pop %r12 1038: c3 ret GCC thinks that syscall will clobber r8, r9, r10. So it spills 0xaa, 0xbb and 0xcc to callee saved registers (r12, rbp and rbx). This is clearly extra memory access and extra stack size for preserving them. But syscall does not actually clobber them, so this is a missed optimization. Now without r8, r9, r10 in the clobber list, GCC generates better code: 0000000000001000 <main>: 1000: f3 0f 1e fa endbr64 1004: 41 b8 aa 00 00 00 mov $0xaa,%r8d 100a: 41 b9 bb 00 00 00 mov $0xbb,%r9d 1010: 41 ba cc 00 00 00 mov $0xcc,%r10d 1016: 90 nop 1017: b8 01 00 00 00 mov $0x1,%eax 101c: bf 01 00 00 00 mov $0x1,%edi 1021: ba 05 00 00 00 mov $0x5,%edx 1026: 48 8d 35 d3 0f 00 00 lea 0xfd3(%rip),%rsi 102d: 0f 05 syscall 102f: 90 nop 1030: 31 c0 xor %eax,%eax 1032: c3 ret Cc: Andy Lutomirski <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: [email protected] Cc: "H. Peter Anvin" <[email protected]> Cc: David Laight <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Signed-off-by: Ammar Faizi <[email protected]> Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI [1] Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

Note that mov to 32-bit register will zero extend to 64-bit register. Thus `mov $60,%eax` has the same effect with `mov $60,%rax`. Use the shorter opcode to achieve the same thing. ``` b8 3c 00 00 00 mov $60,%eax (5 bytes) [1] 48 c7 c0 3c 00 00 00 mov $60,%rax (7 bytes) [2] ``` Currently, we use [2]. Change it to [1] for shorter code. Signed-off-by: Ammar Faizi <[email protected]> Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

Allow test programs to determine their thread ID. Signed-off-by: Mark Brown <[email protected]> Cc: Willy Tarreau <[email protected]> Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

Tiny SRCU readers can appear at task level, but also in interrupt and softirq handlers. Because Tiny SRCU is selected only in kernels built with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace period to start while there is a non-task-level SRCU reader executing. This means that it does not make sense for __srcu_read_unlock() to awaken the Tiny SRCU grace period, because that can only happen when the grace period is waiting for one value of ->srcu_idx and __srcu_read_unlock() is ending the last reader for some other value of ->srcu_idx. After all, any such wakeup will be redundant. Worse yet, in some cases, such wakeups generate lockdep splats: ====================================================== WARNING: possible circular locking dependency detected 5.15.0-rc1+ #3758 Not tainted ------------------------------------------------------ rcu_torture_rea/53 is trying to acquire lock: ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at: xa/0x30 but task is already holding lock: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at: _extend+0x370/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&p->pi_lock){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x2f/0x50 try_to_wake_up+0x50/0x580 swake_up_locked.part.7+0xe/0x30 swake_up_one+0x22/0x30 rcutorture_one_extend+0x1b6/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_timer+0x1a/0x70 call_timer_fn+0xa6/0x230 run_timer_softirq+0x493/0x4c0 __do_softirq+0xc0/0x371 irq_exit+0x73/0x90 sysvec_apic_timer_interrupt+0x63/0x80 asm_sysvec_apic_timer_interrupt+0x12/0x20 default_idle+0xb/0x10 default_idle_call+0x5e/0x170 do_idle+0x18a/0x1f0 cpu_startup_entry+0xa/0x10 start_kernel+0x678/0x69f secondary_startup_64_no_verify+0xc2/0xcb -> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}: __lock_acquire+0x130c/0x2440 lock_acquire+0xc2/0x270 _raw_spin_lock_irqsave+0x2f/0x50 swake_up_one+0xa/0x30 rcutorture_one_extend+0x387/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_reader+0xac/0x200 kthread+0x12d/0x150 ret_from_fork+0x22/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&p->pi_lock); lock(srcu_ctl.srcu_wq.lock); lock(&p->pi_lock); lock(srcu_ctl.srcu_wq.lock); *** DEADLOCK *** 1 lock held by rcu_torture_rea/53: #0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at: _extend+0x370/0x400 stack backtrace: CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+ Hardware name: Red Hat KVM/RHEL-AV, BIOS e_el8.5.0+746+bbd5d70c 04/01/2014 Call Trace: check_noncircular+0xfe/0x110 ? find_held_lock+0x2d/0x90 __lock_acquire+0x130c/0x2440 lock_acquire+0xc2/0x270 ? swake_up_one+0xa/0x30 ? find_held_lock+0x72/0x90 _raw_spin_lock_irqsave+0x2f/0x50 ? swake_up_one+0xa/0x30 swake_up_one+0xa/0x30 rcutorture_one_extend+0x387/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_reader+0xac/0x200 ? rcutorture_oom_notify+0xf0/0xf0 ? __kthread_parkme+0x61/0x90 ? rcu_torture_one_read+0x5d0/0x5d0 kthread+0x12d/0x150 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 This is a false positive because there is only one CPU, and both locks are raw (non-preemptible) spinlocks. However, it is worthwhile getting rid of the redundant wakeup, which has the side effect of breaking the theoretical deadlock cycle. This commit therefore eliminates the redundant wakeups. Signed-off-by: Paul E. McKenney <[email protected]>

Currently, the check_all_holdout_tasks_trace() function removes all tasks marked with ->trc_reader_checked from the holdout list, including those with IPIs pending. This means that the IPI handler might arrive at a task that has already been removed from the list, which is at best an accident waiting to happen. This commit therefore avoids removing tasks with IPIs pending from the holdout list. This in turn means that the "if" condition in the for_each_online_cpu() loop in rcu_tasks_trace_postgp() should always evaluate to false, so a WARN_ON_ONCE() is added to check that. Signed-off-by: Paul E. McKenney <[email protected]>

RCUTORTURE_RDR_MASK is currently not the bit indicated by RCUTORTURE_RDR_SHIFT, but is instead all the bits less significant than that one. This is an accident waiting to happen, so this commit makes RCUTORTURE_RDR_MASK be that one bit and adjusts uses accordingly. Signed-off-by: Paul E. McKenney <[email protected]>

Currently, nested readers occur only when a timer handler interrupts a reader. This is rare, and is thus insufficient testing of the transition between nesting levels. This commit therefore causes rcutorture nested readers to be the rule rather than the exception. Signed-off-by: Paul E. McKenney <[email protected]>

Because Tiny srcu_read_unlock() directly calls swake_up_one(), lockdep complains when a pi lock is held across that srcu_read_unlock(). Although this is a lockdep false positive (there is no other CPU to complete the deadlock cycle), lockdep is what it is at the moment. This commit therefore prevents rcutorture from holding pi lock across a Tiny srcu_read_unlock(). Signed-off-by: Paul E. McKenney <[email protected]>

There is only the one OOM error case in main_func(), so this commit eliminates the errexit local variable in favor of a branch to cleanup code. Signed-off-by: Li Zhijian <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

0Day/LKP observed that the refscale results fail to complete when larger values of nrun (such as 300) are specified. The problem is that printk() can accept at most a 1024-byte buffer. This commit therefore prints the buffer whenever its length exceeds 800 bytes. CC: Philip Li <[email protected]> Reported-by: kernel test robot <[email protected]> Signed-off-by: Li Zhijian <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

This commit brings the kvm.sh script's help text up to date with recently (and some not-so-recently) added parameters. Signed-off-by: Paul E. McKenney <[email protected]>

This commit makes kvm-find-errors.sh check for and report undefined symbols that are detected at link time. Signed-off-by: Paul E. McKenney <[email protected]>

Currently, a transient network error can kill a run if it happens while downloading the tarball to one of the target systems. This commit therefore does a 60-second wait and then a retry. If further experience indicates, a more elaborate mechanism might be used later. Signed-off-by: Paul E. McKenney <[email protected]>

…oding This commit enables two callback-flood kthreads for the TREE02 scenario and 28 for the TREE10 scenario. Cc: Neeraj Upadhyay <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

[ Upstream commit 4e264be ] When a system with E810 with existing VFs gets rebooted the following hang may be observed. Pid 1 is hung in iavf_remove(), part of a network driver: PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow" #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930 #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf] #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513 #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429 #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4 #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice] #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice] #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice] #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1 #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386 #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870 #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6 #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159 #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169 #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7 RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90 R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005 R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000 ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b During reboot all drivers PM shutdown callbacks are invoked. In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE. In ice_shutdown() the call chain above is executed, which at some point calls iavf_remove(). However iavf_remove() expects the VF to be in one of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If that's not the case it sleeps forever. So if iavf_shutdown() gets invoked before iavf_remove() the system will hang indefinitely because the adapter is already in state __IAVF_REMOVE. Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE, as we already went through iavf_shutdown(). Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove") Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove") Reported-by: Marius Cornea <[email protected]> Signed-off-by: Stefan Assmann <[email protected]> Reviewed-by: Michal Kubiak <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

Currently, the per cpu upcall counters are allocated after the vport is created and inserted into the system. This could lead to the datapath accessing the counters before they are allocated resulting in a kernel Oops. Here is an example: PID: 59693 TASK: ffff0005f4f51500 CPU: 0 COMMAND: "ovs-vswitchd" #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4 #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60 #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58 #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388 #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68 #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch] ... PID: 58682 TASK: ffff0005b2f0bf00 CPU: 0 COMMAND: "kworker/0:3" #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758 #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994 #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8 #3 [ffff80000a5d3120] die at ffffb70f0628234c #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8 #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4 #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4 #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710 #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74 #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24 #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch] #13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch] #14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch] #15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch] #16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch] #17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90 We moved the per cpu upcall counter allocation to the existing vport alloc and free functions to solve this. Fixes: 95637d9 ("net: openvswitch: release vport resources on failure") Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: Eelco Chaudron <[email protected]> Reviewed-by: Simon Horman <[email protected]> Acked-by: Aaron Conole <[email protected]> Signed-off-by: David S. Miller <[email protected]>

The cited commit holds encap tbl lock unconditionally when setting up dests. But it may cause the following deadlock: PID: 1063722 TASK: ffffa062ca5d0000 CPU: 13 COMMAND: "handler8" #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91 #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528 #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core] #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core] #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core] #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core] #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core] #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core] #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core] #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core] #13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8 #14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower] #15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower] #16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0 [1031218.028143] wait_for_completion+0x24/0x30 [1031218.028589] mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core] [1031218.029256] mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core] [1031218.029885] process_one_work+0x24e/0x510 Actually no need to hold encap tbl lock if there is no encap action. Fix it by checking if encap action exists or not before holding encap tbl lock. Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running") Signed-off-by: Chris Mi <[email protected]> Reviewed-by: Vlad Buslov <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>

[ Upstream commit 0b0747d ] The following processes run into a deadlock. CPU 41 was waiting for CPU 29 to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29 was hung by that spinlock with IRQs disabled. PID: 17360 TASK: ffff95c1090c5c40 CPU: 41 COMMAND: "mrdiagd" !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0 !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0 !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0 # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0 # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0 # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0 # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0 # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0 # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0 # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0 #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0 #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0 #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0 #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0 #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0 #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0 #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0 #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0 #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0 #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0 #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0 #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0 #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0 #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0 #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0 #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0 #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0 #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0 PID: 17355 TASK: ffff95c1090c3d80 CPU: 29 COMMAND: "mrdiagd" !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0 !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0 # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0 # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0 # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0 # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0 # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0 # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0 # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0 # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0 #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0 #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0 #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0 #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0 #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0 #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0 #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0 #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0 The lock is used to synchronize different sysfs operations, it doesn't protect any resource that will be touched by an interrupt. Consequently it's not required to disable IRQs. Replace the spinlock with a mutex to fix the deadlock. Signed-off-by: Junxiao Bi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Mike Christie <[email protected]> Cc: [email protected] Signed-off-by: Martin K. Petersen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit a154f5f ] The following call trace shows a deadlock issue due to recursive locking of mutex "device_mutex". First lock acquire is in target_for_each_device() and second in target_free_device(). PID: 148266 TASK: ffff8be21ffb5d00 CPU: 10 COMMAND: "iscsi_ttx" #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224 #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7 #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3 #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod] #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod] #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod] #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod] #13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod] #14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod] #15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod] #16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07 #17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod] #18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod] #19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080 #20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364 Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion") Signed-off-by: Junxiao Bi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Mike Christie <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it still means hlist_for_each_entry_rcu can return an item that got removed from the list. The memory itself of such item is not freed thanks to RCU but nothing guarantees the actual content of the memory is sane. In particular, the reference count can be zero. This can happen if ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough timing, this can happen: 1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry. 2. Then, the whole ipv6_del_addr is executed for the given entry. The reference count drops to zero and kfree_rcu is scheduled. 3. ipv6_get_ifaddr continues and tries to increments the reference count (in6_ifa_hold). 4. The rcu is unlocked and the entry is freed. 5. The freed entry is returned. Prevent increasing of the reference count in such case. The name in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe. [ 41.506330] refcount_t: addition on 0; use-after-free. [ 41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130 [ 41.507413] Modules linked in: veth bridge stp llc [ 41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14 [ 41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) [ 41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130 [ 41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff [ 41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282 [ 41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000 [ 41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900 [ 41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff [ 41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000 [ 41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48 [ 41.514086] FS: 00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000 [ 41.514726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0 [ 41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 41.516799] Call Trace: [ 41.517037] <TASK> [ 41.517249] ? __warn+0x7b/0x120 [ 41.517535] ? refcount_warn_saturate+0xa5/0x130 [ 41.517923] ? report_bug+0x164/0x190 [ 41.518240] ? handle_bug+0x3d/0x70 [ 41.518541] ? exc_invalid_op+0x17/0x70 [ 41.520972] ? asm_exc_invalid_op+0x1a/0x20 [ 41.521325] ? refcount_warn_saturate+0xa5/0x130 [ 41.521708] ipv6_get_ifaddr+0xda/0xe0 [ 41.522035] inet6_rtm_getaddr+0x342/0x3f0 [ 41.522376] ? __pfx_inet6_rtm_getaddr+0x10/0x10 [ 41.522758] rtnetlink_rcv_msg+0x334/0x3d0 [ 41.523102] ? netlink_unicast+0x30f/0x390 [ 41.523445] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 41.523832] netlink_rcv_skb+0x53/0x100 [ 41.524157] netlink_unicast+0x23b/0x390 [ 41.524484] netlink_sendmsg+0x1f2/0x440 [ 41.524826] __sys_sendto+0x1d8/0x1f0 [ 41.525145] __x64_sys_sendto+0x1f/0x30 [ 41.525467] do_syscall_64+0xa5/0x1b0 [ 41.525794] entry_SYSCALL_64_after_hwframe+0x72/0x7a [ 41.526213] RIP: 0033:0x7fbc4cfcea9a [ 41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 [ 41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a [ 41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005 [ 41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c [ 41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b [ 41.531573] </TASK> Fixes: 5c578ae ("IPv6: convert addrconf hash list to RCU") Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Jiri Benc <[email protected]> Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>

vhost_worker will call tun call backs to receive packets. If too many illegal packets arrives, tun_do_read will keep dumping packet contents. When console is enabled, it will costs much more cpu time to dump packet and soft lockup will be detected. net_ratelimit mechanism can be used to limit the dumping rate. PID: 33036 TASK: ffff949da6f20000 CPU: 23 COMMAND: "vhost-32980" #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e #3 [fffffe00003fced0] do_nmi at ffffffff8922660d #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663 [exception RIP: io_serial_in+20] RIP: ffffffff89792594 RSP: ffffa655314979e8 RFLAGS: 00000002 RAX: ffffffff89792500 RBX: ffffffff8af428a0 RCX: 0000000000000000 RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff8af428a0 RBP: 0000000000002710 R8: 0000000000000004 R9: 000000000000000f R10: 0000000000000000 R11: ffffffff8acbf64f R12: 0000000000000020 R13: ffffffff8acbf698 R14: 0000000000000058 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07 #12 [ffffa65531497b68] printk at ffffffff89318306 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun] #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun] #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net] #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost] #18 [ffffa65531497f10] kthread at ffffffff892d2e72 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors") Signed-off-by: Lei Chen <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Acked-by: Jason Wang <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

The rehash delayed work migrates filters from one region to another. This is done by iterating over all chunks (all the filters with the same priority) in the region and in each chunk iterating over all the filters. If the migration fails, the code tries to migrate the filters back to the old region. However, the rollback itself can also fail in which case another migration will be erroneously performed. Besides the fact that this ping pong is not a very good idea, it also creates a problem. Each virtual chunk references two chunks: The currently used one ('vchunk->chunk') and a backup ('vchunk->chunk2'). During migration the first holds the chunk we want to migrate filters to and the second holds the chunk we are migrating filters from. The code currently assumes - but does not verify - that the backup chunk does not exist (NULL) if the currently used chunk does not reference the target region. This assumption breaks when we are trying to rollback a rollback, resulting in the backup chunk being overwritten and leaked [1]. Fix by not rolling back a failed rollback and add a warning to avoid future cases. [1] WARNING: CPU: 5 PID: 1063 at lib/parman.c:291 parman_destroy+0x17/0x20 Modules linked in: CPU: 5 PID: 1063 Comm: kworker/5:11 Tainted: G W 6.9.0-rc2-custom-00784-gc6a05c468a0b #14 Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019 Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work RIP: 0010:parman_destroy+0x17/0x20 [...] Call Trace: <TASK> mlxsw_sp_acl_atcam_region_fini+0x19/0x60 mlxsw_sp_acl_tcam_region_destroy+0x49/0xf0 mlxsw_sp_acl_tcam_vregion_rehash_work+0x1f1/0x470 process_one_work+0x151/0x370 worker_thread+0x2cb/0x3e0 kthread+0xd0/0x100 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: 8435005 ("mlxsw: spectrum_acl: Do rollback as another call to mlxsw_sp_acl_tcam_vchunk_migrate_all()") Signed-off-by: Ido Schimmel <[email protected]> Tested-by: Alexander Zubkov <[email protected]> Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/d5edd4f4503934186ae5cfe268503b16345b4e0f.1713797103.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <[email protected]>

paulmckrcu and others added 30 commits November 30, 2021 17:20

rcu: Move rcu_needs_cpu() to tree.c

bc849e9

Now that RCU_FAST_NO_HZ is no more, there is but one implementation of the rcu_needs_cpu() function. This commit therefore moves this function from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c. Signed-off-by: Paul E. McKenney <[email protected]>

rcu: in_irq() cleanup

2407a64

This commit replaces the obsolete and ambiguous macro in_irq() with its shiny new in_hardirq() equivalent. Signed-off-by: Changbin Du <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

rcu: Improve tree_plugin.h comments and add code cleanups

17ea371

This commit cleans up some comments and code in kernel/rcu/tree_plugin.h. Signed-off-by: Zhouyi Zhou <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

tools/nolibc: Implement gettid()

b0fe9de

Allow test programs to determine their thread ID. Signed-off-by: Mark Brown <[email protected]> Cc: Willy Tarreau <[email protected]> Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

refscale: Simplify the errexit checkpoint

c30c876

There is only the one OOM error case in main_func(), so this commit eliminates the errexit local variable in favor of a branch to cleanup code. Signed-off-by: Li Zhijian <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

torture: Catch kvm.sh help text up with actual options

b6a4fd3

This commit brings the kvm.sh script's help text up to date with recently (and some not-so-recently) added parameters. Signed-off-by: Paul E. McKenney <[email protected]>

torture: Make kvm-find-errors.sh report link-time undefined symbols

c06354a

This commit makes kvm-find-errors.sh check for and report undefined symbols that are detected at link time. Signed-off-by: Paul E. McKenney <[email protected]>

rcutorture: Cause TREE02 and TREE10 scenarios to do more callback flo…

4ead4e3

…oding This commit enables two callback-flood kthreads for the TREE02 scenario and 28 for the TREE10 scenario. Cc: Neeraj Upadhyay <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GIT PULL] RCU changes for v5.17 #14

[GIT PULL] RCU changes for v5.17 #14

ammarfaizi2 commented Jan 10, 2022 •

edited

Loading

[GIT PULL] RCU changes for v5.17 #14

[GIT PULL] RCU changes for v5.17 #14

Conversation

ammarfaizi2 commented Jan 10, 2022 • edited Loading

ammarfaizi2 commented Jan 10, 2022 •

edited

Loading