This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

BUG: unable to handle kernel NULL pointer dereference at tcp_validate_incoming - tp->mptcp == NULL but tp->mpc is set #71

Closed
cpaasch opened this issue Jan 19, 2015 · 8 comments

Comments

@cpaasch
Member

cpaasch commented Jan 19, 2015

Happened at least once with the test simple_abndiff on f9edddd (v3.17 + MPTCP) with the scheduling patches (#70) applied. However, it seems to be unrelated to the scheduling.

Crash happens when accessing tp->mptcp in mptcp_reset_mopt().

[267465.961722] BUG: unable to handle kernel NULL pointer dereference at 0000000000000019
[267465.968188] IP: [<ffffffff8166305c>] tcp_validate_incoming+0x35c/0x3b0
[267465.974360] PGD 0
[267465.977647] Oops: 0000 [#1] SMP
[267465.981644] Modules linked in:
[267465.987349] CPU: 3 PID: 26928 Comm: apache2 Not tainted 3.17.0.mptcp #110
[267465.991906] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[267465.991906] task: ffff88002f2e2000 ti: ffff88003b5d0000 task.ti: ffff88003b5d0000
[267465.991906] RIP: 0010:[<ffffffff8166305c>]  [<ffffffff8166305c>] tcp_validate_incoming+0x35c/0x3b0
[267465.991906] RSP: 0018:ffff88003b5d3c18  EFLAGS: 00010202
[267465.991906] RAX: 0000000000000000 RBX: ffff88003c9cbd80 RCX: 0000000000000000
[267465.991906] RDX: 0000000000000000 RSI: ffff88002f2e26f0 RDI: 0000000000000286
[267465.991906] RBP: ffff88003b5d3c48 R08: 0000000000000000 R09: 0000000000000000
[267465.991906] R10: 0000000000000002 R11: 0000000000000000 R12: ffff88003c160c00
[267465.991906] R13: ffff88002f6b3462 R14: 0000000000000000 R15: ffff88003c160c28
[267465.991906] FS:  00007fd6f10c0700(0000) GS:ffff88003fd80000(0000) knlGS:0000000000000000
[267465.991906] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[267465.991906] CR2: 0000000000000019 CR3: 000000002ee47000 CR4: 00000000000006e0
[267465.991906] Stack:
[267465.991906]  0000000000000002 ffff88003c9cbd80 ffff88003c160c00 ffff88002f6b3462
[267465.991906]  0000000000000000 ffff88002ec9e988 ffff88003b5d3c98 ffffffff816641c8
[267465.991906]  ffff88003b5d3c78 ffff88003c160600 ffff88003e001700 ffff88003c160c00
[267465.991906] Call Trace:
[267465.991906]  [<ffffffff816641c8>] tcp_rcv_state_process+0x208/0x910
[267465.991906]  [<ffffffff8166ebf3>] tcp_v4_do_rcv+0xe3/0x2a0
[267465.991906]  [<ffffffff81703a10>] ? mptcp_backlog_rcv+0xa0/0xb0
[267465.991906]  [<ffffffff816f12fd>] tcp_v6_do_rcv+0x1dd/0x420
[267465.991906]  [<ffffffff817039bb>] mptcp_backlog_rcv+0x4b/0xb0
[267465.991906]  [<ffffffff815e87a2>] release_sock+0x92/0x1f0
[267465.991906]  [<ffffffff81705645>] mptcp_close+0x1f5/0x5c0
[267465.991906]  [<ffffffff816576fd>] tcp_close+0x2ed/0x4a0
[267465.991906]  [<ffffffff81685b4e>] inet_release+0xae/0xf0
[267465.991906]  [<ffffffff81685ac9>] ? inet_release+0x29/0xf0
[267465.991906]  [<ffffffff816c2d0f>] inet6_release+0x3f/0x50
[267465.991906]  [<ffffffff815e1349>] sock_release+0x29/0xa0
[267465.991906]  [<ffffffff815e1542>] sock_close+0x12/0x20
[267465.991906]  [<ffffffff811a4678>] __fput+0xc8/0x210
[267465.991906]  [<ffffffff811a486e>] ____fput+0xe/0x10
[267465.991906]  [<ffffffff8108a70d>] task_work_run+0xad/0xe0
[267465.991906]  [<ffffffff81014f45>] do_notify_resume+0x75/0x80
[267465.991906]  [<ffffffff813770de>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[267465.991906]  [<ffffffff817317e2>] int_signal+0x12/0x17
[267465.991906] Code: 8b cc 07 00 00 c1 e0 0a f7 f6 39 c1 0f 87 4c ff ff ff e9 12 fe ff ff f6 83 58 09 00 00 01 0f 84 9f fd ff ff 48 8b 83 60 09 00 00 <0f> b6 50 19 80 60 18 8f 83 e2 80 88 50 19 e9 85 fd ff ff 3b b3
[267465.991906] RIP  [<ffffffff8166305c>] tcp_validate_incoming+0x35c/0x3b0
[267465.991906]  RSP <ffff88003b5d3c18>
[267465.991906] CR2: 0000000000000019
[267465.991906] ---[ end trace dca7bd246b046643 ]---
[267465.991906] Kernel panic - not syncing: Fatal exception in interrupt
[267465.991906] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[267465.991906] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
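
For readers decoding the oops above: the faulting bytes `<0f> b6 50 19` in the Code line are a byte load from offset 0x19 off RAX, and RAX is 0, which matches CR2 = 0x19. The sketch below is a userspace model of that state, not the kernel code. The struct layouts and the helpers `fault_address()` and `reset_mopt_guarded()` are hypothetical names made up for illustration; only the field names `mpc` and `mptcp` come from the issue title.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-subflow MPTCP state. The padding only
 * exists so that `flags` sits at offset 0x19, the offset seen in CR2. */
struct mptcp_tcp_sock_model {
    uint8_t pad[0x19];
    uint8_t flags;
};

/* Hypothetical stand-in for struct tcp_sock: a flag plus a pointer that
 * are supposed to agree with each other. */
struct tcp_sock_model {
    bool mpc;                            /* socket is marked as MPTCP */
    struct mptcp_tcp_sock_model *mptcp;  /* per-subflow state, may be NULL */
};

/* Address the CPU would touch when reading mptcp->flags. */
uintptr_t fault_address(const struct mptcp_tcp_sock_model *mptcp)
{
    return (uintptr_t)mptcp + offsetof(struct mptcp_tcp_sock_model, flags);
}

/* Guarded reset: returns false instead of dereferencing NULL when the
 * flag says "MPTCP" but the per-subflow state is missing. */
bool reset_mopt_guarded(struct tcp_sock_model *tp)
{
    if (!tp->mpc)
        return true;            /* plain TCP, nothing to reset */
    if (tp->mptcp == NULL)
        return false;           /* inconsistent state from the issue title */
    tp->mptcp->flags = 0;       /* illustrative reset of the modelled field */
    return true;
}
```

With a NULL pointer, `fault_address()` yields 0x19 on the usual platforms where NULL is address 0, which is the value reported in CR2. A guard like this would only hide the inconsistency; it does not explain how `mpc` and `mptcp` came to disagree.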
@andywu106

Hi, cpaasch, any progress on this issue?

@cpaasch
Member Author

cpaasch commented Aug 10, 2015

I haven't seen this bug happening again. Can you reproduce it?

@andywu106

On 2015/8/10 23:52, Christoph Paasch wrote:

I haven't seen this bug happening again. Can you reproduce it?

I hit this once in mptcp_parse_options(), but I cannot reproduce it either :)



@cpaasch
Member Author

cpaasch commented Aug 11, 2015

Good that you hit it as well! This confirms that I was not dreaming :)

@cpaasch
Member Author

cpaasch commented Aug 11, 2015

Are you sure that it happened in mptcp_handle_options()?

I ask because I suspect that something might go wrong when a RST is processed (which would result in the socket being removed) while segments from the backlog queue are still being processed.

Maybe you still have the crash-trace?
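
The suspicion in the comment above (a RST tearing down state while later backlog segments are still queued) can be modelled in userspace as follows. This is only a sketch of the hypothesis, which the thread never confirms as the root cause; `struct seg`, `struct sock_model`, `handle_rst()` and `drain_backlog()` are hypothetical names and not kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

enum seg_kind { SEG_DATA, SEG_RST };

/* Hypothetical queued segment in a singly linked backlog. */
struct seg {
    enum seg_kind kind;
    struct seg *next;
};

/* Hypothetical per-subflow state that a RST detaches. */
struct subflow_model {
    int segs_seen;
};

struct sock_model {
    bool mpc;                       /* stays set in this model */
    struct subflow_model *mptcp;    /* detached by a RST */
    struct seg *backlog;
};

/* Model of RST handling: the per-subflow state goes away, the flag does not. */
void handle_rst(struct sock_model *sk)
{
    sk->mptcp = NULL;
}

/* Drain the backlog. Segments that arrive after the RST find no subflow
 * state; they are counted and skipped instead of dereferencing NULL.
 * Returns the number of skipped segments. */
int drain_backlog(struct sock_model *sk)
{
    int skipped = 0;

    for (struct seg *s = sk->backlog; s != NULL; s = s->next) {
        if (s->kind == SEG_RST) {
            handle_rst(sk);
            continue;
        }
        if (sk->mptcp == NULL) {
            skipped++;              /* state was torn down earlier in the loop */
            continue;
        }
        sk->mptcp->segs_seen++;
    }
    sk->backlog = NULL;
    return skipped;
}
```

Draining a backlog of DATA, RST, DATA processes the first segment, detaches the state on the RST, and skips the last segment. Without the NULL check inside the loop, that last segment would hit the same kind of dereference as in the reported trace.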

@andywu106

It did happen in mptcp_handle_options(), but I did not keep the crash-trace.

@cpaasch cpaasch changed the title from "tp->mptcp == NULL but tp->mpc is set" to "BUG: unable to handle kernel NULL pointer dereference at tcp_validate_incoming - tp->mptcp == NULL but tp->mpc is set" on Aug 20, 2015
cpaasch pushed a commit that referenced this issue Mar 21, 2016
[ Upstream commit ddf1d39 ]

An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.

proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
        BUG_ON(arg_start > arg_end);
        BUG_ON(env_start > env_end);

These can be changed by prctl_set_mm. It turns out that prctl_set_mm also
takes the semaphore for reading, which effectively renders the locking
useless. This results in:

  kernel BUG at fs/proc/base.c:240!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: virtio_net
  CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
  RIP: proc_pid_cmdline_read+0x520/0x530
  RSP: 0018:ffff8800784d3db8  EFLAGS: 00010206
  RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
  RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
  RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
  R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
  FS:  00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
  Call Trace:
    __vfs_read+0x37/0x100
    vfs_read+0x82/0x130
    SyS_read+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
  Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
  RIP   proc_pid_cmdline_read+0x520/0x530
  ---[ end trace 97882617ae9c6818 ]---

It turns out there are instances where the code just reads the aforementioned
values without any locking whatsoever - namely environ_read and get_cmdline.

Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.

The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.

The second patch is optional and puts locking around the aforementioned
consumers for safety.  Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.

This patch (of 2):

The code was taking the semaphore for reading, which protects against
neither other readers nor concurrent modifications.

The problem could cause a sanity check to fail in procfs's cmdline
reader, resulting in an oops.

Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modification.

Signed-off-by: Mateusz Guzik <[email protected]>
Acked-by: Cyrill Gorcunov <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Jarod Wilson <[email protected]>
Cc: Jan Stancek <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
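
The locking argument in the commit message above can be illustrated with a tiny single-threaded model of a reader/writer semaphore. This is a sketch under stated assumptions, not the kernel's rwsem: `struct rwsem_model` and the `*_model()` helpers are hypothetical, and they use try-lock semantics (return false instead of blocking) so that the example stays deterministic.

```c
#include <stdbool.h>

/* Hypothetical model of a reader/writer semaphore: any number of readers,
 * or exactly one writer, may hold it at a time. */
struct rwsem_model {
    int readers;
    bool writer;
};

/* Read mode is granted whenever no writer holds the semaphore. */
bool down_read_trylock_model(struct rwsem_model *sem)
{
    if (sem->writer)
        return false;
    sem->readers++;
    return true;
}

void up_read_model(struct rwsem_model *sem)
{
    sem->readers--;
}

/* Write mode is granted only when nobody else holds the semaphore. */
bool down_write_trylock_model(struct rwsem_model *sem)
{
    if (sem->writer || sem->readers > 0)
        return false;
    sem->writer = true;
    return true;
}

void up_write_model(struct rwsem_model *sem)
{
    sem->writer = false;
}
```

If the cmdline reader holds the semaphore in read mode, an updater that also asks for read mode gets in at the same time, so it can change the start/end values under the reader. An updater that asks for write mode is kept out until the reader is done, which is the exclusion the first patch relies on.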
@cpaasch
Copy link
Member Author

cpaasch commented Nov 3, 2016

We had some fixes that solved memory corruption issues. They have probably resolved this issue as well.

@cpaasch cpaasch closed this as completed Nov 3, 2016
pabeni pushed a commit to pabeni/mptcp that referenced this issue Nov 10, 2020
When target-side tracing is turned on and a flush command is issued from
the host, it results in the following Oops.

[  856.789724] BUG: kernel NULL pointer dereference, address: 0000000000000068
[  856.790686] #PF: supervisor read access in kernel mode
[  856.791262] #PF: error_code(0x0000) - not-present page
[  856.791863] PGD 6d7110067 P4D 6d7110067 PUD 66f0ad067 PMD 0
[  856.792527] Oops: 0000 [#1] SMP NOPTI
[  856.792950] CPU: 15 PID: 7034 Comm: nvme Tainted: G           OE     5.9.0nvme-5.9+ #71
[  856.793790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[  856.794956] RIP: 0010:trace_event_raw_event_nvmet_req_init+0x13e/0x170 [nvmet]
[  856.795734] Code: 41 5c 41 5d c3 31 d2 31 f6 e8 4e 9b b8 e0 e9 0e ff ff ff 49 8b 55 00 48 8b 38 8b 0
[  856.797740] RSP: 0018:ffffc90001be3a60 EFLAGS: 00010246
[  856.798375] RAX: 0000000000000000 RBX: ffff8887e7d2c01c RCX: 0000000000000000
[  856.799234] RDX: 0000000000000020 RSI: 0000000057e70ea2 RDI: ffff8887e7d2c034
[  856.800088] RBP: ffff88869f710578 R08: ffff888807500d40 R09: 00000000fffffffe
[  856.800951] R10: 0000000064c66670 R11: 00000000ef955201 R12: ffff8887e7d2c034
[  856.801807] R13: ffff88869f7105c8 R14: 0000000000000040 R15: ffff88869f710440
[  856.802667] FS:  00007f6a22bd8780(0000) GS:ffff888813a00000(0000) knlGS:0000000000000000
[  856.803635] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  856.804367] CR2: 0000000000000068 CR3: 00000006d73e0000 CR4: 00000000003506e0
[  856.805283] Call Trace:
[  856.805613]  nvmet_req_init+0x27c/0x480 [nvmet]
[  856.806200]  nvme_loop_queue_rq+0xcb/0x1d0 [nvme_loop]
[  856.806862]  blk_mq_dispatch_rq_list+0x123/0x7b0
[  856.807459]  ? kvm_sched_clock_read+0x14/0x30
[  856.808025]  __blk_mq_sched_dispatch_requests+0xc7/0x170
[  856.808708]  blk_mq_sched_dispatch_requests+0x30/0x60
[  856.809372]  __blk_mq_run_hw_queue+0x70/0x100
[  856.809935]  __blk_mq_delay_run_hw_queue+0x156/0x170
[  856.810574]  blk_mq_run_hw_queue+0x86/0xe0
[  856.811104]  blk_mq_sched_insert_request+0xef/0x160
[  856.811733]  blk_execute_rq+0x69/0xc0
[  856.812212]  ? blk_mq_rq_ctx_init+0xd0/0x230
[  856.812784]  nvme_execute_passthru_rq+0x57/0x130 [nvme_core]
[  856.813461]  nvme_submit_user_cmd+0xeb/0x300 [nvme_core]
[  856.814099]  nvme_user_cmd.isra.82+0x11e/0x1a0 [nvme_core]
[  856.814752]  blkdev_ioctl+0x1dc/0x2c0
[  856.815197]  block_ioctl+0x3f/0x50
[  856.815606]  __x64_sys_ioctl+0x84/0xc0
[  856.816074]  do_syscall_64+0x33/0x40
[  856.816533]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  856.817168] RIP: 0033:0x7f6a222ed107
[  856.817617] Code: 44 00 00 48 8b 05 81 cd 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 8
[  856.819901] RSP: 002b:00007ffca848f058 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  856.820846] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6a222ed107
[  856.821726] RDX: 00007ffca848f060 RSI: 00000000c0484e43 RDI: 0000000000000003
[  856.822603] RBP: 0000000000000003 R08: 000000000000003f R09: 0000000000000005
[  856.823478] R10: 00007ffca848ece0 R11: 0000000000000202 R12: 00007ffca84912d3
[  856.824359] R13: 00007ffca848f4d0 R14: 0000000000000002 R15: 000000000067e900
[  856.825236] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corel

Move the nvmet_req_init() tracepoint after we parse the command in
nvmet_req_init() so that we can get rid of the duplicate
nvmet_find_namespace() call.
Rename __assign_disk_name() to __assign_req_name(). Now that the
tracepoint is called after the command is parsed, simplify the newly added
__assign_req_name(), which fixes this bug.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>