-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add system calls lecture #13
Conversation
This makes it easier to edit ditaa directives since insertion will always move the rest of the row by one. Signed-off-by: Octavian Purdila <[email protected]>
Fixes the following warning: Documentation/teaching/lectures/intro.rst:737: WARNING: Bullet list ends without a blank line; unexpected unindent Signed-off-by: Octavian Purdila <[email protected]>
As noted in Sphinx #2442 new CSS added by extensions are rendered innefective if html_context its changed. So, instead, use add_stylesheet to add theme_overridesc.css Signed-off-by: Octavian Purdila <[email protected]>
Add asciicast directive that allows inserting asciinema "asciicasts" in docs. The directive accepts a single mandatory parameter which is the filename that stores the asciicast: .. asciicast:: ascii.cast Signed-off-by: Octavian Purdila <[email protected]>
Don't catch ditaa errors, let the user see them so that it is easier to understand the root cause of failures. Signed-off-by: Octavian Purdila <[email protected]>
Signed-off-by: Octavian Purdila <[email protected]>
Signed-off-by: Octavian Purdila <[email protected]>
For security reasons, when a user to kernel mode transitions occurs, | ||
the CPU's PC is set at specific kernel entry points. For system calls, | ||
this entry point will push the parameters on stack so that they are | ||
accesible by the system call functions and then it will run the system |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, it is not clear. parameters are passed in registers and then saved to kernel stack?
|
||
* The kernel entry point saves registers on the kernel stack | ||
|
||
* The system call dispatches identifies the system call function and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/dispatches/dispatcher
|
||
* When the system call function returns the userspace registers are | ||
restored, execution is switched back to user mode and the | ||
userspace application resumes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, to get to kernel mode we use a trap. How do we switch back to userspace? Just manually change the CPL?
__SYSCALL_I386(2, sys_fork, ) | ||
#else | ||
__SYSCALL_I386(2, sys_fork, ) | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entries for sys_fork look identical to me on both branches of #ifdef. Is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. This code is automatically generate and that why maybe it looks like this.
|
||
For example, lets consider the case where such a check is not made for | ||
the read or write system calls. If the user passes a kernel-space | ||
pointer to a write sysytem call then it can get access to kernel data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/sysytem/system
|
||
* Invalid syscall pointer: the faulting address is in kernel | ||
space; the fault address is in userspace and it is invalid | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't get "invalid syscall pointer" bullet. :)
So do you mean invalid user space pointer that is accessed from kernel.
space; the fault address is in userspace and it is invalid | ||
|
||
* Kernel bug: same as above | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also here we should add words even if it means copying thing said above. What is the difference between kernel bug and invalid userspace pointer.
|
||
* The exact instructions that accesses user space are recorded in | ||
a table (exception table) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accesses -> access
+------------------+-----------------------+------------------------+ | ||
| Cost | Pointer checks | Fault handling | | ||
+==================+=======================+========================+ | ||
| Valid address | address space search | 0 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[valid address - fault handling] I think it is O(1) because we still need to use copy_from_user which does some limits checking.
| Invalid address | address space search | exception table search | | ||
+------------------+-----------------------+------------------------+ | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear up till now which is the preferred method used by the kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does say so in the paragraph above
Signed-off-by: Octavian Purdila <[email protected]>
rdma_counter_init() may fail for a device. In such case while calculating total sum, ignore NULL hstats. This fixes below observed call trace. BUG: kernel NULL pointer dereference, address: 00000000000000a0 PGD 8000001009b30067 P4D 8000001009b30067 PUD 10549c9067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 55 PID: 20887 Comm: cat Kdump: loaded Not tainted 5.2.0-rc6-jdc+ linux-kernel-labs#13 RIP: 0010:rdma_counter_get_hwstat_value+0xf2/0x150 [ib_core] Call Trace: show_hw_stats+0x5e/0x130 [ib_core] dev_attr_show+0x15/0x50 sysfs_kf_seq_show+0xc6/0x1a0 seq_read+0x132/0x370 vfs_read+0x89/0x140 ksys_read+0x5c/0xd0 do_syscall_64+0x5a/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: f34a55e ("RDMA/core: Get sum value of all counters when perform a sysfs stat read") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Parav Pandit <[email protected]> Reviewed-by: Mark Zhang <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
security_secid_to_secctx is called by the bpf_lsm hook and a successful return value (i.e 0) implies that the parameter will be consumed by the LSM framework. The current behaviour return success when the pointer isn't initialized when CONFIG_BPF_LSM is enabled, with the default return from kernel/bpf/bpf_lsm.c. This is the internal error: [ 1229.341488][ T2659] usercopy: Kernel memory exposure attempt detected from null address (offset 0, size 280)! [ 1229.374977][ T2659] ------------[ cut here ]------------ [ 1229.376813][ T2659] kernel BUG at mm/usercopy.c:99! [ 1229.378398][ T2659] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 1229.380348][ T2659] Modules linked in: [ 1229.381654][ T2659] CPU: 0 PID: 2659 Comm: systemd-journal Tainted: G B W 5.7.0-rc5-next-20200511-00019-g864e0c6319b8-dirty linux-kernel-labs#13 [ 1229.385429][ T2659] Hardware name: linux,dummy-virt (DT) [ 1229.387143][ T2659] pstate: 80400005 (Nzcv daif +PAN -UAO BTYPE=--) [ 1229.389165][ T2659] pc : usercopy_abort+0xc8/0xcc [ 1229.390705][ T2659] lr : usercopy_abort+0xc8/0xcc [ 1229.392225][ T2659] sp : ffff000064247450 [ 1229.393533][ T2659] x29: ffff000064247460 x28: 0000000000000000 [ 1229.395449][ T2659] x27: 0000000000000118 x26: 0000000000000000 [ 1229.397384][ T2659] x25: ffffa000127049e0 x24: ffffa000127049e0 [ 1229.399306][ T2659] x23: ffffa000127048e0 x22: ffffa000127048a0 [ 1229.401241][ T2659] x21: ffffa00012704b80 x20: ffffa000127049e0 [ 1229.403163][ T2659] x19: ffffa00012704820 x18: 0000000000000000 [ 1229.405094][ T2659] x17: 0000000000000000 x16: 0000000000000000 [ 1229.407008][ T2659] x15: 0000000000000000 x14: 003d090000000000 [ 1229.408942][ T2659] x13: ffff80000d5b25b2 x12: 1fffe0000d5b25b1 [ 1229.410859][ T2659] x11: 1fffe0000d5b25b1 x10: ffff80000d5b25b1 [ 1229.412791][ T2659] x9 : ffffa0001034bee0 x8 : ffff00006ad92d8f [ 1229.414707][ T2659] x7 : 0000000000000000 x6 : ffffa00015eacb20 [ 1229.416642][ T2659] x5 : ffff0000693c8040 x4 : 0000000000000000 [ 1229.418558][ T2659] x3 : ffffa0001034befc x2 : d57a7483a01c6300 [ 1229.420610][ T2659] x1 : 0000000000000000 x0 : 0000000000000059 [ 1229.422526][ T2659] Call trace: [ 1229.423631][ T2659] usercopy_abort+0xc8/0xcc [ 1229.425091][ T2659] __check_object_size+0xdc/0x7d4 [ 1229.426729][ T2659] put_cmsg+0xa30/0xa90 [ 1229.428132][ T2659] unix_dgram_recvmsg+0x80c/0x930 [ 1229.429731][ T2659] sock_recvmsg+0x9c/0xc0 [ 1229.431123][ T2659] ____sys_recvmsg+0x1cc/0x5f8 [ 1229.432663][ T2659] ___sys_recvmsg+0x100/0x160 [ 1229.434151][ T2659] __sys_recvmsg+0x110/0x1a8 [ 1229.435623][ T2659] __arm64_sys_recvmsg+0x58/0x70 [ 1229.437218][ T2659] el0_svc_common.constprop.1+0x29c/0x340 [ 1229.438994][ T2659] do_el0_svc+0xe8/0x108 [ 1229.440587][ T2659] el0_svc+0x74/0x88 [ 1229.441917][ T2659] el0_sync_handler+0xe4/0x8b4 [ 1229.443464][ T2659] el0_sync+0x17c/0x180 [ 1229.444920][ T2659] Code: aa1703e2 aa1603e1 910a8260 97ecc860 (d4210000) [ 1229.447070][ T2659] ---[ end trace 400497d91baeaf51 ]--- [ 1229.448791][ T2659] Kernel panic - not syncing: Fatal exception [ 1229.450692][ T2659] Kernel Offset: disabled [ 1229.452061][ T2659] CPU features: 0x240002,20002004 [ 1229.453647][ T2659] Memory Limit: none [ 1229.455015][ T2659] ---[ end Kernel panic - not syncing: Fatal exception ]--- Rework the so the default return value is -EOPNOTSUPP. There are likely other callbacks such as security_inode_getsecctx() that may have the same problem, and that someone that understand the code better needs to audit them. Thank you Arnd for helping me figure out what went wrong. Fixes: 98e828a ("security: Refactor declaration of LSM hooks") Signed-off-by: Anders Roxell <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: James Morris <[email protected]> Cc: Arnd Bergmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
When the nvmem framework is enabled, a nvmem device is created per mtd device/partition. It is not uncommon that a device can have multiple mtd devices with partitions that have the same name. Eg, when there DT overlay is allowed and the same device with mtd is attached twice. Under that circumstances, the mtd fails to register due to a name duplication on the nvmem framework. With this patch we use the mtdX name instead of the partition name, which is unique. [ 8.948991] sysfs: cannot create duplicate filename '/bus/nvmem/devices/Production Data' [ 8.948992] CPU: 7 PID: 246 Comm: systemd-udevd Not tainted 5.5.0-qtec-standard linux-kernel-labs#13 [ 8.948993] Hardware name: AMD Dibbler/Dibbler, BIOS 05.22.04.0019 10/26/2019 [ 8.948994] Call Trace: [ 8.948996] dump_stack+0x50/0x70 [ 8.948998] sysfs_warn_dup.cold+0x17/0x2d [ 8.949000] sysfs_do_create_link_sd.isra.0+0xc2/0xd0 [ 8.949002] bus_add_device+0x74/0x140 [ 8.949004] device_add+0x34b/0x850 [ 8.949006] nvmem_register.part.0+0x1bf/0x640 ... [ 8.948926] mtd mtd8: Failed to register NVMEM device Fixes: c4dfa25 ("mtd: add support for reading MTD devices via the nvmem API") Signed-off-by: Ricardo Ribalda Delgado <[email protected]> Acked-by: Miquel Raynal <[email protected]> Signed-off-by: Richard Weinberger <[email protected]>
When CONFIG_DEBUG_KMAP_LOCAL is enabled __kmap_local_sched_{in,out} check that even slots in the tsk->kmap_ctrl.pteval are unmapped. The slots are initialized with 0 value, but the check is done with pte_none. 0 pte however does not necessarily mean that pte_none will return true. e.g. on xtensa it returns false, resulting in the following runtime warnings: WARNING: CPU: 0 PID: 101 at mm/highmem.c:627 __kmap_local_sched_out+0x51/0x108 CPU: 0 PID: 101 Comm: touch Not tainted 5.17.0-rc7-00010-gd3a1cdde80d2-dirty linux-kernel-labs#13 Call Trace: dump_stack+0xc/0x40 __warn+0x8f/0x174 warn_slowpath_fmt+0x48/0xac __kmap_local_sched_out+0x51/0x108 __schedule+0x71a/0x9c4 preempt_schedule_irq+0xa0/0xe0 common_exception_return+0x5c/0x93 do_wp_page+0x30e/0x330 handle_mm_fault+0xa70/0xc3c do_page_fault+0x1d8/0x3c4 common_exception+0x7f/0x7f WARNING: CPU: 0 PID: 101 at mm/highmem.c:664 __kmap_local_sched_in+0x50/0xe0 CPU: 0 PID: 101 Comm: touch Tainted: G W 5.17.0-rc7-00010-gd3a1cdde80d2-dirty linux-kernel-labs#13 Call Trace: dump_stack+0xc/0x40 __warn+0x8f/0x174 warn_slowpath_fmt+0x48/0xac __kmap_local_sched_in+0x50/0xe0 finish_task_switch$isra$0+0x1ce/0x2f8 __schedule+0x86e/0x9c4 preempt_schedule_irq+0xa0/0xe0 common_exception_return+0x5c/0x93 do_wp_page+0x30e/0x330 handle_mm_fault+0xa70/0xc3c do_page_fault+0x1d8/0x3c4 common_exception+0x7f/0x7f Fix it by replacing !pte_none(pteval) with pte_val(pteval) != 0. Link: https://lkml.kernel.org/r/[email protected] Fixes: 5fbda3e ("sched: highmem: Store local kmaps in task struct") Signed-off-by: Max Filippov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
This PR also adds an asciicast directive for asciiname playback.