-
Notifications
You must be signed in to change notification settings - Fork 16
Stability issues with v59 #14
Comments
I have two different devices that I regularly sleep and I have never had any issues of instability when resuming. Are you talking about just sleeping the computer or hibernating it? I guess I will probably have to let other poeple chime in on whether or not they have similar experiences. What CPU does your computer have? What version of Windows are you running? I am not sure what kind of logs you can gather, do you get an error message? |
Actually it's hibernate. I forgot that the Windows engineers think it's Ok to start your computer, install updates and reboot without you being able to stop it if you just sleep. That's awesome fun when your (expensive, new) laptop has terrible heat dissipation and can overheat and physically damage components, particularly when in a laptop bag
Ok, so that has just suggested a possibly more likely candidate - the host OS. As mentioned I occasionally got crashes so might have got them also between updating the host and installing the WSL kernel. Others might have the same host kernel too though! |
When the kernel mapping was moved the last 2GB of the address space, (__va(PFN_PHYS(max_low_pfn))) is much smaller than the .data section start address, the last set_memory_nx() in protect_kernel_text_data() will fail, thus the .data section is still mapped as W+X. This results in below W+X mapping waring at boot. Fix it by passing the correct .data section page num to the set_memory_nx(). [ 0.396516] ------------[ cut here ]------------ [ 0.396889] riscv/mm: Found insecure W+X mapping at address (____ptrval____)/0xffffffff80c00000 [ 0.398347] WARNING: CPU: 0 PID: 1 at arch/riscv/mm/ptdump.c:258 note_page+0x244/0x24a [ 0.398964] Modules linked in: [ 0.399459] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc1+ #14 [ 0.400003] Hardware name: riscv-virtio,qemu (DT) [ 0.400591] epc : note_page+0x244/0x24a [ 0.401368] ra : note_page+0x244/0x24a [ 0.401772] epc : ffffffff80007c86 ra : ffffffff80007c86 sp : ffffffe000e7bc30 [ 0.402304] gp : ffffffff80caae88 tp : ffffffe000e70000 t0 : ffffffff80cb80cf [ 0.402800] t1 : ffffffff80cb80c0 t2 : 0000000000000000 s0 : ffffffe000e7bc80 [ 0.403310] s1 : ffffffe000e7bde8 a0 : 0000000000000053 a1 : ffffffff80c83ff0 [ 0.403805] a2 : 0000000000000010 a3 : 0000000000000000 a4 : 6c7e7a5137233100 [ 0.404298] a5 : 6c7e7a5137233100 a6 : 0000000000000030 a7 : ffffffffffffffff [ 0.404849] s2 : ffffffff80e00000 s3 : 0000000040000000 s4 : 0000000000000000 [ 0.405393] s5 : 0000000000000000 s6 : 0000000000000003 s7 : ffffffe000e7bd48 [ 0.405935] s8 : ffffffff81000000 s9 : ffffffffc0000000 s10: ffffffe000e7bd48 [ 0.406476] s11: 0000000000001000 t3 : 0000000000000072 t4 : ffffffffffffffff [ 0.407016] t5 : 0000000000000002 t6 : ffffffe000e7b978 [ 0.407435] status: 0000000000000120 badaddr: 0000000000000000 cause: 0000000000000003 [ 0.408052] Call Trace: [ 0.408343] [<ffffffff80007c86>] note_page+0x244/0x24a [ 0.408855] [<ffffffff8010c5a6>] ptdump_hole+0x14/0x1e [ 0.409263] [<ffffffff800f65c6>] walk_pgd_range+0x2a0/0x376 [ 0.409690] [<ffffffff800f6828>] walk_page_range_novma+0x4e/0x6e [ 0.410146] [<ffffffff8010c5f8>] ptdump_walk_pgd+0x48/0x78 [ 0.410570] [<ffffffff80007d66>] ptdump_check_wx+0xb4/0xf8 [ 0.410990] [<ffffffff80006738>] mark_rodata_ro+0x26/0x2e [ 0.411407] [<ffffffff8031961e>] kernel_init+0x44/0x108 [ 0.411814] [<ffffffff80002312>] ret_from_exception+0x0/0xc [ 0.412309] ---[ end trace 7ec3459f2547ea83 ]--- [ 0.413141] Checked W+X mappings: failed, 512 W+X pages found Fixes: 2bfc6cd ("riscv: Move kernel mapping outside of linear mapping") Signed-off-by: Jisheng Zhang <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
ASan reports a heap-buffer-overflow in elf_sec__is_text when using perf-top. The bug is caused by the fact that secstrs is built from runtime_ss, while shdr is built from syms_ss if shdr.sh_type != SHT_NOBITS. Therefore, they point to two different ELF files. This patch renames secstrs to secstrs_run and adds secstrs_sym, so that the correct secstrs is chosen depending on shdr.sh_type. $ ASAN_OPTIONS=abort_on_error=1:disable_coredump=0:unmap_shadow_on_exit=1 ./perf top ================================================================= ==363148==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61300009add6 at pc 0x00000049875c bp 0x7f4f56446440 sp 0x7f4f56445bf0 READ of size 1 at 0x61300009add6 thread T6 #0 0x49875b in StrstrCheck(void*, char*, char const*, char const*) (/home/user/linux/tools/perf/perf+0x49875b) #1 0x4d13a2 in strstr (/home/user/linux/tools/perf/perf+0x4d13a2) #2 0xacae36 in elf_sec__is_text /home/user/linux/tools/perf/util/symbol-elf.c:176:9 #3 0xac3ec9 in elf_sec__filter /home/user/linux/tools/perf/util/symbol-elf.c:187:9 #4 0xac2c3d in dso__load_sym /home/user/linux/tools/perf/util/symbol-elf.c:1254:20 #5 0x883981 in dso__load /home/user/linux/tools/perf/util/symbol.c:1897:9 #6 0x8e6248 in map__load /home/user/linux/tools/perf/util/map.c:332:7 #7 0x8e66e5 in map__find_symbol /home/user/linux/tools/perf/util/map.c:366:6 #8 0x7f8278 in machine__resolve /home/user/linux/tools/perf/util/event.c:707:13 #9 0x5f3d1a in perf_event__process_sample /home/user/linux/tools/perf/builtin-top.c:773:6 #10 0x5f30e4 in deliver_event /home/user/linux/tools/perf/builtin-top.c:1197:3 #11 0x908a72 in do_flush /home/user/linux/tools/perf/util/ordered-events.c:244:9 #12 0x905fae in __ordered_events__flush /home/user/linux/tools/perf/util/ordered-events.c:323:8 #13 0x9058db in ordered_events__flush /home/user/linux/tools/perf/util/ordered-events.c:341:9 #14 0x5f19b1 in process_thread /home/user/linux/tools/perf/builtin-top.c:1109:7 #15 0x7f4f6a21a298 in start_thread /usr/src/debug/glibc-2.33-16.fc34.x86_64/nptl/pthread_create.c:481:8 #16 0x7f4f697d0352 in clone ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 0x61300009add6 is located 10 bytes to the right of 332-byte region [0x61300009ac80,0x61300009adcc) allocated by thread T6 here: #0 0x4f3f7f in malloc (/home/user/linux/tools/perf/perf+0x4f3f7f) #1 0x7f4f6a0a88d9 (/lib64/libelf.so.1+0xa8d9) Thread T6 created by T0 here: #0 0x464856 in pthread_create (/home/user/linux/tools/perf/perf+0x464856) #1 0x5f06e0 in __cmd_top /home/user/linux/tools/perf/builtin-top.c:1309:6 #2 0x5ef19f in cmd_top /home/user/linux/tools/perf/builtin-top.c:1762:11 #3 0x7b28c0 in run_builtin /home/user/linux/tools/perf/perf.c:313:11 #4 0x7b119f in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8 #5 0x7b2423 in run_argv /home/user/linux/tools/perf/perf.c:409:2 #6 0x7b0c19 in main /home/user/linux/tools/perf/perf.c:539:3 #7 0x7f4f696f7b74 in __libc_start_main /usr/src/debug/glibc-2.33-16.fc34.x86_64/csu/../csu/libc-start.c:332:16 SUMMARY: AddressSanitizer: heap-buffer-overflow (/home/user/linux/tools/perf/perf+0x49875b) in StrstrCheck(void*, char*, char const*, char const*) Shadow bytes around the buggy address: 0x0c268000b560: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b570: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b580: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b590: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0c268000b5a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 =>0x0c268000b5b0: 00 00 00 00 00 00 00 00 00 04[fa]fa fa fa fa fa 0x0c268000b5c0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00 0x0c268000b5d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0c268000b5e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0c268000b5f0: 07 fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c268000b600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb Shadow gap: cc ==363148==ABORTING Suggested-by: Jiri Slaby <[email protected]> Signed-off-by: Riccardo Mancini <[email protected]> Acked-by: Namhyung Kim <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Fabian Hemmer <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Remi Bernon <[email protected]> Link: http://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
The ADC controller on the board is fed by a 2.5V reference voltage. By default the channels #14 and #15 are dedicated to analog input (marked AN on the board), on the connectors mikrobus1 and mikrobus2. Signed-off-by: Eugen Hristev <[email protected]> Signed-off-by: Nicolas Ferre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
It's later supposed to be either a correct address or NULL. Without the initialization, it may contain an undefined value which results in the following segmentation fault: # perf top --sort comm -g --ignore-callees=do_idle terminates with: #0 0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6 #1 0x00007ffff55e3802 in strdup () from /lib64/libc.so.6 #2 0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489 #3 hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564 #4 0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420, sample_self=sample_self@entry=true) at util/hist.c:657 #5 0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0, sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288 #6 0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38) at util/hist.c:1056 #7 iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056 #8 0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0) at util/hist.c:1231 #9 0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0) at builtin-top.c:842 #10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202 #11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244 #12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323 #13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339 #14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341 #15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339 #16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114 #17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0 #18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6 If you look at the frame #2, the code is: 488 if (he->srcline) { 489 he->srcline = strdup(he->srcline); 490 if (he->srcline == NULL) 491 goto err_rawdata; 492 } If he->srcline is not NULL (it is not NULL if it is uninitialized rubbish), it gets strdupped and strdupping a rubbish random string causes the problem. Also, if you look at the commit 1fb7d06, it adds the srcline property into the struct, but not initializing it everywhere needed. Committer notes: Now I see, when using --ignore-callees=do_idle we end up here at line 2189 in add_callchain_ip(): 2181 if (al.sym != NULL) { 2182 if (perf_hpp_list.parent && !*parent && 2183 symbol__match_regex(al.sym, &parent_regex)) 2184 *parent = al.sym; 2185 else if (have_ignore_callees && root_al && 2186 symbol__match_regex(al.sym, &ignore_callees_regex)) { 2187 /* Treat this symbol as the root, 2188 forgetting its callees. */ 2189 *root_al = al; 2190 callchain_cursor_reset(cursor); 2191 } 2192 } And the al that doesn't have the ->srcline field initialized will be copied to the root_al, so then, back to: 1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al, 1212 int max_stack_depth, void *arg) 1213 { 1214 int err, err2; 1215 struct map *alm = NULL; 1216 1217 if (al) 1218 alm = map__get(al->map); 1219 1220 err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent, 1221 iter->evsel, al, max_stack_depth); 1222 if (err) { 1223 map__put(alm); 1224 return err; 1225 } 1226 1227 err = iter->ops->prepare_entry(iter, al); 1228 if (err) 1229 goto out; 1230 1231 err = iter->ops->add_single_entry(iter, al); 1232 if (err) 1233 goto out; 1234 That al at line 1221 is what hist_entry_iter__add() (called from sample__resolve_callchain()) saw as 'root_al', and then: iter->ops->add_single_entry(iter, al); will go on with al->srcline with a bogus value, I'll add the above sequence to the cset and apply, thanks! Signed-off-by: Michael Petlan <[email protected]> CC: Milian Wolff <[email protected]> Cc: Jiri Olsa <[email protected]> Fixes: 1fb7d06 ("perf report Use srcline from callchain for hist entries") Link: https //lore.kernel.org/r/[email protected] Reported-by: Juri Lelli <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Ido Schimmel says: ==================== mlxsw: Add support for IP-in-IP with IPv6 underlay Currently, mlxsw only supports IP-in-IP with IPv4 underlay. Traffic routed through 'gre' netdevs is encapsulated with IPv4 and GRE headers. Similarly, incoming IPv4 GRE packets are decapsulated and routed in the overlay VRF (which can be the same as the underlay VRF). This patchset adds support for IPv6 underlay using the 'ip6gre' netdev. Due to architectural differences between Spectrum-1 and later ASICs, this functionality is only supported on Spectrum-2 onwards (the software data path is used for Spectrum-1). Patchset overview: Patches #1-#5 are preparations. Patches #6-#9 add and extend required device registers. Patches #10-#14 gradually add IPv6 underlay support. A follow-up patchset will add net/forwarding/ selftests. ==================== Signed-off-by: David S. Miller <[email protected]>
Attempting to defragment a Btrfs file containing a transparent huge page immediately deadlocks with the following stack trace: #0 context_switch (kernel/sched/core.c:4940:2) #1 __schedule (kernel/sched/core.c:6287:8) #2 schedule (kernel/sched/core.c:6366:3) #3 io_schedule (kernel/sched/core.c:8389:2) #4 wait_on_page_bit_common (mm/filemap.c:1356:4) #5 __lock_page (mm/filemap.c:1648:2) #6 lock_page (./include/linux/pagemap.h:625:3) #7 pagecache_get_page (mm/filemap.c:1910:4) #8 find_or_create_page (./include/linux/pagemap.h:420:9) #9 defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9) #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14) #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9) #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9) #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9) #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10) #15 vfs_ioctl (fs/ioctl.c:51:10) #16 __do_sys_ioctl (fs/ioctl.c:874:11) #17 __se_sys_ioctl (fs/ioctl.c:860:1) #18 __x64_sys_ioctl (fs/ioctl.c:860:1) #19 do_syscall_x64 (arch/x86/entry/common.c:50:14) #20 do_syscall_64 (arch/x86/entry/common.c:80:7) #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113) A huge page is represented by a compound page, which consists of a struct page for each PAGE_SIZE page within the huge page. The first struct page is the "head page", and the remaining are "tail pages". Defragmentation attempts to lock each page in the range. However, lock_page() on a tail page actually locks the corresponding head page. So, if defragmentation tries to lock more than one struct page in a compound page, it tries to lock the same head page twice and deadlocks with itself. Ideally, we should be able to defragment transparent huge pages. However, THP for filesystems is currently read-only, so a lot of code is not ready to use huge pages for I/O. For now, let's just return ETXTBUSY. This can be reproduced with the following on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y: $ cat create_thp_file.c #include <fcntl.h> #include <stdbool.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> static const char zeroes[1024 * 1024]; static const size_t FILE_SIZE = 2 * 1024 * 1024; int main(int argc, char **argv) { if (argc != 2) { fprintf(stderr, "usage: %s PATH\n", argv[0]); return EXIT_FAILURE; } int fd = creat(argv[1], 0777); if (fd == -1) { perror("creat"); return EXIT_FAILURE; } size_t written = 0; while (written < FILE_SIZE) { ssize_t ret = write(fd, zeroes, sizeof(zeroes) < FILE_SIZE - written ? sizeof(zeroes) : FILE_SIZE - written); if (ret < 0) { perror("write"); return EXIT_FAILURE; } written += ret; } close(fd); fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("open"); return EXIT_FAILURE; } /* * Reserve some address space so that we can align the file mapping to * the huge page size. */ void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (placeholder_map == MAP_FAILED) { perror("mmap (placeholder)"); return EXIT_FAILURE; } void *aligned_address = (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1)); void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC, MAP_SHARED | MAP_FIXED, fd, 0); if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; } if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) { perror("madvise"); return EXIT_FAILURE; } char *line = NULL; size_t line_capacity = 0; FILE *smaps_file = fopen("/proc/self/smaps", "r"); if (!smaps_file) { perror("fopen"); return EXIT_FAILURE; } for (;;) { for (size_t off = 0; off < FILE_SIZE; off += 4096) ((volatile char *)map)[off]; ssize_t ret; bool this_mapping = false; while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) { unsigned long start, end, huge; if (sscanf(line, "%lx-%lx", &start, &end) == 2) { this_mapping = (start <= (uintptr_t)map && (uintptr_t)map < end); } else if (this_mapping && sscanf(line, "FilePmdMapped: %ld", &huge) == 1 && huge > 0) { return EXIT_SUCCESS; } } sleep(6); rewind(smaps_file); fflush(smaps_file); } } $ ./create_thp_file huge $ btrfs fi defrag -czstd ./huge Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
The exit function fixes a memory leak with the src field as detected by leak sanitizer. An example of which is: Indirect leak of 25133184 byte(s) in 207 object(s) allocated from: #0 0x7f199ecfe987 in __interceptor_calloc libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x55defe638224 in annotated_source__alloc_histograms util/annotate.c:803 #2 0x55defe6397e4 in symbol__hists util/annotate.c:952 #3 0x55defe639908 in symbol__inc_addr_samples util/annotate.c:968 #4 0x55defe63aa29 in hist_entry__inc_addr_samples util/annotate.c:1119 #5 0x55defe499a79 in hist_iter__report_callback tools/perf/builtin-report.c:182 #6 0x55defe7a859d in hist_entry_iter__add util/hist.c:1236 #7 0x55defe49aa63 in process_sample_event tools/perf/builtin-report.c:315 #8 0x55defe731bc8 in evlist__deliver_sample util/session.c:1473 #9 0x55defe731e38 in machines__deliver_event util/session.c:1510 #10 0x55defe732a23 in perf_session__deliver_event util/session.c:1590 #11 0x55defe72951e in ordered_events__deliver_event util/session.c:183 #12 0x55defe740082 in do_flush util/ordered-events.c:244 #13 0x55defe7407cb in __ordered_events__flush util/ordered-events.c:323 #14 0x55defe740a61 in ordered_events__flush util/ordered-events.c:341 #15 0x55defe73837f in __perf_session__process_events util/session.c:2390 #16 0x55defe7385ff in perf_session__process_events util/session.c:2420 ... Signed-off-by: Ian Rogers <[email protected]> Acked-by: Namhyung Kim <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kajol Jain <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Martin Liška <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Hopefully things have been better. I am going to close this up for now, as I doubt this is strictly related to the kernel and even if it is, I do not have the tools to debug what is going on as WSL is a bit of a black box. |
Ido Schimmel says: ==================== HW counters for soft devices Petr says: Offloading switch device drivers may be able to collect statistics of the traffic taking place in the HW datapath that pertains to a certain soft netdevice, such as a VLAN. In this patch set, add the necessary infrastructure to allow exposing these statistics to the offloaded netdevice in question, and add mlxsw offload. Across HW platforms, the counter itself very likely constitutes a limited resource, and the act of counting may have a performance impact. Therefore this patch set makes the HW statistics collection opt-in and togglable from userspace on a per-netdevice basis. Additionally, HW devices may have various limiting conditions under which they can realize the counter. Therefore it is also possible to query whether the requested counter is realized by any driver. In TC parlance, which is to a degree reused in this patch set, two values are recognized: "request" tracks whether the user enabled collecting HW statistics, and "used" tracks whether any HW statistics are actually collected. In the past, this author has expressed the opinion that `a typical user doing "ip -s l sh", including various scripts, wants to see the full picture and not worry what's going on where'. While that would be nice, unfortunately it cannot work: - Packets that trap from the HW datapath to the SW datapath would be double counted. For a given netdevice, some traffic can be purely a SW artifact, and some may flow through the HW object corresponding to the netdevice. But some traffic can also get trapped to the SW datapath after bumping the HW counter. It is not clear how to make sure double-counting does not occur in the SW datapath in that case, while still making sure that possibly divergent SW forwarding path gets bumped as appropriate. So simply adding HW and SW stats may work roughly, most of the time, but there are scenarios where the result is nonsensical. - HW devices will have limitations as to what type of traffic they can count. In case of mlxsw, which is part of this patch set, there is no reasonable way to count all traffic going through a certain netdevice, such as a VLAN netdevice enslaved to a bridge. It is however very simple to count traffic flowing through an L3 object, such as a VLAN netdevice with an IP address. Similarly for physical netdevices, the L3 object at which the counter is installed is the subport carrying untagged traffic. These are not "just counters". It is important that the user understands what is being counted. It would be incorrect to conflate these statistics with another existing statistics suite. To that end, this patch set introduces a statistics suite called "L3 stats". This label should make it easy to understand what is being counted, and to decide whether a given device can or cannot implement this suite for some type of netdevice. At the same time, the code is written to make future extensions easy, should a device pop up that can implement a different flavor of statistics suite (say L2, or an address-family-specific suite). For example, using a work-in-progress iproute2[1], to turn on and then list the counters on a VLAN netdevice: # ip stats set dev swp1.200 l3_stats on # ip stats show dev swp1.200 group offload subgroup l3_stats 56: swp1.200: group offload subgroup l3_stats on used on RX: bytes packets errors dropped missed mcast 0 0 0 0 0 0 TX: bytes packets errors dropped carrier collsns 0 0 0 0 0 0 The patchset progresses as follows: - Patch #1 is a cleanup. - In patch #2, remove the assumption that all LINK_OFFLOAD_XSTATS are dev-backed. The only attribute defined under the nest is currently IFLA_OFFLOAD_XSTATS_CPU_HIT. L3_STATS differs from CPU_HIT in that the driver that supplies the statistics is not the same as the driver that implements the netdevice. Make the code compatible with this in patch #2. - In patch #3, add the possibility to filter inside nests. The filter_mask field of RTM_GETSTATS header determines which top-level attributes should be included in the netlink response. This saves processing time by only including the bits that the user cares about instead of always dumping everything. This is doubly important for HW-backed statistics that would typically require a trip to the device to fetch the stats. In this patch, the UAPI is extended to allow filtering inside IFLA_STATS_LINK_OFFLOAD_XSTATS in particular, but the scheme is easily extensible to other nests as well. - In patch #4, propagate extack where we need it. In patch #5, make it possible to propagate errors from drivers to the user. - In patch #6, add the in-kernel APIs for keeping track of the new stats suite, and the notifiers that the core uses to communicate with the drivers. - In patch #7, add UAPI for obtaining the new stats suite. - In patch #8, add a new UAPI message, RTM_SETSTATS, which will carry the message to toggle the newly-added stats suite. In patch #9, add the toggle itself. At this point the core is ready for drivers to add support for the new stats suite. - In patches #10, #11 and #12, apply small tweaks to mlxsw code. - In patch #13, add support for L3 stats, which are realized as RIF counters. - Finally in patch #14, a selftest is added to the net/forwarding directory. Technically this is a HW-specific test, in that without a HW implementing the counters, it just will not pass. But devices that support L3 statistics at all are likely to be able to reuse this selftest, so it seems appropriate to put it in the general forwarding directory. We also have a netdevsim implementation, and a corresponding selftest that verifies specifically some of the core code. We intend to contribute these later. Interested parties can take a look at the raw code at [2]. [1] https://github.com/pmachata/iproute2/commits/soft_counters [2] https://github.com/pmachata/linux_mlxsw/commits/petrm_soft_counters_2 v2: - Patch #3: - Do not declare strict_start_type at the new policies, since they are used with nla_parse_nested() (sans _deprecated). - Use NLA_POLICY_NESTED to declare what the nest contents should be - Use NLA_POLICY_MASK instead of BITFIELD32 for the filtering attribute. - Patch #6: - s/monotonous/monotonic/ in commit message - Use a newly-added struct rtnl_hw_stats64 for stats transfer - Patch #7: - Use a newly-added struct rtnl_hw_stats64 for stats transfer - Patch #8: - Do not declare strict_start_type at the new policies, since they are used with nla_parse_nested() (sans _deprecated). - Patch #13: - Use a newly-added struct rtnl_hw_stats64 for stats transfer ==================== Signed-off-by: David S. Miller <[email protected]>
…ATE_QUEUED If the callback 'start_streaming' fails, then all queued buffers in the driver should be returned with state 'VB2_BUF_STATE_QUEUED'. Currently, they are returned with 'VB2_BUF_STATE_ERROR' which is wrong. Fix this. This also fixes the warning: [ 65.583633] WARNING: CPU: 5 PID: 593 at drivers/media/common/videobuf2/videobuf2-core.c:1612 vb2_start_streaming+0xd4/0x160 [videobuf2_common] [ 65.585027] Modules linked in: snd_usb_audio snd_hwdep snd_usbmidi_lib snd_rawmidi snd_soc_hdmi_codec dw_hdmi_i2s_audio saa7115 stk1160 videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc crct10dif_ce panfrost snd_soc_simple_card snd_soc_audio_graph_card snd_soc_spdif_tx snd_soc_simple_card_utils gpu_sched phy_rockchip_pcie snd_soc_rockchip_i2s rockchipdrm analogix_dp dw_mipi_dsi dw_hdmi cec drm_kms_helper drm rtc_rk808 rockchip_saradc industrialio_triggered_buffer kfifo_buf rockchip_thermal pcie_rockchip_host ip_tables x_tables ipv6 [ 65.589383] CPU: 5 PID: 593 Comm: v4l2src0:src Tainted: G W 5.16.0-rc4-62408-g32447129cb30-dirty #14 [ 65.590293] Hardware name: Radxa ROCK Pi 4B (DT) [ 65.590696] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 65.591304] pc : vb2_start_streaming+0xd4/0x160 [videobuf2_common] [ 65.591850] lr : vb2_start_streaming+0x6c/0x160 [videobuf2_common] [ 65.592395] sp : ffff800012bc3ad0 [ 65.592685] x29: ffff800012bc3ad0 x28: 0000000000000000 x27: ffff800012bc3cd8 [ 65.593312] x26: 0000000000000000 x25: ffff00000d8a7800 x24: 0000000040045612 [ 65.593938] x23: ffff800011323000 x22: ffff800012bc3cd8 x21: ffff00000908a8b0 [ 65.594562] x20: ffff00000908a8c8 x19: 00000000fffffff4 x18: ffffffffffffffff [ 65.595188] x17: 000000040044ffff x16: 00400034b5503510 x15: ffff800011323f78 [ 65.595813] x14: ffff000013163886 x13: ffff000013163885 x12: 00000000000002ce [ 65.596439] x11: 0000000000000028 x10: 0000000000000001 x9 : 0000000000000228 [ 65.597064] x8 : 0101010101010101 x7 : 7f7f7f7f7f7f7f7f x6 : fefefeff726c5e78 [ 65.597690] x5 : ffff800012bc3990 x4 : 0000000000000000 x3 : ffff000009a34880 [ 65.598315] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000007cd99f0 [ 65.598940] Call trace: [ 65.599155] vb2_start_streaming+0xd4/0x160 [videobuf2_common] [ 65.599672] vb2_core_streamon+0x17c/0x1a8 [videobuf2_common] [ 65.600179] vb2_streamon+0x54/0x88 [videobuf2_v4l2] [ 65.600619] vb2_ioctl_streamon+0x54/0x60 [videobuf2_v4l2] [ 65.601103] v4l_streamon+0x3c/0x50 [videodev] [ 65.601521] __video_do_ioctl+0x1a4/0x428 [videodev] [ 65.601977] video_usercopy+0x320/0x828 [videodev] [ 65.602419] video_ioctl2+0x3c/0x58 [videodev] [ 65.602830] v4l2_ioctl+0x60/0x90 [videodev] [ 65.603227] __arm64_sys_ioctl+0xa8/0xe0 [ 65.603576] invoke_syscall+0x54/0x118 [ 65.603911] el0_svc_common.constprop.3+0x84/0x100 [ 65.604332] do_el0_svc+0x34/0xa0 [ 65.604625] el0_svc+0x1c/0x50 [ 65.604897] el0t_64_sync_handler+0x88/0xb0 [ 65.605264] el0t_64_sync+0x16c/0x170 [ 65.605587] ---[ end trace 578e0ba07742170d ]--- Fixes: 8ac4564 ("[media] stk1160: Stop device and unqueue buffers when start_streaming() fails") Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Ezequiel Garcia <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
In remove_phb_dynamic() we use &phb->io_resource, after we've called device_unregister(&host_bridge->dev). But the unregister may have freed phb, because pcibios_free_controller_deferred() is the release function for the host_bridge. If there are no outstanding references when we call device_unregister() then phb will be freed out from under us. This has gone mainly unnoticed, but with slub_debug and page_poison enabled it can lead to a crash: PID: 7574 TASK: c0000000d492cb80 CPU: 13 COMMAND: "drmgr" #0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc #1 [c0000000e4f075d0] oops_end at c000000000029608 #2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4 #3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8 #4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30 Data SLB Access [380] exception frame: R0: c000000000167250 R1: c0000000e4f07a00 R2: c000000002a46100 R3: c000000002b39ce8 R4: 00000000000000c0 R5: 00000000000000a9 R6: 3894674d000000c0 R7: 0000000000000000 R8: 00000000000000ff R9: 0000000000000100 R10: 6b6b6b6b6b6b6b6b R11: 0000000000008000 R12: c00000000023da80 R13: c0000009ffd38b00 R14: 0000000000000000 R15: 000000011c87f0f0 R16: 0000000000000006 R17: 0000000000000003 R18: 0000000000000002 R19: 0000000000000004 R20: 0000000000000005 R21: 000000011c87ede8 R22: 000000011c87c5a8 R23: 000000011c87d3a0 R24: 0000000000000000 R25: 0000000000000001 R26: c0000000e4f07cc8 R27: c00000004d1cc400 R28: c0080000031d00e8 R29: c00000004d23d800 R30: c00000004d1d2400 R31: c00000004d1d2540 NIP: c000000000167258 MSR: 8000000000009033 OR3: c000000000e9f474 CTR: 0000000000000000 LR: c000000000167250 XER: 0000000020040003 CCR: 0000000024088420 MQ: 0000000000000000 DAR: 6b6b6b6b6b6b6ba3 DSISR: c0000000e4f07920 Syscall Result: fffffffffffffff2 [NIP : release_resource+56] [LR : release_resource+48] #5 [c0000000e4f07a00] release_resource at c000000000167258 (unreliable) #6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648 #7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io] #8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io] #9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c #10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504 #11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868 #12 [c0000000e4f07c70] new_sync_write at c00000000054339c #13 [c0000000e4f07d10] vfs_write at c000000000546624 #14 [c0000000e4f07d60] ksys_write at c0000000005469f4 #15 [c0000000e4f07db0] system_call_exception at c000000000030840 #16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168 To avoid it, we can take a reference to the host_bridge->dev until we're done using phb. Then when we drop the reference the phb will be freed. Fixes: 2dd9c11 ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)") Reported-by: David Dai <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Tested-by: Sachin Sant <[email protected]> Link: https://lore.kernel.org/r/[email protected]
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v3. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 16): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]>
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v3. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 16): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]>
Ido Schimmel says: ==================== net/sched: Better error reporting for offload failures This patchset improves error reporting to user space when offload fails during the flow action setup phase. That is, when failures occur in the actions themselves, even before calling device drivers. Requested / reported in [1]. This is done by passing extack to the offload_act_setup() callback and making use of it in the various actions. Patches #1-#2 change matchall and flower to log error messages to user space in accordance with the verbose flag. Patch #3 passes extack to the offload_act_setup() callback from the various call sites, including matchall and flower. Patches #4-#11 make use of extack in the various actions to report offload failures. Patch #12 adds an error message when the action does not support offload at all. Patches #13-#14 change matchall and flower to stop overwriting more specific error messages. [1] https://lore.kernel.org/netdev/20220317185249.5mff5u2x624pjewv@skbuf/ ==================== Signed-off-by: David S. Miller <[email protected]>
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v3. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 16): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]>
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v3. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 16): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]>
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v3. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 16): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]>
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 17): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 17): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
…ne() failed Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4. This series is the result of the discussion on the previous approach [2]. More information on the general COW issues can be found there. It is based on latest linus/master (post v5.17, with relevant core-MM changes for v5.18-rc1). This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken on an anonymous page and COW logic fails to detect exclusivity of the page to then replacing the anonymous page by a copy in the page table: The GUP pin lost synchronicity with the pages mapped into the page tables. This issue, including other related COW issues, has been summarized in [3] under 3): " 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) page_maybe_dma_pinned() is used to check if a page may be pinned for DMA (using FOLL_PIN instead of FOLL_GET). While false positives are tolerable, false negatives are problematic: pages that are pinned for DMA must not be added to the swapcache. If it happens, the (now pinned) page could be faulted back from the swapcache into page tables read-only. Future write-access would detect the pinning and COW the page, losing synchronicity. For the interested reader, this is nicely documented in feb889f ("mm: don't put pinned pages into the swap cache"). Peter reports [8] that page_maybe_dma_pinned() as used is racy in some cases and can result in a violation of the documented semantics: giving false negatives because of the race. There are cases where we call it without properly taking a per-process sequence lock, turning the usage of page_maybe_dma_pinned() racy. While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to handle, there is especially one rmap case (shrink_page_list) that's hard to fix: in the rmap world, we're not limited to a single process. The shrink_page_list() issue is really subtle. If we race with someone pinning a page, we can trigger the same issue as in the FOLL_GET case. See the detail section at the end of this mail on a discussion how bad this can bite us with VFIO or other FOLL_PIN user. It's harder to reproduce, but I managed to modify the O_DIRECT reproducer to use io_uring fixed buffers [15] instead, which ends up using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can similarly trigger a loss of synchronicity and consequently a memory corruption. Again, the root issue is that a write-fault on a page that has additional references results in a COW and thereby a loss of synchronicity and consequently a memory corruption if two parties believe they are referencing the same page. " This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, especially also taking care of concurrent pinning via GUP-fast, for example, also fully fixing an issue reported regarding NUMA balancing [4] recently. While doing that, it further reduces "unnecessary COWs", especially when we don't fork()/KSM and don't swapout, and fixes the COW security for hugetlb for FOLL_PIN. In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped anonymous page is exclusive. Exclusive anonymous pages that are mapped R/O can directly be mapped R/W by the COW logic in the write fault handler. Exclusive anonymous pages that want to be shared (fork(), KSM) first have to be marked shared -- which will fail if there are GUP pins on the page. GUP is only allowed to take a pin on anonymous pages that are exclusive. The PT lock is the primary mechanism to synchronize modifications of PG_anon_exclusive. We synchronize against GUP-fast either via the src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of the relevant page table entry. Special care has to be taken about swap, migration, and THPs (whereby a PMD-mapping can be converted to a PTE mapping and we have to track information for subpages). Besides these, we let the rmap code handle most magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however, it's now 100% mapcount free and I further simplified it a bit. #1 is a fix #3-#10 are mostly rmap preparations for PG_anon_exclusive handling #11 introduces PG_anon_exclusive #12 uses PG_anon_exclusive and make R/W pins of anonymous pages reliable #13 is a preparation for reliable R/O pins #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins make R/O pins of anonymous pages reliable #16 adds sanity check when (un)pinning anonymous pages [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lore.kernel.org/r/[email protected] [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 This patch (of 17): In case arch_unmap_one() fails, we already did a swap_duplicate(). let's undo that properly via swap_free(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ca827d5 ("mm, swap: Add infrastructure for saving page metadata on swap") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Don Dutile <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Change net device's MTU to smaller than IPV6_MIN_MTU or unregister device while matching route. That may trigger null-ptr-deref bug for ip6_ptr probability as following. ========================================================= BUG: KASAN: null-ptr-deref in find_match.part.0+0x70/0x134 Read of size 4 at addr 0000000000000308 by task ping6/263 CPU: 2 PID: 263 Comm: ping6 Not tainted 5.19.0-rc7+ #14 Call trace: dump_backtrace+0x1a8/0x230 show_stack+0x20/0x70 dump_stack_lvl+0x68/0x84 print_report+0xc4/0x120 kasan_report+0x84/0x120 __asan_load4+0x94/0xd0 find_match.part.0+0x70/0x134 __find_rr_leaf+0x408/0x470 fib6_table_lookup+0x264/0x540 ip6_pol_route+0xf4/0x260 ip6_pol_route_output+0x58/0x70 fib6_rule_lookup+0x1a8/0x330 ip6_route_output_flags_noref+0xd8/0x1a0 ip6_route_output_flags+0x58/0x160 ip6_dst_lookup_tail+0x5b4/0x85c ip6_dst_lookup_flow+0x98/0x120 rawv6_sendmsg+0x49c/0xc70 inet_sendmsg+0x68/0x94 Reproducer as following: Firstly, prepare conditions: $ip netns add ns1 $ip netns add ns2 $ip link add veth1 type veth peer name veth2 $ip link set veth1 netns ns1 $ip link set veth2 netns ns2 $ip netns exec ns1 ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1 $ip netns exec ns2 ip -6 addr add 2001:0db8:0:f101::2/64 dev veth2 $ip netns exec ns1 ifconfig veth1 up $ip netns exec ns2 ifconfig veth2 up $ip netns exec ns1 ip -6 route add 2000::/64 dev veth1 metric 1 $ip netns exec ns2 ip -6 route add 2001::/64 dev veth2 metric 1 Secondly, execute the following two commands in two ssh windows respectively: $ip netns exec ns1 sh $while true; do ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1; ip -6 route add 2000::/64 dev veth1 metric 1; ping6 2000::2; done $ip netns exec ns1 sh $while true; do ip link set veth1 mtu 1000; ip link set veth1 mtu 1500; sleep 5; done It is because ip6_ptr has been assigned to NULL in addrconf_ifdown() firstly, then ip6_ignore_linkdown() accesses ip6_ptr directly without NULL check. cpu0 cpu1 fib6_table_lookup __find_rr_leaf addrconf_notify [ NETDEV_CHANGEMTU ] addrconf_ifdown RCU_INIT_POINTER(dev->ip6_ptr, NULL) find_match ip6_ignore_linkdown So we can add NULL check for ip6_ptr before using in ip6_ignore_linkdown() to fix the null-ptr-deref bug. Fixes: dcd1f57 ("net/ipv6: Remove fib6_idev") Signed-off-by: Ziyang Xuan <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
ASAN reports an use-after-free in btf_dump_name_dups: ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928 READ of size 2 at 0xffff927006db thread T0 #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614) #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127 #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143 #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212 #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525 #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552 #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567 #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912 #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798 #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282 #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236 #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875 #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062 #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697 #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308 #15 0xaaaab5d65990 (test_progs+0x185990) 0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0) freed by thread T0 here: #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4) #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191 #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163 #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106 #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157 #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519 #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032 #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232 #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875 #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062 #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697 #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308 #12 0xaaaab5d65990 (test_progs+0x185990) previously allocated by thread T0 here: #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4) #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191 #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163 #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106 #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157 #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519 #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070 #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102 #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162 #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875 #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062 #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697 #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308 #13 0xaaaab5d65990 (test_progs+0x185990) The reason is that the key stored in hash table name_map is a string address, and the string memory is allocated by realloc() function, when the memory is resized by realloc() later, the old memory may be freed, so the address stored in name_map references to a freed memory, causing use-after-free. Fix it by storing duplicated string address in name_map. Fixes: 919d2b1 ("libbpf: Allow modification of BTF and add btf__add_str API") Signed-off-by: Xu Kuohai <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
I don't unfortunately have any logs (yet) but I am having regular WSL crashes since updating to v59 from (if memory serves) v56. I don't remember changing anything else. It always seems to happen when the host comes out of (deep) sleep. Very, very occasionally I had such issues (once every 1-2 months max) with previous kernels but now I'm getting them every 1-2 days. I am not having any issues, except when coming out of sleep on the host.
Is anyone else having these issues? Should I be looking somewhere other than the new v59 kernel (ie., I must have forgotten some other update)? If this is the right place, any pointers on what logs I should be checking?
Thanks!
The text was updated successfully, but these errors were encountered: