6.5: mm patches #2

Merged: 8 commits, Sep 23, 2023
Commits on Sep 23, 2023

  1. mm: Disable watermark boosting by default

    What watermark boosting does is preemptively fire up kswapd to free
    memory when there hasn't been an allocation failure. It does this by
    increasing kswapd's high watermark goal and then firing up kswapd. The
    reason why this causes freezes is because, with the increased high
    watermark goal, kswapd will steal memory from processes that need it in
    order to make forward progress. These processes will, in turn, try to
    allocate memory again, which will cause kswapd to steal necessary pages
    from those processes again, in a positive feedback loop known as page
    thrashing. When page thrashing occurs, your system is essentially
    livelocked until the necessary forward progress can be made to stop
    processes from trying to continuously allocate memory and trigger
    kswapd to steal it back.
    
    This problem already occurs with kswapd *without* watermark boosting,
    but it's usually only encountered on machines with a small amount of
    memory and/or a slow CPU. Watermark boosting just makes the existing
    problem bad enough to notice on higher-spec machines.
    
    Disable watermark boosting by default since it's a total dumpster fire.
    I can't imagine why anyone would want to explicitly enable it, but the
    option is there in case someone does.
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit 406ce3e)
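
    The boost behaviour is driven by the vm.watermark_boost_factor tunable in
    mm/page_alloc.c. A minimal sketch of what disabling it by default likely
    amounts to (not the verbatim patch):

    /*
     * Sketch only: zeroing the default leaves watermark boosting off unless
     * an admin re-enables it at runtime via
     * /proc/sys/vm/watermark_boost_factor.
     */
    int watermark_boost_factor __read_mostly = 0;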
  2. mm: Stop kswapd early when nothing's waiting for it to free pages

    Keeping kswapd running when all the failed allocations that invoked it
    are satisfied incurs a high overhead due to unnecessary page eviction
    and writeback, as well as spurious VM pressure events to various
    registered shrinkers. When kswapd doesn't need to work to make an
    allocation succeed anymore, stop it prematurely to save resources.
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit 5d27c2a)
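
    A minimal sketch of the mechanism (the counter and helper names below are
    illustrative assumptions, not the actual patch): the allocation slow path
    registers itself as a waiter while kswapd works on its behalf, and kswapd
    checks that count between reclaim passes so it can stop as soon as nobody
    is waiting.

    /* Illustrative sketch only; kswapd_waiters is a hypothetical counter. */
    static atomic_long_t kswapd_waiters = ATOMIC_LONG_INIT(0);

    /* Allocation slow path: count ourselves as a waiter around the retry loop. */
    static inline void kswapd_waiter_start(void)
    {
    	atomic_long_inc(&kswapd_waiters);
    }

    static inline void kswapd_waiter_done(void)
    {
    	atomic_long_dec(&kswapd_waiters);
    }

    /* Checked by kswapd between reclaim passes: stop early if nobody waits. */
    static inline bool kswapd_work_pending(void)
    {
    	return atomic_long_read(&kswapd_waiters) != 0;
    }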
  3. mm: Don't stop kswapd on a per-node basis when there are no waiters

    The page allocator wakes all kswapds in an allocation context's allowed
    nodemask in the slow path, so it doesn't make sense to keep a kswapd-waiter
    count per NUMA node. Instead, it should be a global counter
    to stop all kswapds when there are no failed allocation requests.
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit e7ce283)
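
    Concretely, the waiter count becomes a single global instead of a field in
    each node's pglist_data; a sketch (field and variable names illustrative):

    /*
     * Sketch: rather than a per-node pgdat->kswapd_waiters field, keep one
     * global counter. The slow path wakes every kswapd in the allowed
     * nodemask anyway, so they can all consult the same count and stop
     * together when it reaches zero.
     */
    static atomic_long_t kswapd_waiters = ATOMIC_LONG_INIT(0);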
  4. mm: Increment kswapd_waiters for throttled direct reclaimers

    Throttled direct reclaimers will wake up kswapd and wait for kswapd to
    satisfy their page allocation request, even when the failed allocation
    lacks the __GFP_KSWAPD_RECLAIM flag in its gfp mask. As a result, kswapd
    may think that there are no waiters and thus exit prematurely, causing
    throttled direct reclaimers lacking __GFP_KSWAPD_RECLAIM to stall on
    waiting for kswapd to wake them up. Incrementing the kswapd_waiters
    counter when such direct reclaimers become throttled fixes the problem.
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit 5dbfdf0)
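
    A sketch of the accounting (the wait queue and helper mirror mm/vmscan.c's
    throttle path, but the exact wiring shown here is an assumption): count the
    throttled task as a kswapd waiter for the duration of its sleep, regardless
    of __GFP_KSWAPD_RECLAIM.

    /* Sketch only: a throttled direct reclaimer sleeps on the node's
     * pfmemalloc_wait queue until allow_direct_reclaim() lets it proceed.
     * Holding a waiter reference across that sleep keeps kswapd alive. */
    atomic_long_inc(&kswapd_waiters);
    wait_event_killable(pgdat->pfmemalloc_wait, allow_direct_reclaim(pgdat));
    atomic_long_dec(&kswapd_waiters);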
  5. mm: Disable proactive compaction by default

    On-demand compaction works fine unless the page allocator is being spammed
    nonstop with large-order allocations, so there is little reason to run
    proactive compaction in the background by default.
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit 2e3cb1a)
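
    Proactive compaction is governed by the vm.compaction_proactiveness sysctl
    (0-100); disabling it by default presumably comes down to zeroing that
    tunable's initial value (sketch, not the verbatim patch):

    /*
     * Sketch: with a default of 0, only on-demand compaction runs unless an
     * admin opts back in via /proc/sys/vm/compaction_proactiveness.
     */
    unsigned int __read_mostly sysctl_compaction_proactiveness = 0;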
  6. mm: Don't hog the CPU and zone lock in rmqueue_bulk()

    There is noticeable scheduling latency and heavy zone lock contention
    stemming from rmqueue_bulk's single hold of the zone lock while doing
    its work, as seen with the preemptoff tracer. There's no actual need for
    rmqueue_bulk() to hold the zone lock the entire time; it only does so
    for supposed efficiency. As such, we can relax the zone lock and even
    reschedule when IRQs are enabled in order to keep the scheduling delays
    and zone lock contention at bay. Forward progress is still guaranteed,
    as the zone lock can only be relaxed after page removal.
    
    With this change, rmqueue_bulk() no longer appears as a serious offender
    in the preemptoff tracer, and system latency is noticeably improved.
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit 8253fa2)
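
    The pattern amounts to taking zone->lock per page removal instead of across
    the whole batch; a simplified sketch of the loop (IRQ-flag handling and
    statistics updates elided, not the actual diff):

    /* Simplified sketch of rmqueue_bulk()'s inner loop with the lock relaxed
     * on every iteration. Forward progress is preserved because each page is
     * already off the free list before the lock is dropped. */
    for (i = 0; i < count; i++) {
    	struct page *page;

    	spin_lock(&zone->lock);
    	page = __rmqueue(zone, order, migratetype, alloc_flags);
    	spin_unlock(&zone->lock);
    	if (!page)
    		break;

    	list_add_tail(&page->pcp_list, list);

    	/* Let waiting tasks run when we're not in a no-preempt context. */
    	if (!irqs_disabled())
    		cond_resched();
    }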
  7. scatterlist: Don't allocate sg lists using __get_free_page

    Allocating pages with __get_free_page is slower than going through the
    slab allocator to grab free pages from a pool.
    
    These are the results from running the code at the bottom of this
    message:
    [    1.278602] speedtest: __get_free_page: 9 us
    [    1.278606] speedtest: kmalloc: 4 us
    [    1.278609] speedtest: kmem_cache_alloc: 4 us
    [    1.278611] speedtest: vmalloc: 13 us
    
    kmalloc and kmem_cache_alloc (which is what kmalloc uses for common
    sizes behind the scenes) are the fastest choices. Use kmalloc to speed
    up sg list allocation.
    
    This is the code used to produce the above measurements:
    
    static int speedtest(void *data)
    {
    	static const struct sched_param sched_max_rt_prio = {
    		.sched_priority = MAX_RT_PRIO - 1
    	};
    	volatile s64 ctotal = 0, gtotal = 0, ktotal = 0, vtotal = 0;
    	struct kmem_cache *page_pool;
    	int i, j, trials = 1000;
    	volatile ktime_t start;
    	void *ptr[100];
    
    	sched_setscheduler_nocheck(current, SCHED_FIFO, &sched_max_rt_prio);
    
    	page_pool = kmem_cache_create("pages", PAGE_SIZE, PAGE_SIZE, SLAB_PANIC,
    				      NULL);
    	for (i = 0; i < trials; i++) {
    		start = ktime_get();
    		for (j = 0; j < ARRAY_SIZE(ptr); j++)
    			while (!(ptr[j] = kmem_cache_alloc(page_pool, GFP_KERNEL)));
    		ctotal += ktime_us_delta(ktime_get(), start);
    		for (j = 0; j < ARRAY_SIZE(ptr); j++)
    			kmem_cache_free(page_pool, ptr[j]);
    
    		start = ktime_get();
    		for (j = 0; j < ARRAY_SIZE(ptr); j++)
    			while (!(ptr[j] = (void *)__get_free_page(GFP_KERNEL)));
    		gtotal += ktime_us_delta(ktime_get(), start);
    		for (j = 0; j < ARRAY_SIZE(ptr); j++)
    			free_page((unsigned long)ptr[j]);
    
    		start = ktime_get();
    		for (j = 0; j < ARRAY_SIZE(ptr); j++)
    			while (!(ptr[j] = __kmalloc(PAGE_SIZE, GFP_KERNEL)));
    		ktotal += ktime_us_delta(ktime_get(), start);
    		for (j = 0; j < ARRAY_SIZE(ptr); j++)
    			kfree(ptr[j]);
    
    		start = ktime_get();
    		*ptr = vmalloc(ARRAY_SIZE(ptr) * PAGE_SIZE);
    		vtotal += ktime_us_delta(ktime_get(), start);
    		vfree(*ptr);
    	}
    	kmem_cache_destroy(page_pool);
    
    	printk("%s: __get_free_page: %lld us\n", __func__, gtotal / trials);
    	printk("%s: __kmalloc: %lld us\n", __func__, ktotal / trials);
    	printk("%s: kmem_cache_alloc: %lld us\n", __func__, ctotal / trials);
    	printk("%s: vmalloc: %lld us\n", __func__, vtotal / trials);
    	complete(data);
    	return 0;
    }
    
    static int __init start_test(void)
    {
    	DECLARE_COMPLETION_ONSTACK(done);
    
    	BUG_ON(IS_ERR(kthread_run(speedtest, &done, "malloc_test")));
    	wait_for_completion(&done);
    	return 0;
    }
    late_initcall(start_test);
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit e22ced7)
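
    In lib/scatterlist.c the allocation is wrapped by sg_kmalloc()/sg_kfree();
    the change presumably routes the full-page case through the slab allocator
    as well, roughly like this (sketch, not the verbatim diff):

    /*
     * Sketch: sg_kmalloc() historically special-cases a full-page allocation
     * (SG_MAX_SINGLE_ALLOC entries) through __get_free_page(). Using
     * kmalloc_array() for every size goes through the slab caches instead,
     * which the benchmark above shows to be roughly twice as fast here.
     */
    static struct scatterlist *sg_kmalloc(unsigned int nents, gfp_t gfp_mask)
    {
    	return kmalloc_array(nents, sizeof(struct scatterlist), gfp_mask);
    }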
  8. mm: Omit RCU read lock in list_lru_count_one() when RCU isn't needed

    The RCU read lock isn't necessary in list_lru_count_one() when the
    condition that requires RCU (CONFIG_MEMCG && !CONFIG_SLOB) isn't met.
    The highly frequent RCU lock and unlock add measurable overhead to the
    shrink_slab() path when they aren't needed. As such, we can simply omit the
    RCU read lock in this case to improve performance.
    
    Signed-off-by: Sultan Alsawaf <[email protected]>
    Signed-off-by: Kazuki Hashimoto <[email protected]>
    kerneltoast authored and Rasenkai committed Sep 23, 2023 (commit 132f9fd)
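
    The shape of the change, following the condition named above, is roughly
    (a sketch mirroring mm/list_lru.c's list_lru_count_one(), not the exact
    patch):

    unsigned long list_lru_count_one(struct list_lru *lru, int nid,
    				 struct mem_cgroup *memcg)
    {
    	struct list_lru_one *l;
    	long count;

    #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
    	/* Only the memcg-aware lookup dereferences RCU-protected data. */
    	rcu_read_lock();
    #endif
    	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
    	count = l ? READ_ONCE(l->nr_items) : 0;
    #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
    	rcu_read_unlock();
    #endif

    	if (unlikely(count < 0))
    		count = 0;

    	return count;
    }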