Skip to content

Commit

Permalink
Merge uvite into amagas
Browse files Browse the repository at this point in the history
  • Loading branch information
forkposix committed Apr 3, 2024
2 parents 063bddb + b50261e commit f413d36
Show file tree
Hide file tree
Showing 210 changed files with 11,886 additions and 4,921 deletions.
19 changes: 7 additions & 12 deletions Documentation/admin-guide/cgroup-v1/memory.rst
Original file line number Diff line number Diff line change
Expand Up @@ -201,11 +201,11 @@ An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is not accounted until it's mapped.
A swapped-in page is accounted after adding into swapcache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
This means swapped-in pages may contain pages for other tasks than a task
causing page fault. So, we avoid accounting at swap-in I/O.
Since page's memcg recorded into swap whatever memsw enabled, the page will
be accounted after swapin.

At page migration, accounting information is kept.

Expand All @@ -224,18 +224,13 @@ the cgroup that brought it in -- this will happen on memory pressure).
But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

Exception: If CONFIG_MEMCG_SWAP is not used.
When you do swapoff and make swapped-out pages of shmem(tmpfs) to
be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.

2.4 Swap Extension (CONFIG_MEMCG_SWAP)
2.4 Swap Extension
--------------------------------------

Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to original page allocator if possible.
Swap usage is always recorded for each of cgroup. Swap Extension allows you to
read and limit it.

When swap is accounted, following files are added.
When CONFIG_SWAP is enabled, following files are added.

- memory.memsw.usage_in_bytes.
- memory.memsw.limit_in_bytes.
Expand Down
11 changes: 11 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -186,6 +186,17 @@ cgroup v2 currently supports the following mount options.
modified through remount from the init namespace. The mount
option is ignored on non-init namespace mounts.

memory_recursiveprot

Recursively apply memory.min and memory.low protection to
entire subtrees, without requiring explicit downward
propagation into leaf cgroups. This allows protecting entire
subtrees from one another, while retaining free competition
within those subtrees. This should have been the default
behavior but is a mount-option to avoid regressing setups
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).


Organizing Processes and Threads
--------------------------------
Expand Down
1 change: 1 addition & 0 deletions Documentation/admin-guide/mm/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
multigen_lru
numa_memory_policy
numaperf
pagemap
Expand Down
162 changes: 162 additions & 0 deletions Documentation/admin-guide/mm/multigen_lru.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
.. SPDX-License-Identifier: GPL-2.0
=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
batches, when MMU sets it (e.g., on x86). This behavior can
theoretically worsen lock contention (mmap_lock). If it is
disabled, the multi-gen LRU will suffer a minor performance
degradation for workloads that contiguously map hot pages,
whose accessed bits can be otherwise cleared by fewer larger
batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
well, when MMU sets it (e.g., on x86). This behavior was not
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, so does
concatenation with delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilizations.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

memcg memcg_id memcg_path
node node_id
min_gen_nr age_in_ms nr_anon_pages nr_file_pages
...
max_gen_nr age_in_ms nr_anon_pages nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impacts on the
existing jobs.
21 changes: 21 additions & 0 deletions Documentation/admin-guide/mm/transhuge.rst
Original file line number Diff line number Diff line change
Expand Up @@ -220,6 +220,13 @@ memory. A lower value can prevent THPs from being
collapsed, resulting fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding the number would block the collapse::

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

Expand Down Expand Up @@ -310,6 +317,11 @@ thp_fault_fallback
is incremented if a page fault fails to allocate
a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
is incremented if a page fault fails to charge a huge page and
instead falls back to using small pages even though the
allocation was successful.

thp_collapse_alloc_failed
is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
Expand All @@ -319,6 +331,15 @@ thp_file_alloc
is incremented every time a file huge page is successfully
allocated.

thp_file_fallback
is incremented if a file huge page is attempted to be allocated
but fails and instead falls back to using small pages.

thp_file_fallback_charge
is incremented if a file huge page cannot be charged and instead
falls back to using small pages even though the allocation was
successful.

thp_file_mapped
is incremented every time a file huge page is mapped into
user address space.
Expand Down
Loading

0 comments on commit f413d36

Please sign in to comment.