[New section] Low Latency Tuning Techniques #33

dendibakh · 2023-11-10T12:33:48Z

Author: Mark Dawson

dendibakh · 2023-11-11T02:43:21Z

chapters/13-Other-Tuning-Areas/13-4 Low-Latency-Tuning-Techniques.md

+More importantly, all that heap memory that was pre-faulted in the for-loop will persist in RAM due to the previous mlockall call – the option MCL_CURRENT locks all pages which are currently mapped, while MCL_FUTURE locks all pages that will become mapped in the future. An added benefit of using mlockall this way is that any thread spawned by this process will have its stack pre-faulted and locked, as well.
+
+These are just two toy example methods for preventing runtime minor faults. Similar techniques may be employed using alternative allocators (e.g., jemalloc, tcmalloc, mimalloc, etc.) or STL features (e.g., creative use of PMR allocators/memory_resources).
+
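A possible illustration for the paragraph above, written as a minimal sketch rather than the chapter's actual listing. The helper names `lock_all_memory` and `prefault` are hypothetical; only `mlockall`, `MCL_CURRENT`, and `MCL_FUTURE` come from the text. The sketch assumes Linux and a sufficiently high `RLIMIT_MEMLOCK`, otherwise the lock call reports failure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

// Lock all current and future mappings of this process into RAM.
// Returns false if the kernel refuses (e.g., RLIMIT_MEMLOCK is too low).
bool lock_all_memory() {
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    std::perror("mlockall");
    return false;
  }
  return true;
}

// Write one byte per page stride, plus the last byte, so that every page
// overlapping [buf, buf + size) gets backed by physical memory even when
// buf is not page-aligned. Returns the number of writes performed.
std::size_t prefault(char* buf, std::size_t size) {
  const long page = sysconf(_SC_PAGESIZE);
  assert(page > 0);
  volatile char* p = buf;  // volatile keeps the compiler from dropping the stores
  std::size_t writes = 0;
  for (std::size_t off = 0; off < size; off += static_cast<std::size_t>(page)) {
    p[off] = 0;
    ++writes;
  }
  if (size > 0) {
    p[size - 1] = 0;  // covers a trailing partial page
    ++writes;
  }
  return writes;
}
```

At startup one would call `lock_all_memory()` first and then `prefault()` on each buffer the hot path will use, so the touched pages stay resident for the life of the process.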


On Windows:
Lock pages with VirtualLock: https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtuallock
To avoid immediately releasing memory to the free page list, call VirtualFree with the MEM_DECOMMIT flag rather than MEM_RELEASE: https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualfree

dendibakh · 2023-11-11T03:49:02Z

chapters/13-Other-Tuning-Areas/13-4 Low-Latency-Tuning-Techniques.md

+Though a developer may avoid explicitly using these syscalls in his/her code, TLB Shootdowns may still erupt from external sources – e.g., allocator shared libraries or OS facilities. Not only will this type of IPI disrupt runtime application performance, but the magnitude of its impact grows with the number of threads involved since the interrupts are delivered in software.
+
+Preventing TLB Shootdowns requires limiting the number of updates made to the shared process address space – e.g., avoiding runtime execution of the aforementioned list of syscalls. Also, disable kernel features which induce TLB Shootdowns as a consequence of its function, such as Transparent Huge Pages and Automatic NUMA Balancing.
+


[ADD]: Because they relocate pages and/or alter page permissions in the course of doing their job, both of which require page table updates.
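To go with this, a small sketch of how an application could check at startup whether those two kernel features are active. It assumes the usual Linux locations `/sys/kernel/mm/transparent_hugepage/enabled` and `/proc/sys/kernel/numa_balancing`; the function names are hypothetical, and a missing file is treated as "feature not present".

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Extract the active choice from a sysfs line such as "always [madvise] never".
// Returns an empty string if no bracketed entry is present.
std::string active_choice(const std::string& line) {
  const auto open = line.find('[');
  if (open == std::string::npos) return "";
  const auto close = line.find(']', open);
  if (close == std::string::npos) return "";
  assert(close > open);
  return line.substr(open + 1, close - open - 1);
}

// Read the first line of a file; returns "" if the file cannot be opened.
std::string first_line(const char* path) {
  std::ifstream in(path);
  std::string line;
  std::getline(in, line);
  return line;
}

// True if Transparent Huge Pages are set to "never" (or absent) and
// Automatic NUMA Balancing is set to 0 (or absent).
bool shootdown_sources_disabled() {
  const std::string thp =
      active_choice(first_line("/sys/kernel/mm/transparent_hugepage/enabled"));
  const std::string numa = first_line("/proc/sys/kernel/numa_balancing");
  return (thp.empty() || thp == "never") && (numa.empty() || numa == "0");
}
```

A latency-sensitive service might log a warning when `shootdown_sources_disabled()` returns false instead of refusing to start.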

dendibakh · 2023-11-28T16:59:01Z

chapters/13-Other-Tuning-Areas/13-4 Low-Latency-Tuning-Techniques.md

+While the term contains the word “minor”, there’s nothing minor about the impact of minor page faults on runtime latency if, for example, you work in the HFT industry where every microsecond and nanosecond count. Latency impact of minor faults can range from just under a microsecond up to several microseconds, especially if you’re using a Linux kernel with 5-level page tables instead of 4-level page tables.
+
+How do you detect runtime minor page faults in your application? One simple way is by using the “top” utility (add the “-H” option for a thread-level view). Add the “vMn” field to the default selection of display columns to view the number of minor page faults occurring per display refresh interval. Another way involves attaching to the running process with “perf stat -e page-faults”. In the HFT world, anything more than ‘0’ is a problem. But for low latency applications in other business domains, a constant occurrence in the range of tens to hundreds of faults per interval should prompt further investigation.
+


of tens to hundreds of faults per interval

Per what interval? 1s/1m/1h?

dendibakh · 2023-11-28T17:04:39Z

chapters/13-Other-Tuning-Areas/13-4 Low-Latency-Tuning-Techniques.md

+
+How do you detect runtime minor page faults in your application? One simple way is by using the “top” utility (add the “-H” option for a thread-level view). Add the “vMn” field to the default selection of display columns to view the number of minor page faults occurring per display refresh interval. Another way involves attaching to the running process with “perf stat -e page-faults”. In the HFT world, anything more than ‘0’ is a problem. But for low latency applications in other business domains, a constant occurrence in the range of tens to hundreds of faults per interval should prompt further investigation.
+
+Investigating the root cause of runtime minor page faults can be as simple as firing up “perf record -e page-faults” and then “perf report” to locate offending source code lines.


Maybe add an example?
[For Denis]: would be a good candidate for perf-ninja.
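One candidate for such an example, offered as a sketch: a toy routine that triggers minor faults on purpose and counts them through `getrusage`, so the same binary can be run under "perf record -e page-faults" to see the touching loop show up in "perf report". The function names are hypothetical, and the counts depend on the kernel and on huge page settings.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

// Minor (soft) page faults taken by this process so far.
long minor_faults() {
  rusage ru{};
  const int rc = getrusage(RUSAGE_SELF, &ru);
  assert(rc == 0);
  (void)rc;
  return ru.ru_minflt;
}

// Touch one byte in every page of the buffer and return how many minor
// faults the process took while doing so.
long faults_while_touching(volatile char* buf, std::size_t size) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  const long before = minor_faults();
  for (std::size_t off = 0; off < size; off += page) buf[off] = 1;
  return minor_faults() - before;
}
```

Calling `faults_while_touching` twice on a freshly mmap-ed anonymous region typically shows the faults on the first (cold) pass and none on the second (warm) pass, which is the effect pre-faulting relies on.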

dendibakh · 2023-11-28T17:22:54Z

chapters/13-Other-Tuning-Areas/13-4 Low-Latency-Tuning-Techniques.md

+- Setting M_TRIM_THRESHOLD to ‘-1’ prevents glibc from returning memory to the OS after calls to free() (NOTE: as indicated before, this option has no effect on mmap-ed segments).
+- Finally, setting M_ARENA_MAX to ‘1’ prevents glibc from allocating multiple arenas via mmap() to accommodate multiple cores. CAUTION: the latter hinders the glibc allocator’s multithreaded scalability feature.
+
+Combined, these settings force glibc into sbrk-only heap allocations which will not release its memory back to the OS until the application ends. As a result, the heap will remain the same size after the final call to “free(mem)” in the code above. Any subsequent runtime calls to malloc() or new() simply will reuse space in this pre-allocated/pre-faulted heap area if it is sufficiently sized at initialization.


We need to explain the term sbrk-only heap allocations.
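A possible way to explain it: "sbrk-only" means glibc grows the heap solely by moving the program break (brk/sbrk) and never by creating separate mmap regions. The sketch below tries to make that visible by comparing the program break before an allocation and after the matching free. It assumes glibc on Linux; `M_MMAP_MAX` is my assumption for the earlier, not-shown setting that disables mmap-backed allocations, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <malloc.h>
#include <unistd.h>

// Ask glibc to serve all requests from the brk/sbrk heap and never shrink it.
// mallopt() returns 1 on success.
bool force_sbrk_only_heap() {
  bool ok = true;
  ok &= mallopt(M_MMAP_MAX, 0) == 1;         // assumption: no mmap-backed chunks
  ok &= mallopt(M_TRIM_THRESHOLD, -1) == 1;  // never return heap memory on free()
  ok &= mallopt(M_ARENA_MAX, 1) == 1;        // main arena only (hurts MT scaling)
  return ok;
}

// Allocate and free `size` bytes, then report how far the program break
// still sits above where it was before the allocation.
std::ptrdiff_t heap_growth_after_free(std::size_t size) {
  char* start = static_cast<char*>(sbrk(0));
  void* mem = std::malloc(size);
  assert(mem != nullptr);
  static_cast<volatile char*>(mem)[0] = 1;  // keep the allocation observable
  std::free(mem);
  char* end = static_cast<char*>(sbrk(0));
  return end - start;
}
```

With the three settings applied, the break should stay raised after `free()`, so later allocations reuse the same already-mapped heap instead of asking the OS again.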

dendibakh · 2023-11-28T19:25:11Z

chapters/13-Other-Tuning-Areas/13-4 Low-Latency-Tuning-Techniques.md


 Though a developer may avoid explicitly using these syscalls in his/her code, TLB Shootdowns may still erupt from external sources – e.g., allocator shared libraries or OS facilities. Not only will this type of IPI disrupt runtime application performance, but the magnitude of its impact grows with the number of threads involved since the interrupts are delivered in software.

-Preventing TLB Shootdowns requires limiting the number of updates made to the shared process address space – e.g., avoiding runtime execution of the aforementioned list of syscalls. Also, disable kernel features which induce TLB Shootdowns as a consequence of its function, such as Transparent Huge Pages and Automatic NUMA Balancing.
+How do you detect TLB Shootdowns in your multithreaded application? One simple way is to check the TLB row in /proc/interrupts. A useful method of detecting continuous TLB interrupts during runtime is to use the “watch” command while viewing this file. For example, you might run “watch -n5 -d ‘grep TLB /proc/interrupts’” – the “-n 5” option refreshes the view every 5 seconds while “-d” highlights the delta between each refresh output. 


Insert an example of a dump from /proc/interrupts similar to the one in https://www.jabperf.com/how-to-deter-or-disarm-tlb-shootdowns

                   CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
     45:          0          0          0          0          0          0          0          0   PCI-MSI-edge   eth0
     46:     192342          0          0          0          0          0          0          0   PCI-MSI-edge   ahci
     47:         14          0          0          0          0          0          0          0   PCI-MSI-edge   mei
    NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
    LOC:     552219    1010298    2272333    3179890    1445484    1226202    1800191    1894825   Local timer interrupts
    SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
    IWI:          0          0          0          0          0          0          0          0   IRQ work interrupts
    RTR:          7          0          0          0          0          0          0          0   APIC ICR read retries
    RES:      18708       9550        771        528        129        170        151        139   Rescheduling interrupts
    CAL:        711        934       1312       1261       1446       1411       1433       1432   Function call interrupts
    TLB:       4493       6108      73789       5014       1788       2327       1967        914   TLB shootdowns

Also, one can check the profile (perf report) for kernel functions responsible for handling TLB shootdowns, e.g. native_flush_tlb_single and its callees.
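For readers who prefer to monitor this from inside the application rather than with "watch", here is a rough sketch that parses the TLB row of /proc/interrupts and computes per-CPU deltas between two samples. The function names are hypothetical, and the row layout is assumed to match the dump above (a "TLB:" label, one counter per CPU, then a text description).

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Parse one /proc/interrupts row, e.g. "TLB: 4493 6108 ... TLB shootdowns",
// into per-CPU counters. Returns an empty vector if the row is not "TLB:".
std::vector<std::uint64_t> parse_tlb_row(const std::string& row) {
  std::istringstream in(row);
  std::string label;
  std::vector<std::uint64_t> counts;
  if (!(in >> label) || label != "TLB:") return counts;
  std::uint64_t value;
  while (in >> value) counts.push_back(value);  // stops at the text description
  return counts;
}

// Read the current per-CPU TLB shootdown counters; empty if unavailable.
std::vector<std::uint64_t> read_tlb_counts() {
  std::ifstream in("/proc/interrupts");
  std::string line;
  while (std::getline(in, line)) {
    auto counts = parse_tlb_row(line);
    if (!counts.empty()) return counts;
  }
  return {};
}

// Per-CPU difference between two samples taken some interval apart.
std::vector<std::uint64_t> delta(const std::vector<std::uint64_t>& before,
                                 const std::vector<std::uint64_t>& after) {
  assert(before.size() == after.size());
  std::vector<std::uint64_t> out(before.size());
  for (std::size_t i = 0; i < before.size(); ++i) out[i] = after[i] - before[i];
  return out;
}
```

Sampling `read_tlb_counts()` every few seconds and passing consecutive samples to `delta` gives the same information that "watch -d" highlights, in a form that can be logged or alerted on.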

dendibakh · 2023-11-30T18:35:18Z

Reviewed offline with Mark.

dbakhval added 2 commits November 10, 2023 07:01

[New section] Low Latency Tuning Techniques

0d1f2df

v3

be295bd

v5

15ff5fd

dbakhval added 4 commits November 29, 2023 12:55

Cosmetics. part1

d6936a3

Cosmetics. part2

9fd6ad3

Fixed Mark's comments

222f1a4

Added link

854b170

dendibakh merged commit 1e0f50c into main Nov 30, 2023
2 checks passed


