[pull] dev from jemalloc:dev #103

pull · 2023-06-23T19:10:56Z

See Commits and Changes for more details.

Can you help keep this open source service alive? 💖 Please sponsor : )

Fix or suppress the remaining warnings generated by static analysis. This is a necessary step before we can incorporate static analysis into CI. Where possible, I've preferred to modify the code itself instead of just disabling the warning with a magic comment, so that if we decide to use different static analysis tools in the future we will be covered against them raising similar warnings.

Now that all of the various issues that static analysis uncovered have been fixed (#2431, #2432, #2433, #2436, #2437, #2446), I've added a GitHub action which will run static analysis for every PR going forward. When static analysis detects issues with your code, the GitHub action provides a link to download its findings in a form tailored for human consumption. Take a look at [this demonstration of what it looks like when static analysis issues are found](https://github.com/Svetlitski/jemalloc/actions/runs/5010245602) on my fork for an example (make sure to follow the instructions in the error message to download and inspect the results).

Additionally, added a GitHub Action to ensure no more trailing whitespace will creep in again in the future. I'm excluding Markdown files from this check, since trailing whitespace is significant there, and also excluding `build-aux/install-sh` because there is significant trailing whitespace on the line that sets `defaultIFS`.

It turns out LLVM does not include a build for every platform in the assets for every release, just some of them. As such, I've pinned us to the latest release version with a corresponding build.

We have observed new workload patterns (namely ML training type) that cycle through oversized allocations frequently, because 1) the dataset might be sparse which is faster to go through, and 2) GPU accelerated. As a result, the eager purging from the oversize arena becomes a bottleneck. To offer an easy solution, allow normal purging of the oversized extents when background threads are enabled.

Previously, small allocations which were sampled as part of heap profiling were rounded up to `SC_LARGE_MINCLASS`. This additional memory usage becomes problematic when the page size is increased, as noted in #2358. Small allocations are now rounded up to the nearest multiple of `PAGE` instead, reducing the memory overhead by a factor of 4 in the most extreme cases.

Validate that small allocations (i.e. those with `size <= SC_SMALL_MAXCLASS`) which are sampled for profiling maintain the expected invariants even though they now take up less space.

For better or worse, Jemalloc has a significant number of global variables. Making all eligible global variables `static` and/or `const` at least makes it slightly easier to reason about them, as these qualifications communicate to the programmer restrictions on their use without having to `grep` the whole codebase.

Adding `-Wstrict-prototypes` to the default `CFLAGS` in PR #2473 had the non-obvious side-effect of breaking configure-time feature detection, because the [test-program `autoconf` generates for feature detection](https://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/Generating-Sources.html#:~:text=main%20()) defines `main` as: ```c int main() ``` Which causes all feature checks to fail, since this triggers `-Wstrict-prototypes` and the feature checks use `-Werror`. Resolved by only adding `-Wstrict-prototypes` to `EXTRA_{CFLAGS,CXXFLAGS}` in CI, since these flags are not used during feature detection and we control which compiler is used.

For the sake of consistency, function definitions and their corresponding declarations should use the same names for parameters. I've enabled this check in static analysis to prevent this issue from occurring again in the future.

When stderr is a terminal and supports color, print error messages from tests in red to make them stand out from the surrounding output.

This makes it faster and easier to debug, so that you don't need to fire up a debugger just to see which assertion triggered in a failing test.

At least for LLVM, [casting from an integer to a pointer hides provenance information](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) and inhibits optimizations. Here's a [Godbolt link](https://godbolt.org/z/5bYPcKoWT) showing how this change removes a couple unnecessary branches in `phn_merge_siblings`, which is a very hot function. Canary profiles show only minor improvements (since most of the cost of this function is in cache misses), but there's no reason we shouldn't take it.

This is a prerequisite to achieving self-contained headers. Previously, the various tsd implementation headers (`tsd_generic.h`, `tsd_tls.h`, `tsd_malloc_thread_cleanup.h`, and `tsd_win.h`) relied implicitly on being included in `tsd.h` after a variety of dependencies had been defined above them. This commit instead makes these dependencies explicit by splitting them out into a separate file, `tsd_internals.h`, which each of the tsd implementation headers includes directly.

Header files are now self-contained, which makes the relationships between the files clearer, and crucially allows LSP tools like `clangd` to function correctly in all of our header files. I have verified that the headers are self-contained (aside from the various Windows shims) by compiling them as if they were C files – in a follow-up commit I plan to add this to CI to ensure we don't regress on this front.

[N2699 - Sized Memory Deallocation](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm) introduced two new functions which were incorporated into the C23 standard, `free_sized` and `free_aligned_sized`. Both already have analogues in Jemalloc, all we are doing here is adding the appropriate wrappers.

These warnings are not useful, and make the output of some CI jobs enormous and difficult to read, so let's suppress them.

This makes the code more readable on its own, and also sets the stage for more cleanly handling the pointer provenance lints in a following commit.

Following from PR #2481, we replace all integer-to-pointer casts [which hide pointer provenance information (and thus inhibit optimizations)](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) with equivalent operations that preserve this information. I have enabled the corresponding clang-tidy check in our static analysis CI so that we do not get bitten by this again in the future.

In an attempt to make all headers self-contained, I inadvertently added `#include`s which refer to intermediate, generated headers that aren't included in the final install. Closes #2489.

While this function isn't particularly hot, (accounting for just 0.27% of time spent inside the allocator on average across the fleet), looking at the generated assembly and performance profiles does show we're dispatching to multiple different `memset`s when we could instead be just tail-calling `memset` once, reducing code size and marginally improving performance.

…l_overrides.h`

…set` Sampled allocations were not being demoted before being deallocated during an `arena_reset` operation.

Previously the option causing trouble will not be printed, unless the option key:value pair format is found.

pthread_cond_wait drops and re-acquires the mutex internally, w/o going through our wrapper. Update the locked state explicitly.

On MSVC, log is an intrinsic that doesn't require libm. However, AC_SEARCH_LIBS does not successfully detect this, as it will try to compile a program using the wrong signature for log. Newer versions of MSVC CL detects this and rejects the program with the following messages: conftest.c(40): warning C4391: 'char log()': incorrect return type for intrinsic function, expected 'double' conftest.c(44): error C2168: 'log': too few actual parameters for intrinsic function Since log is always available on MSVC (it's been around since the dawn of time), we simply always assume it's there if MSVC is detected.

During boot, some mutexes are not initialized yet, plus there's no point taking many mutexes while everything is covered by the global init lock, so the locking assumptions in some functions (e.g. background_thread_enabled_set()) can't be enforced. Skip the lock owner check in this case.

Option `experimental_hpa_strict_min_purge_interval` was expected to be temporary to simplify rollout of a bugfix. Now, when bugfix rollout is complete it is safe to remove this option.

It has been disabled since 5.2.0 (in #1421).

Prefer clock_gettime_nsec_np(CLOCK_UPTIME_RAW) to mach_absolute_time().

Milliseconds are used a lot in hpa, so it is convenient to have `nstime_ms_since` function instead of dividing to `MILLION` constantly. For consistency renamed `nstime_msec` to `nstime_ms` as `ms` abbreviation is used much more commonly across codebase than `msec`. ``` $ grep -Rn '_msec' include src | wc -l 2 $ grep -RPn '_ms( |,|:)' include src | wc -l 72 ``` Function `nstime_msec` wasn't used anywhere in the code yet.

Make multiple functions from `stats_arena_hpa_shard_print` for readability and ease of change in the future.

Linux 6.1 introduced `MADV_COLLAPSE` flag to perform a best-effort synchronous collapse of the native pages mapped by the memory range into transparent huge pages. Synchronous hugification might be beneficial for at least two reasons: we are not relying on khugepaged anymore and get an instant feedback if range wasn't hugified. If `hpa_hugify_sync` option is on, we'll try to perform synchronously collapse and if it wasn't successful, we'll fallback to asynchronous behaviour.

Config validation was introduced at 3aae792 with main intention to fix infinite purging loop, but it didn't actually fix the underlying problem, just masked it. Later 47d69b4 was merged to address the same problem. Options `hpa_dirty_mult` and `hpa_hugification_threshold` have different application dimensions: `hpa_dirty_mult` applied to active memory on the shard, but `hpa_hugification_threshold` is a threshold for single pageslab (hugepage). It doesn't make much sense to sum them up together. While it is true that too high value of `hpa_dirty_mult` and too low value of `hpa_hugification_threshold` can lead to pathological behaviour, it is true for other options as well. Poor configurations might lead to suboptimal and sometimes completely unacceptable behaviour and that's OK, that is exactly the reason why they are called poor. There are other mechanism exist to prevent extreme behaviour, when we hugified and then immediately purged page, see `hpa_hugify_blocked_by_ndirty` function, which exist to prevent exactly this case. Lastly, `hpa_dirty_mult + hpa_hugification_threshold >= 1` constraint is too tight and prevents a lot of valid configurations.

When evaluating changes in HPA logic, it is useful to know internal `hpa_shard` state. Great deal of this state is `psset`. Some of the `psset` stats was available, but in disaggregated form, which is not very convenient. This commit exposed `psset` counters to `mallctl` and malloc stats dumps. Example of how malloc stats dump will look like after the change. HPA shard stats: Pageslabs: 14899 (4354 huge, 10545 nonhuge) Active pages: 6708166 (2228917 huge, 4479249 nonhuge) Dirty pages: 233816 (331 huge, 233485 nonhuge) Retained pages: 686306 Purge passes: 8730 (10 / sec) Purges: 127501 (146 / sec) Hugeifies: 4358 (5 / sec) Dehugifies: 4 (0 / sec) Pageslabs, active pages, dirty pages and retained pages are rows added by this change.

We are trying to create `ncpus * 2` threads for this test and place them into `VARIABLE_ARRAY`, but `VARIABLE_ARRAY` can not be more than `VARIABLE_ARRAY_SIZE_MAX` bytes. When there are a lot of threads on the box test always fails. ``` $ nproc 176 $ make -j`nproc` tests_unit && ./test/unit/retained <jemalloc>: ../test/unit/retained.c:123: Failed assertion: "sizeof(thd_t) * (nthreads) <= VARIABLE_ARRAY_SIZE_MAX" Aborted (core dumped) ``` There is no need for high concurrency for this test as we are only checking stats there and it's behaviour is quite stable regarding number of allocating threads. Limited number of threads to 16 to save compute resources (on CI for example) and reduce tests running time. Before the change (`nproc` is 80 on this box). ``` $ make -j`nproc` tests_unit && time ./test/unit/retained <...> real 0m0.372s user 0m14.236s sys 0m12.338s ``` After the change (same box). ``` $ make -j`nproc` tests_unit && time ./test/unit/retained <...> real 0m0.018s user 0m0.108s sys 0m0.068s ```

Taken from https://android-review.git.corp.google.com/c/platform/external/jemalloc_new/+/3316478 This might need more cleanups to remove the definition of JEMALLOC_INTERNAL_UNREACHABLE.

The gettid() function is available on Linux in glibc only since version 2.30. There are supported distributions that still use older glibc version. Thus add a configure check if the gettid() function is available and extend the check in src/prof_stack_range.c so it's skipped also when gettid() isn't available. Fixes: #2740

Svetlitski added 3 commits June 23, 2023 11:50

pull bot added the ⤵️ pull label Jun 23, 2023

Svetlitski and others added 26 commits June 23, 2023 14:30

Fix downloading LLVM in GitHub Action

46e464a

It turns out LLVM does not include a build for every platform in the assets for every release, just some of them. As such, I've pinned us to the latest release version with a corresponding build.

Address compiler warnings in the unit tests

e133870

Add a test-case for small profiled allocations

ebd7e99

Validate that small allocations (i.e. those with `size <= SC_SMALL_MAXCLASS`) which are sampled for profiling maintain the expected invariants even though they now take up less space.

Enabled -Wstrict-prototypes and fixed warnings.

602edd7

Remove unreachable code.

e249d1a

Print test error messages in color when stderr is a terminal

65d3b59

When stderr is a terminal and supports color, print error messages from tests in red to make them stand out from the surrounding output.

Print the failed assertion before aborting in test cases

314c073

This makes it faster and easier to debug, so that you don't need to fire up a debugger just to see which assertion triggered in a failing test.

Suppress verbose frame address warnings

c49c17f

These warnings are not useful, and make the output of some CI jobs enormous and difficult to read, so let's suppress them.

Define PROF_TCTX_SENTINEL instead of using magic numbers

7e54dd1

This makes the code more readable on its own, and also sets the stage for more cleanly handling the pointer provenance lints in a following commit.

Define SBRK_INVALID instead of using a magic number

1431153

Remove vestigial TCACHE_STATE_* macros

4827bb1

Remove errant #includes in public jemalloc.h header

8ff7e7d

In an attempt to make all headers self-contained, I inadvertently added `#include`s which refer to intermediate, generated headers that aren't included in the final install. Closes #2489.

Add an override for the compile-time malloc_conf to `jemalloc_interna…

b01d496

…l_overrides.h`

Ensured sampled allocations are properly deallocated during `arena_re…

62648c8

…set` Sampled allocations were not being demoted before being deallocated during an `arena_reset` operation.

Include the unrecognized malloc conf option in the error message.

6816b23

Previously the option causing trouble will not be printed, unless the option key:value pair format is found.

interwq and others added 30 commits September 20, 2024 16:56

Fix the locked flag for malloc_mutex_trylock().

661fb1e

Fix mutex state tracking around pthread_cond_wait().

3eb7a4b

pthread_cond_wait drops and re-acquires the mutex internally, w/o going through our wrapper. Update the locked state explicitly.

Add malloc_mutex_is_locked() sanity checks.

1960536

Fix a missing init value warning caught by static analysis.

de5606d

Optimize edata_cmp_summary_compare when __uint128_t is available

0181aaa

Assert the mutex is locked within malloc_mutex_assert_owner().

6cc4217

Remove strict_min_purge_interval option

4f4fd42

Option `experimental_hpa_strict_min_purge_interval` was expected to be temporary to simplify rollout of a bugfix. Now, when bugfix rollout is complete it is safe to remove this option.

Do not support hpa if HUGEPAGE is too large.

1c90008

Use MSVC __declspec(thread) for TSD on Windows

3a0d9cd

Add safe frame-pointer backtrace unwinder

edc1576

Update doc to reflect muzzy decay is disabled by default.

8c2b8bc

It has been disabled since 5.2.0 (in #1421).

Update the configure cache file example in INSTALL.md

02251c0

Updated jeprof with more symbols to filter.

397827a

Add support for clock_gettime_nsec_np()

6d625d5

Prefer clock_gettime_nsec_np(CLOCK_UPTIME_RAW) to mach_absolute_time().

Fix the sized-dealloc safety check abort msg.

2a693b8

Split stats_arena_hpa_shard_print function

b82333f

Make multiple functions from `stats_arena_hpa_shard_print` for readability and ease of change in the future.

Move je_cv_thp logic closer to definition

a361e88

Fix ehooks assertion for arena creation

6786934

Enable large hugepage tests for arm64 on Travis

a17385a

Disable psset test when hugepage size is too large.

587676f

Remove unreachable() macro as c23 already defines it.

d8486b2

Taken from https://android-review.git.corp.google.com/c/platform/external/jemalloc_new/+/3316478 This might need more cleanups to remove the definition of JEMALLOC_INTERNAL_UNREACHABLE.

Conditionally remove unreachable for C23+

4b88bdd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] dev from jemalloc:dev #103

[pull] dev from jemalloc:dev #103

pull bot commented Jun 23, 2023 •

edited

Loading

[pull] dev from jemalloc:dev #103

Are you sure you want to change the base?

[pull] dev from jemalloc:dev #103

Conversation

pull bot commented Jun 23, 2023 • edited Loading

pull bot commented Jun 23, 2023 •

edited

Loading