forked from jemalloc/jemalloc
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pull] dev from jemalloc:dev #103
Open
pull
wants to merge
163
commits into
jamlee-t:dev
Choose a base branch
from
jemalloc:dev
base: dev
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
+8,251
−2,349
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fix or suppress the remaining warnings generated by static analysis. This is a necessary step before we can incorporate static analysis into CI. Where possible, I've preferred to modify the code itself instead of just disabling the warning with a magic comment, so that if we decide to use different static analysis tools in the future we will be covered against them raising similar warnings.
Now that all of the various issues that static analysis uncovered have been fixed (#2431, #2432, #2433, #2436, #2437, #2446), I've added a GitHub action which will run static analysis for every PR going forward. When static analysis detects issues with your code, the GitHub action provides a link to download its findings in a form tailored for human consumption. Take a look at [this demonstration of what it looks like when static analysis issues are found](https://github.com/Svetlitski/jemalloc/actions/runs/5010245602) on my fork for an example (make sure to follow the instructions in the error message to download and inspect the results).
Additionally, added a GitHub Action to ensure no more trailing whitespace will creep in again in the future. I'm excluding Markdown files from this check, since trailing whitespace is significant there, and also excluding `build-aux/install-sh` because there is significant trailing whitespace on the line that sets `defaultIFS`.
It turns out LLVM does not include a build for every platform in the assets for every release, just some of them. As such, I've pinned us to the latest release version with a corresponding build.
We have observed new workload patterns (namely ML training type) that cycle through oversized allocations frequently, because 1) the dataset might be sparse which is faster to go through, and 2) GPU accelerated. As a result, the eager purging from the oversize arena becomes a bottleneck. To offer an easy solution, allow normal purging of the oversized extents when background threads are enabled.
Previously, small allocations which were sampled as part of heap profiling were rounded up to `SC_LARGE_MINCLASS`. This additional memory usage becomes problematic when the page size is increased, as noted in #2358. Small allocations are now rounded up to the nearest multiple of `PAGE` instead, reducing the memory overhead by a factor of 4 in the most extreme cases.
Validate that small allocations (i.e. those with `size <= SC_SMALL_MAXCLASS`) which are sampled for profiling maintain the expected invariants even though they now take up less space.
For better or worse, Jemalloc has a significant number of global variables. Making all eligible global variables `static` and/or `const` at least makes it slightly easier to reason about them, as these qualifications communicate to the programmer restrictions on their use without having to `grep` the whole codebase.
Adding `-Wstrict-prototypes` to the default `CFLAGS` in PR #2473 had the non-obvious side-effect of breaking configure-time feature detection, because the [test-program `autoconf` generates for feature detection](https://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/Generating-Sources.html#:~:text=main%20()) defines `main` as: ```c int main() ``` Which causes all feature checks to fail, since this triggers `-Wstrict-prototypes` and the feature checks use `-Werror`. Resolved by only adding `-Wstrict-prototypes` to `EXTRA_{CFLAGS,CXXFLAGS}` in CI, since these flags are not used during feature detection and we control which compiler is used.
For the sake of consistency, function definitions and their corresponding declarations should use the same names for parameters. I've enabled this check in static analysis to prevent this issue from occurring again in the future.
When stderr is a terminal and supports color, print error messages from tests in red to make them stand out from the surrounding output.
This makes it faster and easier to debug, so that you don't need to fire up a debugger just to see which assertion triggered in a failing test.
At least for LLVM, [casting from an integer to a pointer hides provenance information](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) and inhibits optimizations. Here's a [Godbolt link](https://godbolt.org/z/5bYPcKoWT) showing how this change removes a couple unnecessary branches in `phn_merge_siblings`, which is a very hot function. Canary profiles show only minor improvements (since most of the cost of this function is in cache misses), but there's no reason we shouldn't take it.
This is a prerequisite to achieving self-contained headers. Previously, the various tsd implementation headers (`tsd_generic.h`, `tsd_tls.h`, `tsd_malloc_thread_cleanup.h`, and `tsd_win.h`) relied implicitly on being included in `tsd.h` after a variety of dependencies had been defined above them. This commit instead makes these dependencies explicit by splitting them out into a separate file, `tsd_internals.h`, which each of the tsd implementation headers includes directly.
Header files are now self-contained, which makes the relationships between the files clearer, and crucially allows LSP tools like `clangd` to function correctly in all of our header files. I have verified that the headers are self-contained (aside from the various Windows shims) by compiling them as if they were C files – in a follow-up commit I plan to add this to CI to ensure we don't regress on this front.
[N2699 - Sized Memory Deallocation](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm) introduced two new functions which were incorporated into the C23 standard, `free_sized` and `free_aligned_sized`. Both already have analogues in Jemalloc, all we are doing here is adding the appropriate wrappers.
These warnings are not useful, and make the output of some CI jobs enormous and difficult to read, so let's suppress them.
This makes the code more readable on its own, and also sets the stage for more cleanly handling the pointer provenance lints in a following commit.
Following from PR #2481, we replace all integer-to-pointer casts [which hide pointer provenance information (and thus inhibit optimizations)](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) with equivalent operations that preserve this information. I have enabled the corresponding clang-tidy check in our static analysis CI so that we do not get bitten by this again in the future.
In an attempt to make all headers self-contained, I inadvertently added `#include`s which refer to intermediate, generated headers that aren't included in the final install. Closes #2489.
While this function isn't particularly hot, (accounting for just 0.27% of time spent inside the allocator on average across the fleet), looking at the generated assembly and performance profiles does show we're dispatching to multiple different `memset`s when we could instead be just tail-calling `memset` once, reducing code size and marginally improving performance.
…set` Sampled allocations were not being demoted before being deallocated during an `arena_reset` operation.
Previously the option causing trouble will not be printed, unless the option key:value pair format is found.
pthread_cond_wait drops and re-acquires the mutex internally, w/o going through our wrapper. Update the locked state explicitly.
On MSVC, log is an intrinsic that doesn't require libm. However, AC_SEARCH_LIBS does not successfully detect this, as it will try to compile a program using the wrong signature for log. Newer versions of MSVC CL detects this and rejects the program with the following messages: conftest.c(40): warning C4391: 'char log()': incorrect return type for intrinsic function, expected 'double' conftest.c(44): error C2168: 'log': too few actual parameters for intrinsic function Since log is always available on MSVC (it's been around since the dawn of time), we simply always assume it's there if MSVC is detected.
During boot, some mutexes are not initialized yet, plus there's no point taking many mutexes while everything is covered by the global init lock, so the locking assumptions in some functions (e.g. background_thread_enabled_set()) can't be enforced. Skip the lock owner check in this case.
Option `experimental_hpa_strict_min_purge_interval` was expected to be temporary to simplify rollout of a bugfix. Now, when bugfix rollout is complete it is safe to remove this option.
It has been disabled since 5.2.0 (in #1421).
Prefer clock_gettime_nsec_np(CLOCK_UPTIME_RAW) to mach_absolute_time().
Milliseconds are used a lot in hpa, so it is convenient to have `nstime_ms_since` function instead of dividing to `MILLION` constantly. For consistency renamed `nstime_msec` to `nstime_ms` as `ms` abbreviation is used much more commonly across codebase than `msec`. ``` $ grep -Rn '_msec' include src | wc -l 2 $ grep -RPn '_ms( |,|:)' include src | wc -l 72 ``` Function `nstime_msec` wasn't used anywhere in the code yet.
Make multiple functions from `stats_arena_hpa_shard_print` for readability and ease of change in the future.
Linux 6.1 introduced `MADV_COLLAPSE` flag to perform a best-effort synchronous collapse of the native pages mapped by the memory range into transparent huge pages. Synchronous hugification might be beneficial for at least two reasons: we are not relying on khugepaged anymore and get an instant feedback if range wasn't hugified. If `hpa_hugify_sync` option is on, we'll try to perform synchronously collapse and if it wasn't successful, we'll fallback to asynchronous behaviour.
Config validation was introduced at 3aae792 with main intention to fix infinite purging loop, but it didn't actually fix the underlying problem, just masked it. Later 47d69b4 was merged to address the same problem. Options `hpa_dirty_mult` and `hpa_hugification_threshold` have different application dimensions: `hpa_dirty_mult` applied to active memory on the shard, but `hpa_hugification_threshold` is a threshold for single pageslab (hugepage). It doesn't make much sense to sum them up together. While it is true that too high value of `hpa_dirty_mult` and too low value of `hpa_hugification_threshold` can lead to pathological behaviour, it is true for other options as well. Poor configurations might lead to suboptimal and sometimes completely unacceptable behaviour and that's OK, that is exactly the reason why they are called poor. There are other mechanism exist to prevent extreme behaviour, when we hugified and then immediately purged page, see `hpa_hugify_blocked_by_ndirty` function, which exist to prevent exactly this case. Lastly, `hpa_dirty_mult + hpa_hugification_threshold >= 1` constraint is too tight and prevents a lot of valid configurations.
When evaluating changes in HPA logic, it is useful to know internal `hpa_shard` state. Great deal of this state is `psset`. Some of the `psset` stats was available, but in disaggregated form, which is not very convenient. This commit exposed `psset` counters to `mallctl` and malloc stats dumps. Example of how malloc stats dump will look like after the change. HPA shard stats: Pageslabs: 14899 (4354 huge, 10545 nonhuge) Active pages: 6708166 (2228917 huge, 4479249 nonhuge) Dirty pages: 233816 (331 huge, 233485 nonhuge) Retained pages: 686306 Purge passes: 8730 (10 / sec) Purges: 127501 (146 / sec) Hugeifies: 4358 (5 / sec) Dehugifies: 4 (0 / sec) Pageslabs, active pages, dirty pages and retained pages are rows added by this change.
We are trying to create `ncpus * 2` threads for this test and place them into `VARIABLE_ARRAY`, but `VARIABLE_ARRAY` can not be more than `VARIABLE_ARRAY_SIZE_MAX` bytes. When there are a lot of threads on the box test always fails. ``` $ nproc 176 $ make -j`nproc` tests_unit && ./test/unit/retained <jemalloc>: ../test/unit/retained.c:123: Failed assertion: "sizeof(thd_t) * (nthreads) <= VARIABLE_ARRAY_SIZE_MAX" Aborted (core dumped) ``` There is no need for high concurrency for this test as we are only checking stats there and it's behaviour is quite stable regarding number of allocating threads. Limited number of threads to 16 to save compute resources (on CI for example) and reduce tests running time. Before the change (`nproc` is 80 on this box). ``` $ make -j`nproc` tests_unit && time ./test/unit/retained <...> real 0m0.372s user 0m14.236s sys 0m12.338s ``` After the change (same box). ``` $ make -j`nproc` tests_unit && time ./test/unit/retained <...> real 0m0.018s user 0m0.108s sys 0m0.068s ```
Taken from https://android-review.git.corp.google.com/c/platform/external/jemalloc_new/+/3316478 This might need more cleanups to remove the definition of JEMALLOC_INTERNAL_UNREACHABLE.
The gettid() function is available on Linux in glibc only since version 2.30. There are supported distributions that still use older glibc version. Thus add a configure check if the gettid() function is available and extend the check in src/prof_stack_range.c so it's skipped also when gettid() isn't available. Fixes: #2740
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )