Skip to content

Commit

Permalink
Allocation profiler (#42768)
Browse files Browse the repository at this point in the history
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: #33467
Complementary to garbage profiler PR: #42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from #33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@Profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <[email protected]>
  • Loading branch information
vilterp and NHDaly authored Jan 19, 2022
1 parent 5864e43 commit ef23d6d
Show file tree
Hide file tree
Showing 13 changed files with 620 additions and 2 deletions.
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,9 @@ Standard library changes
Further, percent utilization is now reported as a total or per-thread, based on whether the thread is idle or not at
each sample. `Profile.fetch()` by default strips out the new metadata to ensure backwards compatibility with external
profiling data consumers, but can be included with the `include_meta` kwarg. ([#41742])
* The new `Profile.Allocs` module allows memory allocations to be profiled. The stack trace, type, and size of each
allocation is recorded, and a `sample_rate` argument allows a tunable amount of allocations to be skipped,
reducing performance overhead. ([#42768])

#### Random

Expand Down
25 changes: 25 additions & 0 deletions doc/src/manual/profile.md
Original file line number Diff line number Diff line change
Expand Up @@ -336,6 +336,31 @@ and how much garbage it collects each time. This can be enabled with
[`GC.enable_logging(true)`](@ref), which causes Julia to log to stderr every time
a garbage collection happens.

### Allocation Profiler

The allocation profiler records the stack trace, type, and size of each
allocation while it is running. It can be invoked with
[`Profile.Allocs.@profile`](@ref).

This information about the allocations is returned as an array of `Alloc`
objects, wrapped in an `AllocResults` object. The best way to visualize
these is currently with the [PProf.jl](https://github.com/JuliaPerf/PProf.jl)
library, which can visualize the call stacks which are making the most
allocations.

The allocation profiler does have significant overhead, so a `sample_rate`
argument can be passed to speed it up by making it skip some allocations.
Passing `sample_rate=1.0` will make it record everything (which is slow);
`sample_rate=0.1` will record only 10% of the allocations (faster), etc.

!!! note

The current implementation of the Allocations Profiler _does not
capture all allocations._ You can read more about the missing allocations
and the plan to improve this, here: https://github.com/JuliaLang/julia/issues/43688.
Calling `Profile.Allocs.fetch()` will print a log line reporting the percentage
of missed allocations, so you can understand the accuracy of your profile.

## External Profiling

Currently Julia supports `Intel VTune`, `OProfile` and `perf` as external profiling tools.
Expand Down
4 changes: 2 additions & 2 deletions src/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ RUNTIME_SRCS := \
jltypes gf typemap smallintset ast builtins module interpreter symbol \
dlload sys init task array dump staticdata toplevel jl_uv datatype \
simplevector runtime_intrinsics precompile \
threading partr stackwalk gc gc-debug gc-pages gc-stacks method \
threading partr stackwalk gc gc-debug gc-pages gc-stacks gc-alloc-profiler method \
jlapi signal-handling safepoint timing subtype \
crc32c APInt-C processor ircode opaque_closure codegen-stubs coverage
SRCS := jloptions runtime_ccall rtutils
Expand Down Expand Up @@ -289,7 +289,7 @@ $(BUILDDIR)/disasm.o $(BUILDDIR)/disasm.dbg.obj: $(SRCDIR)/debuginfo.h $(SRCDIR)
$(BUILDDIR)/dump.o $(BUILDDIR)/dump.dbg.obj: $(addprefix $(SRCDIR)/,common_symbols1.inc common_symbols2.inc builtin_proto.h serialize.h)
$(BUILDDIR)/gc-debug.o $(BUILDDIR)/gc-debug.dbg.obj: $(SRCDIR)/gc.h
$(BUILDDIR)/gc-pages.o $(BUILDDIR)/gc-pages.dbg.obj: $(SRCDIR)/gc.h
$(BUILDDIR)/gc.o $(BUILDDIR)/gc.dbg.obj: $(SRCDIR)/gc.h
$(BUILDDIR)/gc.o $(BUILDDIR)/gc.dbg.obj: $(SRCDIR)/gc.h $(SRCDIR)/gc-alloc-profiler.h
$(BUILDDIR)/init.o $(BUILDDIR)/init.dbg.obj: $(SRCDIR)/builtin_proto.h
$(BUILDDIR)/interpreter.o $(BUILDDIR)/interpreter.dbg.obj: $(SRCDIR)/builtin_proto.h
$(BUILDDIR)/jitlayers.o $(BUILDDIR)/jitlayers.dbg.obj: $(SRCDIR)/jitlayers.h $(SRCDIR)/codegen_shared.h
Expand Down
1 change: 1 addition & 0 deletions src/array.c
Original file line number Diff line number Diff line change
Expand Up @@ -508,6 +508,7 @@ JL_DLLEXPORT jl_value_t *jl_alloc_string(size_t len)
s = jl_gc_big_alloc(ptls, allocsz);
}
jl_set_typeof(s, jl_string_type);
maybe_record_alloc_to_profile(s, len);
*(size_t*)s = len;
jl_string_data(s)[len] = 0;
return s;
Expand Down
139 changes: 139 additions & 0 deletions src/gc-alloc-profiler.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
// This file is a part of Julia. License is MIT: https://julialang.org/license

#include "gc-alloc-profiler.h"

#include "julia_internal.h"
#include "gc.h"

#include <string>
#include <vector>

using std::string;
using std::vector;

struct jl_raw_backtrace_t {
jl_bt_element_t *data;
size_t size;
};

struct jl_raw_alloc_t {
jl_datatype_t *type_address;
jl_raw_backtrace_t backtrace;
size_t size;
};

// == These structs define the global singleton profile buffer that will be used by
// callbacks to store profile results. ==
struct jl_per_thread_alloc_profile_t {
vector<jl_raw_alloc_t> allocs;
};

struct jl_alloc_profile_t {
double sample_rate;

vector<jl_per_thread_alloc_profile_t> per_thread_profiles;
};

struct jl_combined_results {
vector<jl_raw_alloc_t> combined_allocs;
};

// == Global variables manipulated by callbacks ==

jl_alloc_profile_t g_alloc_profile;
int g_alloc_profile_enabled = false;
jl_combined_results g_combined_results; // Will live forever.

// === stack stuff ===

jl_raw_backtrace_t get_raw_backtrace() {
// A single large buffer to record backtraces onto
static jl_bt_element_t static_bt_data[JL_MAX_BT_SIZE];

size_t bt_size = rec_backtrace(static_bt_data, JL_MAX_BT_SIZE, 2);

// Then we copy only the needed bytes out of the buffer into our profile.
size_t bt_bytes = bt_size * sizeof(jl_bt_element_t);
jl_bt_element_t *bt_data = (jl_bt_element_t*) malloc(bt_bytes);
memcpy(bt_data, static_bt_data, bt_bytes);

return jl_raw_backtrace_t{
bt_data,
bt_size
};
}

// == exported interface ==

extern "C" { // Needed since these functions doesn't take any arguments.

JL_DLLEXPORT void jl_start_alloc_profile(double sample_rate) {
// We only need to do this once, the first time this is called.
while (g_alloc_profile.per_thread_profiles.size() < jl_n_threads) {
g_alloc_profile.per_thread_profiles.push_back(jl_per_thread_alloc_profile_t{});
}

g_alloc_profile.sample_rate = sample_rate;
g_alloc_profile_enabled = true;
}

JL_DLLEXPORT jl_profile_allocs_raw_results_t jl_fetch_alloc_profile() {
// combine allocs
// TODO: interleave to preserve ordering
for (auto& profile : g_alloc_profile.per_thread_profiles) {
for (const auto& alloc : profile.allocs) {
g_combined_results.combined_allocs.push_back(alloc);
}

profile.allocs.clear();
}

return jl_profile_allocs_raw_results_t{
g_combined_results.combined_allocs.data(),
g_combined_results.combined_allocs.size(),
};
}

JL_DLLEXPORT void jl_stop_alloc_profile() {
g_alloc_profile_enabled = false;
}

JL_DLLEXPORT void jl_free_alloc_profile() {
// Free any allocs that remain in the per-thread profiles, that haven't
// been combined yet (which happens in fetch_alloc_profiles()).
for (auto& profile : g_alloc_profile.per_thread_profiles) {
for (auto alloc : profile.allocs) {
free(alloc.backtrace.data);
}
profile.allocs.clear();
}

// Free the allocs that have been already combined into the combined results object.
for (auto alloc : g_combined_results.combined_allocs) {
free(alloc.backtrace.data);
}

g_combined_results.combined_allocs.clear();
}

// == callback called into by the outside ==

void _maybe_record_alloc_to_profile(jl_value_t *val, size_t size) JL_NOTSAFEPOINT {
auto& global_profile = g_alloc_profile;
auto& profile = global_profile.per_thread_profiles[jl_threadid()];

auto sample_val = double(rand()) / double(RAND_MAX);
auto should_record = sample_val <= global_profile.sample_rate;
if (!should_record) {
return;
}

auto type = (jl_datatype_t*)jl_typeof(val);
profile.allocs.emplace_back(jl_raw_alloc_t{
type,
get_raw_backtrace(),
size
});
}

} // extern "C"
49 changes: 49 additions & 0 deletions src/gc-alloc-profiler.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
// This file is a part of Julia. License is MIT: https://julialang.org/license

#ifndef JL_GC_ALLOC_PROFILER_H
#define JL_GC_ALLOC_PROFILER_H

#include "julia.h"
#include "ios.h"

#ifdef __cplusplus
extern "C" {
#endif

// ---------------------------------------------------------------------
// The public interface to call from Julia for allocations profiling
// ---------------------------------------------------------------------

// Forward-declaration to avoid depenency in header file.
struct jl_raw_alloc_t; // Defined in gc-alloc-profiler.cpp

typedef struct {
struct jl_raw_alloc_t *allocs;
size_t num_allocs;
} jl_profile_allocs_raw_results_t;

JL_DLLEXPORT void jl_start_alloc_profile(double sample_rate);
JL_DLLEXPORT jl_profile_allocs_raw_results_t jl_fetch_alloc_profile(void);
JL_DLLEXPORT void jl_stop_alloc_profile(void);
JL_DLLEXPORT void jl_free_alloc_profile(void);

// ---------------------------------------------------------------------
// Functions to call from GC when alloc profiling is enabled
// ---------------------------------------------------------------------

void _maybe_record_alloc_to_profile(jl_value_t *val, size_t size) JL_NOTSAFEPOINT;

extern int g_alloc_profile_enabled;

static inline void maybe_record_alloc_to_profile(jl_value_t *val, size_t size) JL_NOTSAFEPOINT {
if (__unlikely(g_alloc_profile_enabled)) {
_maybe_record_alloc_to_profile(val, size);
}
}

#ifdef __cplusplus
}
#endif


#endif // JL_GC_ALLOC_PROFILER_H
1 change: 1 addition & 0 deletions src/gc.h
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
#endif
#endif
#include "julia_assert.h"
#include "gc-alloc-profiler.h"

#ifdef __cplusplus
extern "C" {
Expand Down
2 changes: 2 additions & 0 deletions src/julia_internal.h
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
#include "support/hashing.h"
#include "support/ptrhash.h"
#include "support/strtod.h"
#include "gc-alloc-profiler.h"
#include <uv.h>
#if !defined(_WIN32)
#include <unistd.h>
Expand Down Expand Up @@ -364,6 +365,7 @@ STATIC_INLINE jl_value_t *jl_gc_alloc_(jl_ptls_t ptls, size_t sz, void *ty)
v = jl_gc_big_alloc(ptls, allocsz);
}
jl_set_typeof(v, ty);
maybe_record_alloc_to_profile(v, sz);
return v;
}

Expand Down
17 changes: 17 additions & 0 deletions stdlib/Profile/docs/src/index.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# [Profiling](@id lib-profiling)

## CPU Profiling

```@docs
Profile.@profile
```
Expand All @@ -15,3 +17,18 @@ Profile.retrieve
Profile.callers
Profile.clear_malloc_data
```

## Memory profiling

```@docs
Profile.Allocs.@profile
```

The methods in `Profile.Allocs` are not exported and need to be called e.g. as `Profile.Allocs.fetch()`.

```@docs
Profile.Allocs.clear
Profile.Allocs.fetch
Profile.Allocs.start
Profile.Allocs.stop
```
Loading

8 comments on commit ef23d6d

@vtjnash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nanosoldier runbenchmarks(ALL, vs="25864e4341b0f11879a566bb6dbc571e230c26691")

@vtjnash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nanosoldier runbenchmarks(ALL, vs="@5864e4341b0f11879a566bb6dbc571e230c26691")

@nanosoldier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something went wrong when running your job:

NanosoldierError: error when preparing/pushing to report repo: failed process: Process(setenv(`git push`; dir="/nanosoldier/workdir/NanosoldierReports"), ProcessExited(1)) [1]

Unfortunately, the logs could not be uploaded.

@vtjnash
Copy link
Member

@vtjnash vtjnash commented on ef23d6d Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtjnash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nanosoldier runbenchmarks("broadcast", vs="@5864e4341b0f11879a566bb6dbc571e230c26691")

@nanosoldier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@NHDaly
Copy link
Member

@NHDaly NHDaly commented on ef23d6d Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for running Jameson - this appears to be perf neutral.

@NHDaly
Copy link
Member

@NHDaly NHDaly commented on ef23d6d Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @vilterp

Please sign in to comment.