Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Memory Profiler #31534

Closed
wants to merge 5 commits into from
Closed

[WIP] Memory Profiler #31534

wants to merge 5 commits into from

Conversation

staticfloat
Copy link
Member

@staticfloat staticfloat commented Mar 29, 2019

yes-but-why-is-the-ram-gone

Overview

This branch adds a Memory Profiler to the Profile standard library. Accordingly, it re-arranges things such that the time profile code is stored within Profile.Time (but @profile and friends are exported, so it should be a mostly non-breaking change).

Memory Profiler internals

The memory profiler works much the same as the time profiler, it creates large buffers to store event backtraces as well as metadata about the allocation/deallocation. These large buffers are later examined from Julia code and timelines are reconstructed to pair allocations and deallocations appropriately to determine object lifetime, objects that are left "live" at the end of the instrumented region, etc...

Allocations are coarsely classified based upon their allocation domain (CPU, GPU, External, with room for this to grow) and their allocator (big, std or pool, relating to the three allocation code paths within Julia). Because a user may be primarily interested in certain allocations and not at all interested in others (and because Julia may generate many, many allocations over the course of a computation) a rudimentary filtering system exists, where the user may set up a bitmask for domain and allocator, allowing only certain events to be collected. Note that at the time of writing, only the CPU domain is tested.

The recorded allocation information (represented in Julia below):

struct allocation_info_t
    address::Ptr{Cvoid}
    T::Ptr{Cvoid}
    time::Float64
    allocsz::Csize_t
    tag::UInt16
end
  • address represents the pointer to the object that was allocated. If it remains live at the end of the computation, the user can actually inspect the object. If it is no longer live, accessing that memory is a fun way to randomly crash Julia.

  • T represents the type of the object that was allocated. This value (unless it is C_NULL) is always valid to dereference, as types are never destroyed. Type information is possibly unavailable, however this should only be the case when an external program is directly calling Julia's GC functions and is unaware of methods such as jl_memprofile_set_typeof(addr, T).

  • time is the wallclock time at which this event occurred.

  • allocsz is the number of bytes this allocation was made for.

  • tag is the or-wise combination of domain, allocator, and whether this event was an allocation or a deallocation.

Julia-facing API

The Julia side of things revolves around the macro @memprofile ex which runs gc(), turns on memory profiling, runs then the enclosed expression, gc()'s again, and finally turns off memory profiling. Once this has been done, the user can analyze the buffers by running Profile.Memory.read_and_coalesce_memprofile_data(), which returns three values: open_chunks, closed_chunks and ghost_chunks, where an "open" chunk is one that was allocated but never freed, a "closed" chunk is one that was allocated and freed, and a "ghost" chunk is one that was freed but never allocated.

These outputs are all "flat" in that there is no collation of results across backtrace locations, or anything like that. There are a host of interesting visualizations and data output formats that could be made upon this information, I just do not have time to do it right now.

TODO

  • Integrate flat -> tree tools for both Time and Memory profilers. The time profiler already contains some tools to convert the "flat" data arrays into a "tree" format that is substantially more readable and useful. We should be able to abstract that out a bit and create a useful "flat" -> "tree" collator that can deal with both of these kinds of data.

  • Add NEWS.md once the user-facing API is stabilized.

@ViralBShah
Copy link
Member

Must do a blog post to announce this as well!

@timholy
Copy link
Member

timholy commented Mar 29, 2019

Does this replace the --track-allocation=all functionality (and the corresponding tools in https://github.com/JuliaCI/Coverage.jl#memory-allocation), or are there reasons for both?

@StefanKarpinski StefanKarpinski requested a review from timholy March 29, 2019 13:17
@StefanKarpinski StefanKarpinski added the minor change Marginal behavior change acceptable for a minor release label Mar 29, 2019
@StefanKarpinski
Copy link
Member

Great writeup! Could just be the text of the blog post. Since this is a diagnostic tool, I think there's a little more leeway for API breakage for what it's worth.

stdlib/Profile/src/memory.jl Outdated Show resolved Hide resolved
of an allocated and then freed piece of memory. This function has also been known to
create highly competitive dota agents.
"""
function open_AI(a::allocation_info_t, alloc_stacktrace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just extend Base.open/close for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I don't really like that kind of overloading. It feels like a type pun, there's nothing about this kind of "opening" and "closing" that is similar to Base.open or Base.close, and so there's not reason for things like docstrings, method tables, etc... to be shared between them.

base/error.jl Show resolved Hide resolved
@staticfloat
Copy link
Member Author

Does this replace the --track-allocation=all functionality (and the corresponding tools in JuliaCI/Coverage.jl#memory-allocation), or are there reasons for both?

There is definitely some overlap, but I don't fully grok what kinds of memory the --track-allocation keeps track of; for instance I don't think it will tracked pooled allocations (and it certainly won't tell you when the memory was freed).

I'm not sure what the use cases are for --track-allocation, so I think I'll need someone else to speak up and tell me what it's for to know whether this kind of system could "replace" it. Some of the drawbacks to this system are that you need to pre-allocate a chunk of memory, and if you run out you lose the tail end of your stats. (Luckily @time gives you an estimate on the number of allocations you need to hold, so this isn't such a difficult game to win at, although figuring out the proper mapping of num_allocations -> backtrace buffer length can be a bit of a guessing game) Another drawback is that when the allocator is tracking everything some code slows to a crawl; especially compilation (because compiling creates lots of little objects). I also don't have a separation between "user" code and "system" code; that kind of separation could be very useful, especially if we can integrate it into the C code so that system allocations don't even get recorded.

staticfloat and others added 5 commits April 1, 2019 18:32
Also adds `bt_overflow` flag instead of spitting out messages in the
middle of profiling, to be used by client profiling code.

This change allows for better checking of stack frames that could be
incomplete due to insufficient backtrace buffer space.  Realistically, a
single truncated stack trace in the case of a sampling time profiler is
unlikely to create large problems.  However when taking backtraces for
things such as a memory profiler, it is critical that all backtraces be
accurate, and so we allow client code to be somewhat stricter here.
This adds C support for a memory profiler within the GC, tracking
locations of allocations, deallocations, etc...  It operates in a
similar manner as the time profiler with single large buffers setup
beforehand through an initialization function, reducing the need for
expensive allocations while the program being measured is running.

The memory profiler instruments the GC in all locations that the GC
statistics themselves are being modified (e.g. `gc_num.allocd` and
`gc_num.freed`) by introducing new helper functions
`jl_gc_count_{allocd,freed,reallocd}()`.  Those utility functions call
the `jl_memprofile_track_{de,}alloc()` method to register an address,
a size and a tag with the memory profiler.  We also track type
information as this can be critically helpful when debugging, and to do
so without breaking API guarantees we insert methods to set the type of
a chunk of memory after allocating it where necessary.

The tagging system allows the memory profiler to disambiguate, at
profile time, between e.g. pooled allocations and the "big" allocator.
It also allows the memory allocator to support tracking multiple "memory
domains", e.g. a GPU support package could manually call
`jl_memprofile_track_alloc()` any time a chunk of memory is allocated on
the GPU so as to use the same system.  By default, all values are
tracked, however one can set a `memprof_tag_filter` value to track only
the values you are most interested in.  (E.g. only CPU domain big
allocations)

To disambiguate the memory and time profilers, we split them out into
separate modules.
Copy link
Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the re-use of the time-profiling infrastructure here, very nice design.

It's been ages since I poked around the gc, and some time since I thought about profiling, so I am not perfectly suited to give feedback here. But overall I think the API needs to be a bit more layered and designed with "readout performance" in mind. I haven't played with this, but I wouldn't be surprised if just "collecting" memprofile data can take a long time; IIUC (and I may not) I think some relatively simple steps could drastically shorten that time.

EDIT: to expand on that, if you have m allocations that occur with the same backtrace of length n instruction pointers, this implementation is O(m*n) in both memory and time, and the construction of the stacktraces (with all the lookup) will be the slow step. If instead you aggregate into sets of unique backtraces (using just the instruction pointers, prior to the lookup) then you get something that's O(m+n) in the slow step. You can modify your AllocationInfo to allow list(s) of event times and allocation sizes so as to aggregate efficiently.

// Tags applied to memory allocations to specify which domain the memory is
// stored on, and also which "kind" of memory allocator was used.
// When filtering, a filter tag value of `0xffff` means "accept everything".
// We support the "CPU", "GPU" and "External" (e.g. "other") domains.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does mmap fall in this scheme?

}

// Next up; store allocation/type information
memprof_alloc_data[memprof_alloc_data_size].memory_location = v;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe hoist the indexing operation for slightly faster performance?

@@ -0,0 +1,9 @@
# Compatibility shims for old users who aren't used to the `Time` and `Memory` sub-modules
function init(args...; kwargs...)
@warn("Profile.init() is deprecated, use Profile.Time.init() or Profile.Memory.init() directly")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry this much scoping will be irritating to users. How about transitioning to an all-keyword interface?

There's also some concern about deprecations, I have the impression we're trying to avoid those. If so, it might be necessary to preserve the current interface as-is and just add to it.


Build a memory profile tag filter that will capture only events matching the given
filter parameters. This method is meant to mirror the `#define`'d `JL_MEMPROF_TAG_*`
values within `julia_internal,.h`. Valid memory domains are `:cpu`, `:gpu`, `:external`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better not to encourage users to have to open up julia_internal.h, better just to give the valid values here (and then you can add the more detailed info about correspondence as a comment somewhere).

Create an `AllocationInfo` that represents an allocated (and not yet deallocated) chunk
of memory. This `AllocationInfo` will later be "closed" to represent the full lifecycle
of an allocated and then freed piece of memory. This function has also been known to
create highly competitive dota agents.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, a chemistry reference 😉 https://en.wikipedia.org/wiki/DOTA_(chelator)

Less frivolously, I think this design encourages serious performance bottlenecks, see comments below in read_and_coalesce_memprofile_data.

"""
read_and_coalesce_memprofile_data()

Load in the backtrace and allocation data buffers from the C side of Julia-land, returns two
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I count 3. Spell out the meaning of ghost_chunks too.

@info("Opening $(chunk_id) $(a.T) $(a.allocsz)")

# Assert that we're not inserting an identical chunk
@assert !(chunk_id in keys(open_chunks_map)) "Doubly-opened memory chunk!"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I thought, "can't this happen just from re-use of pooled memory?" Needs a comment re the deletion below.


# Any leftover "open" chunks are just plain old leaked memory.
open_chunks = sort(collect(values(open_chunks_map)); by = x -> x.alloc_time)
closed_chunks = sort(closed_chunks, by = x -> x.alloc_time)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort!? (for both)


# If this `a` represents a free(), let's see if we're closing a previously opened chunk.
if !(chunk_id in keys(open_chunks_map))
push!(ghost_chunks, (a, bt))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't ghosts, because ghosts were once alive 😄 More like "lost souls"

get_memprofile_tag_filter() = UInt16(ccall(:jl_memprofile_tag_filter, Cint, ()))

"""
read_and_coalesce_memprofile_data()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an API this feels awkward. The name suggests too much is being crammed into one function. If some users will only want detail on the "open" allocations, then the time spent constructing stacktraces for all the closed events will be a huge unnecessary load. Might be best to (1) parcel into types, (2) look for duplicates in the backtraces, and (3) look up only unique backtraces.

void jl_gc_count_allocd(void * addr, size_t sz, uint16_t tag) JL_NOTSAFEPOINT
{
#ifdef JULIA_ENABLE_THREADING
jl_atomic_fetch_add(&gc_num.allocd, sz);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't. This kills performance. The TODO in the code was wrong and this should be made thread local instead. The function that want to read these values is responsible for adding everything up. The only thread synchronization needed for the allocation thread should be to use relaxed atomic load and store and a normal increment.

@clarkevans
Copy link
Member

clarkevans commented Aug 12, 2020

For my application, I needed to know the resident memory high-water mark and processing time in order to do resource/workload planning. @btime got me timing and allocations, however, it didn't provide me the resident memory high-water mark. I think the hack below helped me get this memory usage detail. I'm sharing this because it might be useful to help understand a particular benchmarking need. Would the proposal above provide a similar metric?

sizings = tempname()
memfile = "/proc/$(Base.Libc.getpid())/statm"
bashcmd = `/bin/bash -c "while :; do cat $(memfile); sleep 1; done"`
parse_sizing(line::String)::Tuple{Float64, Float64} =
   tuple([round(parse(Int, size)*4096 /1000/1000, digits=2)
             for size in split(strip(line), " ")[1:2]]...)
gc()
(baseline_virtual, baseline_resident) = parse_sizing(read(memfile, String))
timer = run(pipeline(bashcmd, stdout=sizings), wait=false)
@btime my_long_running_process()  samples=5
kill(timer)
samples = [parse_sizing(string(line)) for line in
            split(strip(read(sizings, String)), "\n")]
rm(sizings)
max_virtual = max([a - baseline_virtual for (a,b) in samples]...)
max_resident = max([b - baseline_resident for (a,b) in samples]...)

What I discovered using the hack above was helpful to me. On my 32GB box, the performance is quick till I hit about 5GB, at which point the GC kicked in to keep my memory usage down (but increasing processing time, accounting for 50-80% of time). I could use this hack to see just how much cpu and resident memory I need per unit of work, so that I could do resource planning. Moreover, I was surprised to see just how much GC costs are for my workloads. This also let me speculate just how important vectored representations of fixed size data could be with regard to garbage collection costs; e.g. Vector{SubString} as pointers into a huge memory mapped JSON file rather than thousands of small strings. As a side note, I didn't see how I could open a process and have its output buffered; hence, I used a temporary file. Moreover, I didn't see how I could easily fork() the process, to have one procss monitor memory and the other run the computation; hence, I used bash.

@DilumAluthge DilumAluthge deleted the sf/memprofiler branch March 25, 2021 21:58
@vilterp vilterp mentioned this pull request Sep 22, 2021
29 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor change Marginal behavior change acceptable for a minor release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants