Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows profiler not profiling #3960

Closed
ncnc opened this issue Aug 6, 2013 · 17 comments
Closed

Windows profiler not profiling #3960

ncnc opened this issue Aug 6, 2013 · 17 comments
Labels
system:windows Affects only Windows

Comments

@ncnc
Copy link
Contributor

ncnc commented Aug 6, 2013

I'm using a 32-bit Windows build (MinGW, source pulled 6-Aug-13) on Windows 8.

I wanted to try out the new built-in profiler so, following the example in the manual, I defined myfunc.jl as:

function myfunc()
    A = rand(100, 100, 200)
    max(A)
end

and ran the following

julia> require("myfunc.jl")

julia> @profile (for i=1:100; myfunc(); end)
@profile (for i=1:100; myfunc(); end)

julia> Profile.print();
Profile.print();
         516 client.jl; _start; line: 372
          516 client.jl; run_repl; line: 166
             516 client.jl; eval_user_input; line: 91
                516 profile.jl; anonymous; line: 1
                   2  ...r\bin\myfunc.jl; myfunc; line: 2

julia>

However, this doesn't look quite right, in particular all traces except two seem to vanish into thin air after profile.jl.

Any ideas what's happening and/or how to debug it further?

@timholy
Copy link
Member

timholy commented Aug 6, 2013

516 backtraces is an awful lot compared to the example. That fact, coupled with the code snippet you showed, suggests that you didn't run it once to get it compiled first? (Since compiling takes time, that's consistent with an abundance of backtraces.) So probably most of those are in the compiler. I'd run it once and then collect the profiling data, or say Profile.clear() and try again.

As far as missing any other lines: you can try Profile.print(C=true) to see if it's disappearing down into the C. In theory you should be able to see the compiler operating (that works on Linux, anyway).

I don't have any direct experience with backtraces on Windows, so of course it's also possible that there are issues in terms of completeness.

@vtjnash
Copy link
Member

vtjnash commented Aug 7, 2013

Ah yes, I missed that. Given the quality of the windows platform (esp. the timers and backtrace infrastructure), you will probably need to run your code approximately 100x longer than on linux to get an equivalent sample.

The Windows version of the code for profiling C code is not yet written, so I'm not sure what it will tell you if you try (probably nothing).

@ncnc
Copy link
Contributor Author

ncnc commented Aug 7, 2013

[edit: the below reasoning is somewhat misguided, please refer to the next post]

It seems like jl_lookup_code_address() fails to link the location of (library) code to it's parent julia function.

Here's part of the lidict output from Profile.retreive():

[0x68e08fb6=>LineInfo("?","?",-1),0x68e0a9e5=>LineInfo("?","?",-1),0x68dd364e=>L
ineInfo("?","?",-1),0x68e090bf=>LineInfo("?","?",-1),0xc6546a26=>LineInfo("?","?
",-1),0x9b87c121=>LineInfo("?","?",-1),0xed32a3c7=>LineInfo("?","?",-1),0x6f020d
ad=>LineInfo("?","?",-1),0x002502a3=>LineInfo("?","?",-1),0x7798bf39=>LineInfo("
?","?",-1),0x64242639=>LineInfo("?","?",-1),0x64242633=>LineInfo("?","?",-1),0x6
42425c7=>LineInfo("?","?",-1), ...

At least some libraries seem to be at ~0x68000000.

I don't think it's directly related to the FIXME that you mention in frame_info_from_ip() in task.c, because doCframes is false by default so that code is never called in my case.

Could there be an issue with getFunctionInfo() in debuginfo.cpp? I don't think that function is used anywhere else?

@ncnc
Copy link
Contributor Author

ncnc commented Aug 7, 2013

Actually, looking close at the output, the raw data seems to be in reasonable shape after all. Might be some JIT:ed code that isn't recognized, but most of the information is there.

Here's an example dump of a stack trace together with the corresponding LineInfo data:

0x00000000      LineInfo("?","?",-1)
0x6874d7b7      LineInfo("?","?",-1)
0x02b80b11      LineInfo("myfunc","myfunc.jl",3)
0x02b80a7b      LineInfo("?","?",-1)
0x68dce32b      LineInfo("?","?",-1)
0x02b809e7      LineInfo("anonymous","profile.jl",1)
0x68e040fc      LineInfo("?","?",-1)
0x68dd364e      LineInfo("?","?",-1)
0x02b7b4d6      LineInfo("eval_user_input","client.jl",91)
0x02b7b1be      LineInfo("?","?",-1)
0x68dce32b      LineInfo("?","?",-1)
0x02b40a59      LineInfo("run_repl","client.jl",166)
0x02b40859      LineInfo("_start","client.jl",372)
0x02b401cb      LineInfo("?","?",-1)
0x68dce32b      LineInfo("?","?",-1)
0x004018c8      LineInfo("?","?",-1)
0x68dfc6ca      LineInfo("?","?",-1)
0x00435745      LineInfo("?","?",-1)
0x004013fa      LineInfo("?","?",-1)
0x764b850d      LineInfo("?","?",-1)
0x7798bf39      LineInfo("?","?",-1)
0x7798bf0c      LineInfo("?","?",-1)
0x00000000      LineInfo("?","?",-1)

In this case, all relevant information seems to be present, but still Profile.print() doesn't show myfunc() at all:

julia> Profile.print()
Profile.print()
         47 client.jl; _start; line: 372
          47 client.jl; run_repl; line: 166
             47 client.jl; eval_user_input; line: 91
                47 profile.jl; anonymous; line: 1

julia>

I'll have to do some more digging, but I'm guessing that the unidentified lines mess up the profiling logic.

@timholy
Copy link
Member

timholy commented Aug 7, 2013

I'll have to do some more digging, but I'm guessing that the unidentified lines mess up the profiling logic.

They shouldn't; I get lots of unidentified lines on Linux, and they don't cause a problem.

In the abstract it's hard for me to figure out the problem. If you prefer to debug it yourself (clearly you've already learned a lot about how Profile works), first try Profile.print(format=:flat) and see what happens. Then, you can print just a single backtrace:

data, lidict = Profile.retrieve();
bt, counts = Profile.tree_aggregate(data);
# Now find a backtrace that seems reasonable, e.g., bt[1]
Profile.tree(STDOUT, bt[1:1], [1], lidict, 0, true, 80)

The only thing I've noticed is that there's at least one strange thing about your copy/paste example: do you always get 2 blank lines after Profile.print()? I only get one on Linux. Makes me wonder if it's a printing problem? (That seems unlikely, I know.)

Alternatively, if you want me to help debug, can you send me your backtrace data? I've already gotten this set up so that it should be pretty trivial (requires installing HDF5, see installation instructions at https://github.com/timholy/HDF5.jl)

using HDF5, JLD
data, lidict = Profile.retrieve()
f = jldopen("profile.jld","w")
@write f data
@write f lidict
close(f)

@ncnc
Copy link
Contributor Author

ncnc commented Aug 7, 2013

Thanks, I'll take a closer look when I have a moment. From a brief look, it seems like information disappears at the tree_aggregate() stage.

@ncnc
Copy link
Contributor Author

ncnc commented Aug 7, 2013

I did some more digging and it seems like the cause is btskip in profile.jl. The current value of two cuts off relevant information for me, as can be seen from the stack trace dump I posted earlier today (myfunc is the second lowest entry, the lowest entry is likely library code, both are relevant). A btskip of zero would make things work.

OTOH, maybe it would be cleaner to use the same conventions in the backtrace code between platforms, if possible? @vtjnash?

@timholy timholy closed this as completed in bfb20f8 Aug 7, 2013
@timholy
Copy link
Member

timholy commented Aug 7, 2013

Thanks for figuring that out! Can you rebuild julia and test now?

@ncnc
Copy link
Contributor Author

ncnc commented Aug 7, 2013

Seems to work, AFAICS:

julia> @profile myfunc()
@profile myfunc()
0.9999994007985493

julia> Profile.print()
Profile.print()
         36 client.jl; _start; line: 372
          1  client.jl; run_repl; line: 158
          35 client.jl; run_repl; line: 166
             35 client.jl; eval_user_input; line: 91
                35 profile.jl; anonymous; line: 14
                   27 ...r\bin\myfunc.jl; myfunc; line: 3
                    3  reduce.jl; max; line: 35

julia> @profile (for i=1:100; myfunc(); end)
@profile (for i=1:100; myfunc(); end)

julia> Profile.print()
Profile.print()
         2777 client.jl; _start; line: 372
          1    client.jl; run_repl; line: 158
          2776 client.jl; run_repl; line: 166
             2776 client.jl; eval_user_input; line: 91
                2741 profile.jl; anonymous; line: 1
                   1    ...r\bin\myfunc.jl; myfunc; line: 2
                   1885 ...r\bin\myfunc.jl; myfunc; line: 3
                    104  reduce.jl; max; line: 35
                    4    reduce.jl; max; line: 36
                    108  reduce.jl; max; line: 37
                35   profile.jl; anonymous; line: 14
                   27 ...r\bin\myfunc.jl; myfunc; line: 3
                    3  reduce.jl; max; line: 35

Hopefully it doesn't break 64-bit Windows.

@timholy
Copy link
Member

timholy commented Aug 7, 2013

At most, having btskip > 0 is merely a convenience (on linux you'd otherwise see steps that occur inside the signal-handler, which is not wrong, it's just not useful). I'd be very surprised if setting it to 0 broke anything.

Thanks for working through this with me!

@vtjnash
Copy link
Member

vtjnash commented Aug 20, 2013

LGTM.

There isn't two extra frames on windows to insert, so it's probably OK to have a different value for btskip. Alternatively, we could avoid including those extraneous two, and probably fit a few more backtraces into memory.

Note, I added a new macro recently so you could write this more concisely as: const btskip = @windows? 2 : 0

@timholy
Copy link
Member

timholy commented Aug 20, 2013

Alternatively, we could avoid including those extraneous two, and probably fit a few more backtraces into memory.

I thought about that, but I was concerned doing that part in the C code might make it more confusing to debug if platforms differed (which indeed turned out to be the case). However, I don't think it's a major point. Perhaps more relevantly, in practice I think many backtraces have length ~20 or so, so the gain would be modest.

@StefanKarpinski
Copy link
Member

Would run-length-encoding backtraces like we do when showing errors help?

@timholy
Copy link
Member

timholy commented Aug 20, 2013

It would help a lot, and I considered it from the outset, but I have the impression that you don't want to allocate memory in a signal handler. If I'm wrong about that, then this might be viable. But note we'd have to write a "dict" type in pure C.

@timholy
Copy link
Member

timholy commented Aug 20, 2013

...or use STL's map, if memory allocation in a signal handler is OK.

@vtjnash
Copy link
Member

vtjnash commented Aug 21, 2013

It is not possible to allocate more memory in a signal handler. It will (eventually, randomly) lock up the program. But you can pre-allocate as much space as you want, e.g. for a fixed size dictionary.

2/20 = 10% savings isn't too bad.

Alternatively, maybe it would be possible to do some really minimal but fast compression by recording how much of the prefix from the previous backtrace is duplicated in the current frame. This could also potentially help speed printing of the frames afterwards.

@timholy
Copy link
Member

timholy commented Aug 21, 2013

It is not possible to allocate more memory in a signal handler.

I suspected as much, thanks for clarifying.

A third possibility would be to use ProfileView's more sophisticated tree structure and add new nodes on-the-fly, while also updating counts for the final (triggering) line in each backtrace. The cool thing about this idea is that it automatically incorporates your "prefix" idea without being restricted to the previous backtrace. (I don't think we care about the order of the backtraces, which would be discarded with this approach). This is also an attractive solution from the standpoint of the constraint on memory allocation; we could allocate a big initial pool of nodes, but have them all be "disconnected." All it would take would be to change a couple of links and suddenly they're part of the tree.

There's a (rather remote) chance that all that code is allocation-free (i.e., doesn't touch the gc()) and might therefore be callable from the signal handler. That's pretty scary, though, it might be safer to write its analog in C. Also, I think that code uses lookups, rather than instruction pointers, and from a performance perspective I don't think we want the signal handler doing lookups. So we'd probably need a rewrite anyway.

Assuming there isn't some technical hurdle, I suspect this idea would lead to really massive compression. It would also be a significant influx of new code into base and add some degree of complexity. I've been enjoying having the simple Profile.print() as a safe fallback to ProfileView's more sophisticated approach. That said, it has been a while since I discovered any problems with ProfileView's parsing algorithm, so maybe it's getting to the point where it would be reasonably safe to rely on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
system:windows Affects only Windows
Projects
None yet
Development

No branches or pull requests

4 participants