Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: FuncForPC/FileLine/Caller/Callers api interferes with implementation of better inlining #11432

Closed
dr2chase opened this issue Jun 26, 2015 · 21 comments
Milestone

Comments

@dr2chase
Copy link
Contributor

I'm fishing for comments. The current API gives away too much from the POV of a compiler-writer.

And I need to be clear on one point -- if we don't resolve this, there's well-understood optimizations that Go compilers won't do, and these are sufficiently important optimizations that they affect the way people write programs.


Current (1.<=5) Go only inlines when it can inline entire call trees all the way to the leaves; it cannot, for example, inline a simple function Small that wraps a much larger function Huge, because Huge is not eligible for inlining and thus all callers of Huge are not eligible for inlining. The current total-inlining policies mean that Go programs never observe the call sites at which the inlining occurs, and thus are never presented with the dilemma that there can be a stack of "callers" corresponding to a single return PC.

If we allow more flexible inlining, and if A calls B, B calls runtime.Callers(1, pcslice) and B is inlined into A, the "return pc" for Callers (pcslice[0]) will be an address within A (according to nm) and there will be no return PC identifying B or the file of B or the line number within B.

And unfortunately, this return PC = single caller identity is exposed in an interface -- a "return pc" is a uintptr. We could in theory hack on that integer to embed inline depth information in some unused bits, but in practice that is likely to break some code.

So, if we intend to improve inlining in the future, we need to do something about this. The options I see (and I may be myopic, other options are welcome) include:

(a) simply ignore intermediate callers. In the example, the caller of Callers will be A, and any mention of B is lost. It's a systems programming language, optimizers do this stuff all the time in other languages and you love it when they do, put on your grown-up pants and deal with it.

pros: easy.
And it will prepare people for tail-call elimination.
And it will discourage them from thinking that they can use this for implementing Java-style caller-sensitive security, which is slow, yet difficult to reason about and hard to maintain. Seriously, this has been a source of numerous security holes in Java, in JDK 8 they simplified it as much as possible to reduce their risk (see e.g. http://openjdk.java.net/jeps/176 and https://bugs.openjdk.java.net/browse/JDK-8046166 ).

cons: people might already critically rely on exact stack traces.

(b) hack on the encoding of the return PC to embed an index. I'd propose to grab the upper 2-3 bits on an Intel box or the lower 2 bits on a RISC architecture (this imposes an obvious restriction on inline depth, and I assume that I can grab these bits even on a 32-bit machine), and have the counter indicate the inline depth, thus the 00 case will match what is seen in the nm output (in the above example, "A") and the 01 case corresponds to B. The symbol table information for "A" will need to provide the inlining depth at this site. In the example above, if it is IA32 and the raw RPC is 0x01234567, then the caller of Callers (B) is return PC 0x41234567 and B's caller ("A") is return PC 0x01234567. On a RISC, raw PC 0x03217654 is the same as the RPC for "A" and 0x03217655 is the RPC for "B" (called by A).

pros: no change to interface, we never promised that "returnpc" was good for anything, did we?

cons: people thought we promised "returnpc" was a real live returnpc, and acted accordingly.
Fails to adequately discourage Java-style caller-sensitive security.
Also fails to prepare people for tail-call elimination.

(c) = (a) + extend the interface with a more generalized notion of program counter. Most old code will continue to sort-of work without change and no risk of gigantic surprise (e.g., illegal address dereference from believing that an option-b PC was a real address), new code will use the improved interface.

Note that there's nothing novel about abstracting on the "return PC" to get something that allows you to talk precisely about backtraces in the presence of inlining -- Java implementations have been doing this for more than 15 years.

@dr2chase dr2chase added this to the Go1.6 milestone Jun 26, 2015
@randall77
Copy link
Contributor

On Fri, Jun 26, 2015 at 12:55 PM, dr2chase [email protected] wrote:

I'm fishing for comments. The current API gives away too much from the POV
of a compiler-writer.
And I need to be clear on one point -- if we don't resolve this, there's
well-understood optimizations that Go compilers won't do, and these are
sufficiently important optimizations that they affect the way people write
programs.

Current (1.<=5) Go only inlines when it can inline entire call trees all
the way to the leaves; it cannot, for example, inline a simple function
Small that wraps a much larger function Huge, because Huge is not eligible
for inlining and thus all callers of Huge are not eligible for inlining.
The current total-inlining policies means that go programs never observe
the call sites at which the inlining occurs, and thus are never presented
with the dilemma that there can be a stack of "callers" corresponding to a
single return PC.

If we allow more flexible inlining, and if A calls B, B calls
runtime.Callers(1, pcslice) and B is inlined into A, the "return pc" for
Callers (pcslice[0]) will be an address within A (according to nm) and
there will be no return PC identifying B or the file of B or the line
number within B.

And unfortunately, this return PC = single caller identity is exposed in
an interface -- a "return pc" is a uintptr. We could in theory hack on that
integer to embed inline depth information in some unused bits, but in
practice that is likely to break some code.

So, if we intend to improve inlining in the future, we need to do
something about this. The options I see (and I may be myopic, other options
are welcome) include:

(a) simply ignore intermediate callers. In the example, the caller of
Callers will be A, and any mention of B is lost. It's a systems programming
language, optimizers do this stuff all the time in other languages and you
love it when they do, put on your grown-up pants and deal with it.

pros: easy.
And it will prepare people for tail-call elimination.
And it will discourage them from thinking that they can use this for
implementing Java-style caller-sensitive security, which is slow, yet
difficult to reason about and hard to maintain. Seriously, this has been a
source of numerous security holes in Java, in JDK 8 they simplified it as
much as possible to reduce their risk (see e.g.
http://openjdk.java.net/jeps/176 and
https://bugs.openjdk.java.net/browse/JDK-8046166 ).

cons: people might already critically rely on exact stack traces.

I'd rather not do this. Not because people might rely programmatically on
exact stack traces, but because stack traces with holes in them are hard to
understand. I care less about the actual PCs that show up, but somehow
text backtraces should make sense in the original (uninlined) program if at
all possible.

(b) hack on the encoding of the return PC to embed an index. I'd propose
to grab the upper 2-3 bits on an Intel box or the lower 2 bits on a RISC
architecture (this imposes an obvious restriction on inline depth, and I
assume that I can grab these bits even on a 32-bit machine), and have the
counter indicate the inline depth, thus the 00 case will match what is seen
in the nm output (in the above example, "A") and the 01 case corresponds to
B. The symbol table information for "A" will need to provide the inlining
depth at this site. In the example above, if it is IA32 and the raw RPC is
0x01234567, then the caller of Callers (B) is return PC 0x41234567 and B's
caller ("A") is return PC 0x01234567. On a RISC, raw PC 0x03217654 is the
same as the RPC for "A" and 0x03217655 is the RPC for "B" (called by A).

pros: no change to interface, we never promised that "returnpc" was good
for anything, did we?

We did promise it would be a valid argument to FuncForPC/FileLine.

There's no need to steal bits - we can assign arbitrary PCs to encode
inlined bodies. At one (virtual address) byte per PC, it shouldn't be too
expensive.

cons: people thought we promised "returnpc" was a real live returnpc, and
acted accordingly.
Fails to adequately discourage Java-style caller-sensitive security.
Also fails to prepare people for tail-call elimination.

(c) = (a) + extend the interface with a more generalized notion of program
counter. Most old code will continue to sort-of work without change and no
risk of gigantic surprise (e.g., illegal address dereference from believing
that an option-b PC was a real address), new code will use the improved
interface.

Note that there's nothing novel about abstracting on the "return PC" to
get something that allows you to talk precisely about backtraces in the
presence of inlining -- Java implementations have been doing this for more
than 15 years.

I think this is what I was describing above. We should do this one.


Reply to this email directly or view it on GitHub
#11432.

@ianlancetaylor
Copy link
Member

I struggle with this as well in gccgo. gccgo does of course implement arbitrary inlining (but currently does not do tail calls). Gccgo does full correct unwinding of inlined functions, using the debug info to track inline sites. However, this only works with runtime.Caller. If you use runtime.Callers, and then pass the resulting PC values to FuncForPC, you don't get a proper stack. At an inlining point, you get the same file/line multiple times.

I agree with Keith that the best approach is to encode inline positions in the PC value, somehow.

@dr2chase
Copy link
Contributor Author

I'm happy with encoding into "PC" values. The 3 options that quickly come to mind are:

  1. take existing "return PC" and hack into known-unused bits (high order on x86, low order on PPC/MIPS/ARM/SPARC). The advantage of this is that if the "bogus" PCs are filtered out of the stack trace, the ones that remain are the ones that you would see in a machine-level debugger, and also correspond to what you'd see in nm (e.g., A inlines B inlines C, the "good" PC belongs to A, and that's what nm shows).
  2. repurpose other PCs within the image, perhaps correspond to rough inline boundaries (never mind what optimization might do). All PCs are "real", only the innermost one corresponds to a real call site.
    We might even leave marker instructions (nops) in the stream. The advantage is all PCs are real, the disadvantage is that the name of the one that looks like a call site won't match the addresses you see in nm.
  3. make up something completely different that is just designed to get you the information that we promised, that does not correspond to nm at all. Advantage here is that there's no risk of subtle confusion -- these things won't look like PCs at all, and we won't run into any weird corner cases based on attempts to fake it. (This is what we did years ago in a static Java-bytecodes compiler, but we had no installed base to confuse, and sadly, never did).

I think I have an anti-preference against 2. That seems most-confusing and also most likely to lead to non-intuitive hair in things like dead code elimination and code motion.

Where do we stand on tail-call elimination? Are we more nearly "this is an important optimization, users must deal with it" or "stack traces are really important, think about the time added to debugging/diagnosis when information is lost"? There have been proposals/hacks over the years for figuring out ways to deal with TCE, but none of them come close to the (cache, stack size, simplicity) efficiency of the real thing. Maybe we can think of new hacks???

@ianlancetaylor
Copy link
Member

With regard to tail call optimization, one idea that has been discussed in the past (e.g., https://groups.google.com/d/msg/golang-nuts/JZZrFjFKkuM/MV9sSJsFlFEJ) is to add an explicit statement to the language to implement tail calls ("become f()"). Only for Go 2, though.

I think that in the absence of some sort of explicit indication that a tail call is acceptable, the compiler should not implement them. That is, I'm on the side of "stack traces are really important."

That said, let me also mention http://www.dwarfstd.org/ShowIssue.php?issue=100909.2 which is an approach used by GCC to permit tail calls to be reconstructed from the debug info. So one option is to only make a tail call when the call sequence can be reconstructed.

(As far as fake PC values goes I think I would tentatively vote for your option 3, in the sense of use real PC values for simple addresses and completely different values for inlined addresses, but I have no strong feelings about it.)

@josharian
Copy link
Contributor

A few thoughts:

  • Having real PC values as often as possible is valuable. Within the last fortnight, Austin asked for nm output to correlate to a stack trace to help debug a runtime bug.
  • Relatedly, if we use invented PCs, it would be nice if they were roughly comprehensible given only the binary and appropriate tooling.
  • Unsurprisingly given the above, I'm also on the side of "stack traces are really important".
  • testing: add t.Helper to make file:line results more useful #4899 might be an interesting use case to look at in this connection. It is entangled with testing API design, but entangled in relevant way, I think.
  • I'm disinclined to shove stuff into unused bits, as it doesn't fail as reliably as you would want when you miss a spot--and you always miss a spot.

@dr2chase
Copy link
Contributor Author

dr2chase commented Oct 3, 2015

Suppose for the return PCs we try the following: if A inlines B inlines C, the single PC in the "real" stack trace will appear within A. Annotate that PC appropriately to indicate the inlining patterns, and for the two synthesized PCs, choose the appropriate in site in non-inlined B where C is called, and the appropriate site in non-inlined C where the backtrace event occurred.

Also write the api and implementation that we wish we had started with, and in the old code recommend the new api and note that sometimes the stack of "PCs" contains white lies.

An improvement?

@ianlancetaylor
Copy link
Member

Your PC choices sound good but they sound hard to implement. If A, B, and C are all in different packages, it's not going to be easy for the compilation of A to know the PC values to use to represent the inlining of B and C. If we can implement them with no extra cost, then, yes, sounds great.

@dr2chase
Copy link
Contributor Author

dr2chase commented Oct 7, 2015

The difficulty in my example is "B" -- if A inlines B and C, then it's likely that B inlines C, and thus there is not necessarily (unless we extend the fake) any good place to say that "B calls C". It would also lead to a proliferation of symbols (most likely). Russ seemed to think (I hope I am not putting words in his mouth) that it would be perhaps okay to just drop PCs from the old interface, as long as we provide a new way to obtain all the funcs/files/lines that apply at a given call site and use that for usual-case user-visible stack traces. I.e., it's not entirely clear that the customers for slices of caller-PC have much use for fake PCs anyhow.

@ChrisHines
Copy link
Contributor

As the author/maintainer of logging packages (github.com/go-kit/kit/log and gopkg.in/inconshreveable/log15.v2) and a standalone package that consumes runtime.Callers (gopkg.in/stack.v1) I would like to provide my perspective on this and related issues.

For my use cases I don't care about the actual PC values. I care that I can use them with runtime.FuncForPC and get accurate file/line-number/function-name information. I also want my code to report on call-sites rather than return-sites. It is rather cumbersome to deal with the special cases documented by runtime.Callers to convert the return-site PCs into call-site PCs. As I said in the mailing list recently, if a new API comes out of this discussion it would be nice if it could simplify the logging/debugging use case.

Dropping inlined call-sites would pose problems to the libraries I build, putting me on the side of "stack traces are really important". Furthermore, the opposite problem—extra entries for <generated> functions that appear in stack traces, but not in the code—are also a source of confusion that add little or no value to my use cases. For an example of code that was changed specifically to deal with that issue, see: go-kit/kit#106, and in particular this comment describing the nuances of what's going on. I wonder if a mechanism to provide stack trace data in the face of inlined functions could also hide the <generated> functions from stack traces as well?

@dr2chase
Copy link
Contributor Author

dr2chase commented Oct 7, 2015

Hiding generated functions from stack traces is not too far off of #4899 .
I think stack traces are really important, but it's not clear to me if we're better off inserting fake PCs into the existing interface, or making a new interface that allows us to preserve the illusion of no-inlining. I don't have enough experience with Go or the users of this API to say what's best.

@ianlancetaylor
Copy link
Member

As far as runtime.Callers goes, we should probably think in terms of a new interface. Though the transition will be slow. Ideally runtime.Caller will consider inlining.

Of course the most important thing is that stack dumps consider inlining.

@rsc
Copy link
Contributor

rsc commented Nov 5, 2015

We're not going to eliminate tail calls. Let's just get that out of the way.

@rsc rsc changed the title runtime: FuncForPC/FileLine/Caller/Callers api interferes with implementation of better inlining (and tail-call elimination) runtime: FuncForPC/FileLine/Caller/Callers api interferes with implementation of better inlining Nov 5, 2015
@rsc rsc modified the milestones: Unplanned, Go1.6 Nov 5, 2015
@rsc
Copy link
Contributor

rsc commented Nov 5, 2015

Let's also take "fake PCs" off the table. That will introduce significant complexity and confusion.

New API is needed, but probably not much.

It's true that a single PC may in general correspond to a tiny call stack of func/file/line entries. Assume this:

type Frame struct {
    Func string
    File string
    Line int
}

type Frames struct { ... opaque ... }

func (*Frames) Len() int
func (*Frames) Frame(int) Frame // 0 is inner-most frame

Then:

  1. FuncForPC can be left alone: each byte of code does belong to just one function.
  2. (*Func).FileLine(pc) can continue to return the top-most (outer-most) file/line in the call stack for pc.
  3. (*Func).Frames(pc) can return the more complete Frames information.
  4. Callers can be left alone: there really is a stack of program counters at run time, and Callers can continue to return them. The skip parameter counts frames, not PCs (today those are the same).
  5. Caller can count frames instead of PCs but otherwise is fine. This means that if you have a PC corresponding to two frames, Callers(i) and Callers(i+1) may both return that PC, just with different file, line information. It would be nice if it returned a function name too, but oh well.

People who report full stacks using repeated calls to Callers are already writing quadratic code, so it seems okay that it's imperfect in a different way now too. People who want to report the full call stack more carefully will use Callers to get some PCs and then call FuncForPC+Frames for each of those.

@ChrisHines
Copy link
Contributor

@rsc Have you perhaps written "Callers" in a few places where you meant "Caller"? Specifically in (5) and the last paragraph?

@ianlancetaylor
Copy link
Member

I'm looking at this in the context of also supporting stack tracebacks for C code called by cgo. I would like to support those tracebacks in Callers and get good information for them in runtime/pprof and friends. Currently it's hard to correctly turn the Callers information into a good stack traceback, as can be seen by the fact that the implementation in runtime/pprof is subtly incorrect (it checks for runtime.gopanic where it should check for runtime.sigpanic).

Russ's Frames suggestion makes sense for Go, but it's hard to implement for C code, and it doesn't address the complexities of when and how much to adjust the PC value.

I propose that rather than referring to sigpanic in Callers, we instead say

// To easily look up file/line information for the call sequence, use
// CallersIterate.

and then we implement

// CallersIterator may be used to get function/file/line information
// for a slice of PC values returned by Callers.
type CallersIterator struct ...

// CallersIterate takes a slice of PC returned by Callers and prepares
// to return function/file/line information.
// Do not change the slice until you are done with the CallersIterator.
func CallersIterate(callers []uintptr) *CallersIterator {}

// Next returns function/function entry point/file/line information
// for the next caller.
// If more is false, there are no more callers (the other results are valid).
// This method may return multiple times for a single PC value in the
// original slice, to represent inlined functions.
// If no information is available, function and/or file may be empty,
// and line may be zero.
func (ci* CallersIterator) Next() (pc uintptr, function string, entry uintptr, file string, line int, more bool) {}

@ChrisHines
Copy link
Contributor

@ianlancetaylor I like your proposed new API for CallersIterate. I have two questions:

  1. Would the new API also skip <generated> functions as I described in my comment above?
  2. Would CallersIterator.Next return line numbers for call-sites or return-sites?

@ianlancetaylor
Copy link
Member

  1. I see the handling of generated functions as an orthogonal issue. Personally I think it's best to include them at this level and filter them out at a higher level. But we can decide that separately.
  2. Call-sites.

@dr2chase
Copy link
Contributor Author

Should the per-"call" information include some information about the function, such as "was generated", "was inlined"? There is certainly a point in the compilation where we know about that, so we could just choose not to forget it.

@ChrisHines
Copy link
Contributor

Combining these ideas, the signature of CallersIterator.Next could look like:

func (ci* CallersIterator) Next() (f Frame, more bool)

type Frame struct {
    Func      string
    File      string
    PC        uintptr
    Entry     uintptr
    Line      int
    Generated bool // maybe
    Inlined   bool // maybe
}

Returning a struct type allows future additions and seems less cumbersome to use for API clients.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/19869 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 25, 2016
This indirectly implements a small fix for runtime/pprof: it used to
look for runtime.gopanic when it should have been looking for
runtime.sigpanic.

Update #11432.

Change-Id: I5e3f5203b2ac5463efd85adf6636e64174aacb1d
Reviewed-on: https://go-review.googlesource.com/19869
Run-TryBot: Ian Lance Taylor <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: David Chase <[email protected]>
@rsc
Copy link
Contributor

rsc commented May 18, 2016

I think this is fixed. Better inlining needs to keep runtime.Frames working, and code needs to use it, but it's all there.

@rsc rsc closed this as completed May 18, 2016
@golang golang locked and limited conversation to collaborators May 18, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants