proposal: JS tracing APIs #48

jasongin · 2017-01-17T19:08:23Z

@nodejs/diagnostics and anyone else interested, please review this proposal draft.

AndreasMadsen · 2017-01-17T20:04:38Z

Is this a follow up on "trace events" as introduced in nodejs/node#9304?

I'm generally against using the term tracing formally, it can mean so many things. For example we already have:

LTTng (Linux), SystemTap (Linux), DTrace (OSX), ETW (Windows)
async_hooks
trace_events

all of which does some kind of tracing. If this API integrates with trace_events I think we should just call the module trace_events.

jasongin · 2017-01-17T20:09:28Z

Yes, the Background section of the proposal doc links to that "trace events" PR.

As also noted in the doc, the module name "tracing" was already reserved for this purpose long ago. I don't feel strongly about it personally, certainly we can go with whatever name is agreed by consensus.

Qard · 2017-01-17T19:50:18Z

00X-tracing.md

+const scopeId = tracing.emit(category, { eventType: 'begin', name: 'myTask' });
+tracing.emit(category, { eventType: 'end', name: 'myTask', id: scopeId });
+
+tracing.emit(category, { eventType: 'count', name: 'myCounter', value: 3 });


This emit function is different enough from EventEmitter.emit that I'd prefer it was named differently so as to not confuse new users.

Perhaps we should support only the alternate API? All the conditional validation needed for this unified function is a bit worrying.

I agree, this should not be called emit at all. Maybe record?

Qard · 2017-01-17T19:53:03Z

00X-tracing.md

+  emit(
+    category: string | string[],
+    name: string,
+    args?: { [name: string]: any } | () => { [name: string]: any }): boolean;


Maybe we should call this instant rather than emit. Multi-signature methods risk unforeseen problems, as the buffer issue has proven.

In addition, the polymorphism of multi-signature methods has implications for optimization, which raises performance concerns. Historically, this has led to problems for APM and other diagnostic solutions that are intended to be embedded in production applications. In general, the simpler / closer to monomorphic the APIs can be made, the better.

Agreed, this method would be better named instant. (Or perhaps timeStamp?)

+1 for instant because it matches the terms used in the trace format descriptions. It makes it easier to find it elsewhere.

Renamed to instant in a new commit.

Qard · 2017-01-17T20:03:09Z

00X-tracing.md

+
+To support listeners, the `Tracing` class (from the previous section) extends a new
+`CategoryEventEmitter` base class. That class is similar to `EventEmitter`, but does not extend
+`EventEmitter` because it has different behavior related to the multi-category support.


Again, similar yet different enough from a typical event emitter that I worry it'd be confusing to new users.

I considered trying to make it more different from EventEmitter, but it felt silly to use different names for operations that are really the same as the normal EventEmitter methods.

I think the differences probably should not really be a problem. You can in fact use a CategoryEventEmitter (which the Tracing module extends) just like a normal EventEmitter if you don't care about the multi-category support.

Would it help to think of CategoryEventEmitter as a subclass of EventEmitter? (I considered literally extending the class, but it didn't provide any benefit since all the methods needed to be overridden to use different data structures from the base class.)

Qard · 2017-01-17T20:03:35Z

00X-tracing.md

+tracing.emit(['example2', 'example3'], 'info', 'Hello tracing!');
+```
+
+### 3.2 Listern API specification


Fixed in a new commit.

Qard · 2017-01-17T20:07:20Z

00X-tracing.md

+
+The existing Node.js implementation of those Console tracing methods simply logs tracing events to
+the console. This proposal extends those methods with a new optional category parameter to enable
+emitting categorized tracing events in a way that is backward-compatible with existing console APIs.


I don't like the idea of deviating from the behaviour of the console API spec. If they decide to add their own extra argument in the future, suddenly we have problems.

Understood; I haven't entirely convinced myself that it is a good idea either. That's why I made it an optional part of the proposal. It might be nice for some scenarios, so I'd like to at least get some feedback about the idea of integrating with the console tracing APIs.

I'm 👎 in adding any complexity to console.

othiym23 · 2017-01-18T03:29:27Z

While I agree that treating those three APIs as separate concerns that can be implemented separately, I think it's helpful to treat emitting and listening of trace data as duals for a single abstraction.

As someone who, like @Qard, has worked on APM products in the past, the challenge for me was that there have been a variety of strategies for emitting trace events, but no reliable way to observe that data across platforms and deployment environments (in particular, PaaS and containerization). In practical terms, it feels like it's a lot easier to hit performance and overhead targets for the emitting rather than the listening side, and coming up with an efficient in-process interface for capturing that data would be the biggest win for a whole class of use cases.

trevnorris · 2017-01-19T23:53:41Z

I'd like to see documentation about functionality this introduces that would be impossible to implement on top of async hooks (the async hooks EP is up to date).

jasongin · 2017-01-23T20:28:09Z

I will update the doc to explain how this proposal relates to async hooks. I expect tracing instrumentation of node core can leverage async hooks for tracing of async operations, so in that regard the features are complementary.

Async hooks provide a way to hook into the beginning and end of async operations built on libuv primitives. Of course those async operations are a key subset of performance-related events.

This proposal is about a way to capture and/or listen to all kinds of performance-related events, including non-async begin/end, point-in-time, and counters. Additionally and most importantly it provides a very high-performance way to write the events out to a trace log file.

targos · 2017-01-27T09:58:32Z

Is this somehow related to the browser's User Timing API? See https://developer.mozilla.org/en-US/docs/Web/API/User_Timing_API

jasongin · 2017-01-27T21:08:57Z

@targos It is not directly related, but I'm aware of it. I think it would not be a good fit to implement the User Timing API directly in node, because it only has a subset of the functionality (does not support counters, categories, or any mechanism for turning off the tracing) and it does not have the right event-listener model.

If desired, it should be straightforward to create a polyfill for the User Timing API that would use this proposed Node.js tracing API underneath.

trevnorris · 2017-01-27T23:29:00Z

@jasongin I should have clarified. I specifically meant in regards to the JS API. I'm fine placing the trace event hooks in native source. I already have for some testing and they work great. My concern is the JS API. Even if the call does nothing but return immediately there's still enough overhead to be noticeable. Especially when the function call is cross module (V8 has a harder time optimizing those for some reason).

For async hooks to incur the least amount of overhead possible in similar situations I had to do some things I'm rather not proud of (see https://git.io/vDJq9 as an example). So I'd like to avoid placing anything directly in JS. I'd like to see if the JS trace events could be optional async hooks that are enabled when those specific trace events are requested.

jasongin · 2017-01-27T23:50:11Z

The main purpose of a JS API for tracing is to enable people to instrument their JavaScript code. While instrumenting internal async handles and other internal constructs may also provide useful data, that is only a piece of the performance picture and is not the focus of this proposal.

Even if the call does nothing but return immediately there's still enough overhead to be noticeable.

Only if your definition of "noticeable" is extremely generous. When tracing is completely off, any of the tracing event-emitter functions will literally be a no-op function, the equivalent of:

tracing.emit = function () {}

Even if the JIT cannot completely optimize-out the function call, it is still going to be extremely low-cost, I think small enough to not be noticeable in any reasonable sense.

Then of course when tracing gets enabled, either by turning on recording or adding a listener for any category, the emit functions will get replaced with functions that need to do a little bit of work, but are still fast enough not to have a significant impact on most applications.

trevnorris · 2017-01-28T00:25:30Z

I spent around 1.5 week figuring out the best way to optimize for calls to timers and nextTick in my async hooks PR. The reason it was so important to improve is because node doesn't code for the average performance case. We code it for the fast case. Which means measuring against a high volume scenario.

Take a simple http server and add hooks when resources are created and also add timing around each resource's callback. In a high volume scenario, let's say a proxy (and these numbers are small compared to what F5 pushes through node), and do 45k req/sec. Just tracking the mentioned locations you're going to add ~1.5 million calls/sec. That's ~1.5 million noops in JS that have now been added. Which aren't actually noops because V8 can't completely optimize them out. At that volume I'm not being "extremely generous". I'm being practical.

Looking at the code from the link I posted you'd see that all the emitBefore()/emitAfter() calls are conditionally called. Because simply calling the noop function cost too much when I was benchmarking. Short of this is, I didn't spend a month tweaking every corner of the async hooks API to make sure it had no noticeable affect on core's performance just so we could throw in another similar API that would.

jasongin · 2017-01-28T03:54:24Z

I appreciate the passion for performance. I am working on some benchmarks involving tracing in realistic scenarios. I will follow up when I have data ready to share.

naugtur · 2017-02-17T10:24:07Z

00X-tracing.md

+const category = 'example';
+tracing.enableRecording(category);
+
+tracing.emit(category, { eventType: 'instant', name: 'myEvent' });


could tracing expose constants to pass in as eventType instead of expecting particular strings?

mcollina

I'm waiting for this, but I think there are still some nits to be fixed in this EP.

mcollina · 2017-04-05T06:55:02Z

00X-tracing.md

+const scopeId = tracing.emit(category, { eventType: 'begin', name: 'myTask' });
+tracing.emit(category, { eventType: 'end', name: 'myTask', id: scopeId });
+
+tracing.emit(category, { eventType: 'count', name: 'myCounter', value: 3 });


I agree, this should not be called emit at all. Maybe record?

mcollina · 2017-04-05T06:56:10Z

00X-tracing.md

+   * @returns Nonzero event id (either provided or generated) if the event was emitted, or null if
+   *   the event was not emitted because tracing was not enabled for any of the event categories.
+   */
+  emit(category: string | string[], e: TracingEvent): number | null;


I think this is redundant, maybe we can try to avoid overloading as much as possible as it is harder to optimize, and this API should try to be as fast as possible.

mcollina · 2017-04-05T06:57:05Z

00X-tracing.md

+ * This interface has methods that are similar to EventEmitter, with an important difference:
+ * events may be emitted under multiple categories simultaneously instead of a single event name,
+ * and listeners may listen to multiple categories while still only being called once per event.
+ */


If it is similar but different, it should not be called EventEmitter, but something else.

mcollina · 2017-04-05T06:57:51Z

00X-tracing.md

+
+The existing Node.js implementation of those Console tracing methods simply logs tracing events to
+the console. This proposal extends those methods with a new optional category parameter to enable
+emitting categorized tracing events in a way that is backward-compatible with existing console APIs.


I'm 👎 in adding any complexity to console.

mcollina · 2017-04-05T07:01:48Z

00X-tracing.md

+which way may want to adapt to Node.js eventually. And there has been discussion of a more compact
+binary format. This is an area that needs some further development, however for performance reasons
+the tracing output probably would not be exposed to JavaScript anyway, beyond the listener APIs
+that are proposed here.


Some time should be taken to design a processing pipeline for these. I think some tools and documentation to explain how to process this data format. Without interactive and easy-to-understand visualisations, tracing data is extremely hard to understand.

I should be straightforward to plot this data into a flamegraph for example.

rvagg · 2017-06-21T18:18:23Z

https://github.com/brycebaril/transaction-tracer is a proposal that @brycebaril put forward a few years ago coming out the Tracing Summit held a few years ago. Might be good to connect to some of that work. @othiym23 and @trevnorris you were involved in that too weren't you? Was there any other documentation or notes that came out of that which might be useful?

othiym23 · 2017-06-21T19:12:45Z

I'm pretty sure @brycebaril's notes were the most complete; @groundwater acted as our facilitator, and may have recorded some of the outcomes, but I don't know how extensive his notes were, given that he was facilitating. @tjfontaine and @misterdjules were our hosts at the Joyent Vancouver offices and had extensive contributions to make around our efforts to come up with something that merged both APM and dynamic tracing along the lines of DTrace and ETW. I believe @Qard was also at that meeting as well, representing his APM-vendor employers of the time (sorry, Stephen, I'm blanking on what company that was temporarily).

I don't think @trevnorris made it to that particular tracing meeting, although he was certainly integral to the discussions that @groundwater, @tjfontaine, and I were having earlier that year. Trevor was at the error-handling "summit" convened by @dshaw earlier that year, at which @davepacheco was also present. As Trevor's involvement in those discussions eventually produced async_hooks and AsyncWrap, those APIs are far more concrete than any meeting notes are likely to be. ;)

I'll double-check my notebooks from the time, but my recollection is that I didn't keep detailed notes, in part because the discussion was largely focused on establishing the problem space more than solutions, in part because I'd moved on from APM at the time and was more just trying to act as a kind of expert witness.

misterdjules · 2017-06-21T19:43:48Z

FWIW, my notes from the meeting were:

General observations

People don't want to have to monkey patch node to implement tracing.
People seem to be more keen to keep the tracing facilities in node itself
rather than use platform tracing facilities like dtrace/ETW/{System,K}tap.

What some APM providers do/need

New Relic:

Problems they try to solve:
- help developers make their applications faster.
- Identify individual transactions that are slow.
- Don't generate very broad info (no big call graph), but more targetted
  info.
- As lightweight as possible. Needs to not have performance impact on the
  running app.
Solution they use:
- Uses AsyncListener polyfills or core implementation + continuous local
  storage to tie things together.

Strongloop:

Performance tracing: find performance problems and/or bugs (exceptions,
leaks of descriptors, etc.).
- Needs to know start and stop of async events.
Solution they use: monkey patch everything.
Other problems they try to solve: heap/cpu profiling.
Use or would like to use zones.
Mentioned being interested in other discrete events like GC or Node about to
exit/not being able to exit, etc.

Appneta:

Goal: time profiling of the whole async flow.
Mentioned needs to trace async calls in native modules.

Concurix:

Also instrument synchronous code.
Light monkey patching.
Needs to not have performance impact on the running app.

General comments:

Tools to allow native modules to emit tracing/performance events when they
perform async operations using plain libuv.

Conclusion of the "problem discussion"

Priorities:
1. Unavoidable requirements:
- Production ready.
- Do not monkey patch node.
- Do not alter behavior.
1. Async tracing.
2. Track async lifecycle.
3. native module API.

Implementation

Concerns/Questions:

Keep a global state of async trees/transactions but how do we know when we
can cleanup some of these long lasting state structures?

Observations:

Have a simpler solution that just bubbles up simple information and not
complete async trees. AL is too high level.

Proposed API by TJ/Jacob (for a simple solution):

Terminology:
- Namespace, provider, proble. Example:
  node:http:{client-request,client-response,etc.}
Advantages: similar to already defined probes.
API: global someProbe = createProbe('probeName', description of info that is
going to be passed). The description of data is needed to support platforms
providers like dtrace.
someProbe.enable()
someProbe.fire(function () { return [ data that is passed to the callback of the consumer]})
.list(): returns all the probes/namespaces/etc defined in the system.
.on(namespace, provider, probe, callback) with support for wildcards, etc.
Use ring buffers and other similar techniques on the consumer side to avoid
performance hit.
Implementation of user side in https://github.com/joyent/node-tracing .
Namespace is useful to make sure that there is no conflict between probes,
and can also allow some optimizations like filtering out what the consumer
doesn't care about.

Amendments by Jacob:

API: do not need to create the probe explicitely. This is done by the API
the first time a probe is emitted.
The concern with that (from TJ) is that the consumer of these events often
needs to know in advance about the probes that can be fired. Not knowing about
existing probes in advance can make using dtrace and other platform providers
more difficult (but also other consuming platforms), and as Forest pointed out
it's a valuable asset for node.

Concerns

This solution is not generic enough to provide a solution for the number one
priority: Async tracing of any function/callback. Answer to this concern: this
API can be used as a tool to provide information about more generic stuff like
handles creations, etc.
How to trace what happens in third party modules? Adding static probes to
third party modules doesn't seem to be realistic for some people (e.g
concurix).
How to attach the metadata that consumers are interested in? How to make
sure that consumers have enough information? Monkey patching is currently the
easiest way to do that.
Brute force solution for the problem of getting the metadata: emit all the
data of the context at the firing site with the event. Slightly better
proposition from Jacob: emit the associated data only if requested by the
consumer.
Idealistic solution: being able to inspect any context within v8 when the
consumer handles a probe.

Qard · 2017-06-21T20:24:45Z

The challenge that APM vendors faced with coming up with a generic API was that we all had slightly different needs for data collection. That's a big part of why I've been pursuing #56 as a way to apply instrumentation more abstractly.

othiym23 · 2017-06-21T21:09:41Z

Thanks for the detailed notes, @misterdjules, and (belated) apologies for tagging you into the discussion!

Trott · 2018-06-12T23:25:49Z

Closing, but by all means re-open or comment if that isn't right.

proposal: JS tracing APIs

9cd110e

jasongin mentioned this pull request Jan 17, 2017

Trace Events trace system in Node.js nodejs/diagnostics#53

Closed

1 task

Qard reviewed Jan 17, 2017

View reviewed changes

joshgav mentioned this pull request Jan 18, 2017

Diag WG Meeting - 2017-01-19 nodejs/diagnostics#79

Closed

Rename emit overload

81cf5a6

This was referenced Feb 6, 2017

[trace_event] implementing a listener system nodejs/diagnostics#81

Closed

[trace_event] tracking issue nodejs/diagnostics#84

Open

naugtur reviewed Feb 17, 2017

View reviewed changes

mcollina suggested changes Apr 5, 2017

View reviewed changes

matthewloring mentioned this pull request Jul 19, 2017

async-hooks: add trace events to AsyncWrap nodejs/node#14347

Closed

3 tasks

matthewloring mentioned this pull request Aug 12, 2017

perf_hooks: implementation of Performance Timing API nodejs/node#14680

Closed

4 tasks

mike-kaufman mentioned this pull request Dec 18, 2017

Eliminate Monkey Patching for Diagnostic Instrumentation nodejs/diagnostics#134

Closed

digitalinfinity mentioned this pull request Apr 12, 2018

[trace_event] adding trace points to core nodejs/diagnostics#82

Closed

2 tasks

Trott closed this Jun 12, 2018

proposal: JS tracing APIs #48

proposal: JS tracing APIs #48

Conversation

jasongin commented Jan 17, 2017

AndreasMadsen commented Jan 17, 2017 • edited Loading

jasongin commented Jan 17, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

othiym23 commented Jan 18, 2017

trevnorris commented Jan 19, 2017

jasongin commented Jan 23, 2017

targos commented Jan 27, 2017

jasongin commented Jan 27, 2017

trevnorris commented Jan 27, 2017

jasongin commented Jan 27, 2017 • edited Loading

trevnorris commented Jan 28, 2017

jasongin commented Jan 28, 2017

Choose a reason for hiding this comment

mcollina left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rvagg commented Jun 21, 2017

othiym23 commented Jun 21, 2017 • edited Loading

misterdjules commented Jun 21, 2017

General observations

What some APM providers do/need

New Relic:

Strongloop:

Appneta:

Concurix:

General comments:

Conclusion of the "problem discussion"

Implementation

Concerns/Questions:

Observations:

Proposed API by TJ/Jacob (for a simple solution):

Amendments by Jacob:

Concerns

Qard commented Jun 21, 2017

othiym23 commented Jun 21, 2017

Trott commented Jun 12, 2018

AndreasMadsen commented Jan 17, 2017 •

edited

Loading

jasongin commented Jan 27, 2017 •

edited

Loading

othiym23 commented Jun 21, 2017 •

edited

Loading