-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support APM Server intake API version 2 #465
Conversation
This PR should technically be ready for testing with the POC. The checklist above is meant for making it production ready |
would be great to get some benchmarks to see if this implementation is better than the current intake protocol, for example from a memory usage perspective. |
lib/agent.js
Outdated
agent._apmServer.flush(cb) | ||
}) | ||
} else { | ||
process.nextTick(cb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you log something here so we can tell when someone tries to send something without _apmServer
set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
lib/agent.js
Outdated
if (this._apmServer) { | ||
this._apmServer.flush(cb) | ||
} else { | ||
process.nextTick(cb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, log something so we know when _apmServer
is not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
lib/instrumentation/transaction.js
Outdated
@@ -50,6 +49,15 @@ function Transaction (agent, name, type) { | |||
} | |||
}) | |||
|
|||
Object.defineProperty(this, 'timestamp', { | |||
get: function () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use shorthand here. get () {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -112,17 +121,16 @@ Transaction.prototype.buildSpan = function () { | |||
} | |||
|
|||
if (this.ended) { | |||
this._agent.logger.debug('transaction already ended - cannot build new span %o', {id: this.id}) | |||
this._agent.logger.debug('transaction already ended - cannot build new span %o', {id: this.id}) // TODO: Should this be supported in the new API? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When everything is fully streaming, I don't think there's any reason to prevent further spans. It's only useful without streaming to ensure the transaction doesn't stay open forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I'm hesitant with this is that we might have a transaction that kicks off some form of background job that gets contextually attached to the transaction that started it. Any spans that's created as a result of this would get associated with the transaction. This will make the span waterfall almost useless as it can suddenly cover a period of days or weeks. Am I being too paranoid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd be user mode queuing though. We won't actually form that link unless we explicitly bind the callback. The logical continuation to the next job would be out-of-band from the request, it'd be triggered by the completion of another job in the queue.
I could see maybe time-based jobs naively using setTimeout
for everything, and that maybe being an issue. But that's just not great design and we probably should be drawing attention to it in some way. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah setTimeout
and friends are probably the most common way to trigger this.
Question is however if the spans happening after the transaction ends are relevant to the user at all?
@elastic/apm-agent-devs @roncohen @makwarth In the new intake API, we get the ability to record spans that start after the parent transaction ends. Is this something we want to do or should those just be ignored? I.e. are they adding any value to the user or are they just noise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I.e. are they adding any value to the user or are they just noise?
I think that depends on how the application is making use of that. I'd rather default to not ignore than to ignore these spans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't ignore them by default, would it be a good idea to have a config option to ignore them?
It could even have 3 settings:
- Ignore spans started after the transaction ends
- Ignore spans started before the transaction ends, but ended after
- Allow all spans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand you're worried about the case where people would set up a recurring periodic job from inside a transaction, because it would mean spans continue to happen in an endless stream. Two ideas:
- We could experiment in making the assumption that there are two categories of delayed execution
process.nextTick()/setImmediate()
andsetTimeout()/setInterval()
. The first one could indicate that the developer intended this to be part of the transaction, while the second category will start a new transaction. - We could set a numerical limit to the number of spans that we will consider, after a transaction has ended - or a wall clock timeout which would be the limit to how much time we'll continue to "monitor" a transaction after is has ended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also include a special escape hatch that specifically calls to setTimeout
or setInterval
after the transaction ends would not continue the context. That way, anything else in the code path will continue, but anything that could potentially be significantly delayed will not.
To make the scope of this PR as small as possible, I've made it go to a different branch than |
To help this PR getting merge, I've created the meta issue #471 and moved all the non-essential todo items from this PR to that |
jenkins, run tav tests please |
this._apmServer.flush(cb) | ||
} else { | ||
this.logger.warn(new Error('cannot flush agent before it is started')) | ||
process.nextTick(cb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worthwhile for the flush callback be able to receive that error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it out as this only occurs if the agent isn't started. So from the outside I want the user to use the module in the same way no matter if the agent is started or not. And normally errors coming from the _apmServer
is just logged, so I think it's best to maintain that behavior here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I just realized that the callback given to captureError
will pass a similar error along in the callback: https://github.com/elastic/apm-agent-nodejs/pull/465/files#diff-7403da6ce2244c65d641fdfcc095be93R331
But I think maybe that should be avoided in captureError
. If so the captureError
callback will never be called with an error either.
lib/config.js
Outdated
@@ -28,7 +28,9 @@ var DEFAULTS = { | |||
verifyServerCert: true, | |||
active: true, | |||
logLevel: 'info', | |||
hostname: os.hostname(), | |||
hostname: os.hostname(), // TODO: Should we just let the http client default to this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defaulting in the http client sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
lib/instrumentation/index.js
Outdated
if (!payload) return agent.logger.debug('transaction ignored by filter %o', {id: transaction.id}) | ||
truncate.transaction(payload) | ||
agent.logger.debug('sending transaction %o', {id: transaction.id}) | ||
if (agent._apmServer) agent._apmServer.sendTransaction(payload) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way for agent._apmServer
to be null when this._started
is true? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No you're right - if this._started === true
, then we'll also have an agent._apmServer
. Would that be a better check? The current check is guaranteed to always work, whereas the conditions of the other might change in the future right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made that comment, because this check is already in an outer if
block checking this._started
, so it seemed redundant. Github cuts off the preview just after that line though. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes... I'll remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
architecture: process.arch, | ||
platform: process.platform | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like process, system and runtime stuff no longer exists? I don't see it in the http client PR either. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can see it's a bit confusing given the current state of the all the in-progress dependencies.
As you can see here, this PR depends on a custom branch of the http client called v2-1
that currently lives on my fork: https://github.com/elastic/apm-agent-nodejs/pull/465/files#diff-b9cfc7f2cdf78a7f4b91a753d10865a2R74
In that branch the metadata responsibilities have been moved to the http client: https://github.com/watson/apm-nodejs-http-client/blob/4f02d7eda173b738a309fdca2e3464a1bbae5652/index.js#L293-L332
But there's no PR for this branch yet as I don't want to make the current http client PR too complicated, so once the the current PR is merged, I'll open a new one for the v2-1
branch 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just merged the first http client PR and created the next based on my v2-1
branch: elastic/apm-nodejs-http-client#6 - soon it should hopefully be less confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Gotcha. I was trying to match up the two PRs to make sure the functionality lined up between the two. 😅
test/_apm_server.js
Outdated
client.request('transactions', {}, body, () => {}) | ||
req.pipe(zlib.createGunzip()).pipe(ndjson.parse()).on('data', function (data) { | ||
if (req.method !== 'POST') throw new Error(`Unexpected HTTP method: ${req.method}`) | ||
if (req.url !== '/v2/intake') throw new Error(`Unexpected HTTP url: ${req.url}`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do the req field validation outside the pipe sequence and data handler. Also, the assert
module might be cleaner here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
t.equal(body.errors[0].exception.message, 'with callback') | ||
.on('data-error', function (data) { | ||
t.equal(data.exception.message, 'with callback') | ||
t.end() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does tape work correctly when you explicitly t.end()
while using a t.plan(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all it does when reaching the t.end
is just fail if the number of assertions doesn't match what's planned. This is a way to validate our expectations of the async execution order. I.e. when we reach this point, we should have made the planned number of assertions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, cool. 👍
test/instrumentation/_agent.js
Outdated
@@ -54,7 +49,7 @@ module.exports = function mockAgent (cb) { | |||
sharedInstrumentation._agent = agent | |||
agent._instrumentation = sharedInstrumentation | |||
agent._instrumentation.currentTransaction = null | |||
agent._instrumentation._queue._clear() | |||
agent._apmServer = mockClient(expected, cb || noop) // TODO: Expected will not work here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment still accurate? What's the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think you're right. I think this line is completely irrelevant and can just be removed. I'll update it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
test/instrumentation/span.js
Outdated
t.ok(payload.start > 0) | ||
t.ok(payload.duration > 0) | ||
assert.stacktrace(t, 'myTest3', __filename, payload.stacktrace, agent) | ||
t.deepEqual(Object.keys(payload), ['transactionId', 'timestamp', 'name', 'type', 'start', 'duration']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't think we need to validate the contents of the object here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
test/integration/no-sampling.js
Outdated
t.end() | ||
process.exit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this process.exit()
need to be here? I'd rather not have these hiding in the tests as they could cause confusion when adding more tests to the file in the future, if you don't notice them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Finally managed to get a complete review it. Looks good, other than a couple more comments. |
Approved for v2 branch. Any remaining issues can be fixed in another PR, before merging v2 to master. |
Closes #356
Checklist