Telemetry: Add option for a separate trace for every job attempt/start #2942

lovre-nc · 2024-11-27T19:17:04Z

Is your feature request related to a problem? Please describe.
Hi, we've attempted to migrate from @appsignal/opentelemetry-instrumentation-bullmq to the bullmq-otel package.

It seems that a single trace is re-used for the whole lifespan of a job. This is a problem for us because we have a queue where individual jobs exist and repeat forever. We have one job per user and it runs every 5 minutes by delaying it at the end of the worker instead of letting it complete. That means that these jobs have thousands or millions of starts.

When using the official bullmq telemetry this seems to result in one endlessly long trace with thousands or millions of spans inside it.
This behavior may be useful for queues of one-off jobs, where it groups multiple attempts of the same job together, but it does not work at all for this kind of repeating job.

Describe the solution you'd like
Ideally this behavior would be configurable per queue.
We'd like to have one trace per job attempt/start instead of one trace per job.

Describe alternatives you've considered
Staying with @appsignal/opentelemetry-instrumentation-bullmq, but we have some issues with it so that's not ideal.

Additional context
None

manast · 2024-11-29T09:21:22Z

In your use case, do you mean that you have always one active job per user?

lovre-nc · 2024-11-29T13:55:47Z

Basically a job in this queue exists as long as the user exists and it runs every 5 minutes and is never moved to completed.

An illustrated example of our worker that should help explain:

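import { Worker, Job, DelayedError } from "bullmq";

// messagesService and connection are defined elsewhere in the application.
const worker = new Worker(
  "pollMessagesForUser",
  async (job: Job, token?: string) => {
    // One job per user: the job id doubles as the user id.
    const userId = job.id;

    await messagesService.fetchAndStoreUserMessages(userId);

    // Reschedule this same job 5 minutes from now instead of letting it complete.
    await job.moveToDelayed(Date.now() + 300_000, token);
    // Tell the worker the job was moved to delayed rather than completed or failed.
    throw new DelayedError();
  },
  { connection }
);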

In the meantime I was able to work around the issue by patching bullmq using pnpm patch:

diff --git a/dist/cjs/utils.js b/dist/cjs/utils.js
index 6caa7bae682a971ffac6f585aca1b54b9aaa3267..b40cd744c05ca81f8eecdca93b7086fed3c1568a 100644
--- a/dist/cjs/utils.js
+++ b/dist/cjs/utils.js
@@ -238,7 +238,7 @@ async function trace(telemetry, spanKind, queueName, operation, destination, cal
         const { tracer, contextManager } = telemetry;
         const currentContext = contextManager.active();
         let parentContext;
-        if (srcPropagationMetadata) {
+        if (false) {
             parentContext = contextManager.fromMetadata(currentContext, srcPropagationMetadata);
         }
         const spanName = destination ? `${operation} ${destination}` : operation;
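
The decision that the patched `if` controls can be modelled in isolation. The sketch below is not BullMQ's code: `pickParent`, `SpanContextLike`, `PropagationMetadata` and the `omitContext` flag are illustrative names, and the metadata is assumed to be a JSON string carrying a trace id and span id. It only shows the two outcomes: continue the trace stored with the job, or start a fresh root trace for this attempt.

```typescript
// Hypothetical, simplified stand-in for a span's context.
interface SpanContextLike {
  traceId: string;
  parentSpanId?: string;
}

// Assumption: propagation metadata is a JSON string of this shape.
interface PropagationMetadata {
  traceId: string;
  spanId: string;
}

// Generate a random 32-character hex trace id.
function newTraceId(): string {
  let id = "";
  for (let i = 0; i < 32; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// Continue the stored trace unless the caller asked to omit the context.
function pickParent(
  srcPropagationMetadata: string | undefined,
  omitContext: boolean,
): SpanContextLike {
  if (srcPropagationMetadata && !omitContext) {
    const parent = JSON.parse(srcPropagationMetadata) as PropagationMetadata;
    return { traceId: parent.traceId, parentSpanId: parent.spanId };
  }
  // No usable parent: this attempt becomes the root of a new trace.
  return { traceId: newTraceId() };
}
```

With `omitContext` set to `true`, every call returns a new trace id, which is the "one trace per job attempt/start" behavior requested above. The `if (false)` patch is equivalent to always passing `true`.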

manast · 2024-12-02T21:50:31Z

@lovre-nc what about having an extra option in Queue.add, where you can specify whether the added job should propagate the trace context to the consumer or not?

lovre-nc · 2024-12-03T16:13:01Z

@manast In our case specifically, I can't think of a scenario where we would want to change this on the job level. In our system, different types of jobs always have their own, separate queues.
But considering that many other bullmq options can be set on both Queue and Job level, it may make sense.

Not sure if this is related enough, but it would be very useful to be able to supply a parent trace or context (not sure about the terminology) manually when adding a job. For example, when a few different jobs are part of a larger operation, this would allow us to have all these related jobs and other non-bullmq spans together in one trace.

Example trace

operation: Upload video

upload and store video
━━━━━━━-----------------
       process video (bullmq job)
-------━━━━━━━━━━━------
       process audio (bullmq job)
-------━━━━-------------
           generate subtitles (bullmq job)
-----------━━━----------
                  generate thumbnail
------------------━━----
                    publish
--------------------━---
                     notify subscribers (bullmq job)
---------------------━━━

manast · 2024-12-03T16:40:36Z

@lovre-nc yes, if the service that started the whole process is on a different machine than the service that adds the job, then I guess you would need to be able to specify the telemetry context so that you can get a trace spanning all these processes. In our tutorial we cover the case where an Express server takes HTTP requests, each of which results in a single trace for the whole process: https://blog.taskforce.sh/how-to-integrate-bullmqs-telemetry-on-a-newsletters-subscription-application-2/

manast · 2024-12-03T16:41:46Z

@lovre-nc actually, it is already possible to specify the context metadata when adding a job, we use this option internally precisely to keep all the spans in the same trace.
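
A rough sketch of how the "Upload video" example could share one trace this way. Everything here is an assumption for illustration: `TelemetryJobOptions` is a hypothetical subset of the job options (the real field names may differ between BullMQ versions, so check the `JobsOptions` reference), and the metadata is assumed to be a JSON-serialized carrier holding a W3C `traceparent` value. `toMetadata` and `withParentTrace` are made-up helpers.

```typescript
// Hypothetical subset of job options; verify field names against your BullMQ version.
interface TelemetryJobOptions {
  delay?: number;
  telemetry?: { metadata?: string; omitContext?: boolean };
}

// Assumption: metadata is a JSON carrier with a W3C traceparent header
// in the form version-traceId-spanId-flags.
function toMetadata(traceId: string, spanId: string): string {
  return JSON.stringify({ traceparent: `00-${traceId}-${spanId}-01` });
}

// Return a copy of the options with the parent trace metadata attached.
function withParentTrace(
  opts: TelemetryJobOptions,
  metadata: string,
): TelemetryJobOptions {
  return { ...opts, telemetry: { ...opts.telemetry, metadata } };
}

// One metadata string for the whole "Upload video" operation.
const metadata = toMetadata(
  "4bf92f3577b34da6a3ce929d0e0e4736",
  "00f067aa0ba902b7",
);

const jobNames = [
  "process video",
  "process audio",
  "generate subtitles",
  "notify subscribers",
];

// Every job carries the same parent metadata, so their spans join one trace.
const jobs = jobNames.map((name) => ({
  name,
  opts: withParentTrace({}, metadata),
}));
```

Each entry's `opts` would then be passed as the third argument when adding the job to its queue.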

lovre-nc · 2024-12-04T10:01:35Z

@manast that's perfect, sorry I missed that.

The omitContext option from the linked PR would still be very nice to have to solve my original issue.

manast added the enhancement New feature or request label Nov 27, 2024

manast assigned fgozdz Nov 27, 2024

fgozdz mentioned this issue Nov 29, 2024

feat(telemetry): add option to omit context propagation #2946

Merged
