Telemetry: Add option for a separate trace for every job attempt/start #2942

Open
lovre-nc opened this issue Nov 27, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@lovre-nc

Is your feature request related to a problem? Please describe.
Hi, we've attempted to migrate from @appsignal/opentelemetry-instrumentation-bullmq to the bullmq-otel package.

It seems that a single trace is re-used for the whole lifespan of a job. This is a problem for us because we have a queue where individual jobs exist and repeat forever. We have one job per user and it runs every 5 minutes by delaying it at the end of the worker instead of letting it complete. That means that these jobs have thousands or millions of starts.

When using the official bullmq telemetry, this seems to result in one endlessly long trace with thousands or millions of spans inside it.
This behavior may be useful for queues of one-off jobs, where it lets you see multiple attempts of a job together, but it does not work at all for this kind of repeating job.
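
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes exactly one start per 5-minute interval and no retries; the variable names are illustrative only.

```typescript
// One job per user, re-delayed every 5 minutes and never completed.
const intervalMinutes = 5;

// Starts that accumulate in a single job's trace.
const startsPerDay = (24 * 60) / intervalMinutes;
const startsPerYear = startsPerDay * 365;

console.log(startsPerDay); // 288
console.log(startsPerYear); // 105120
```

Each start adds at least one span to the same trace, so a single user's trace passes 100,000 spans within a year.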

Describe the solution you'd like
Ideally this behavior would be configurable per queue.
We'd like to have one trace per job attempt/start instead of one trace per job.

Describe alternatives you've considered
Staying with @appsignal/opentelemetry-instrumentation-bullmq, but we have some issues with it so that's not ideal.

Additional context
None

manast added the enhancement label on Nov 27, 2024
@manast
Contributor

manast commented Nov 29, 2024

In your use case, do you mean that you have always one active job per user?

@lovre-nc
Author

Basically a job in this queue exists as long as the user exists and it runs every 5 minutes and is never moved to completed.

An illustrated example of our worker that should help explain:

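import { Worker, Job, DelayedError } from "bullmq";

// `connection` (Redis options) and `messagesService` are defined elsewhere in our app.
const worker = new Worker(
  "pollMessagesForUser",
  async (job: Job, token?: string) => {
    // The job ID doubles as the user ID: one long-lived job per user.
    const userId = job.id;

    await messagesService.fetchAndStoreUserMessages(userId);

    // Re-schedule the same job 5 minutes from now instead of letting it complete.
    await job.moveToDelayed(Date.now() + 300_000, token);
    // Tell BullMQ the job was moved to delayed rather than completed or failed.
    throw new DelayedError();
  },
  { connection }
);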

In the meantime, I was able to work around the issue by patching bullmq using pnpm patch:

diff --git a/dist/cjs/utils.js b/dist/cjs/utils.js
index 6caa7bae682a971ffac6f585aca1b54b9aaa3267..b40cd744c05ca81f8eecdca93b7086fed3c1568a 100644
--- a/dist/cjs/utils.js
+++ b/dist/cjs/utils.js
@@ -238,7 +238,7 @@ async function trace(telemetry, spanKind, queueName, operation, destination, cal
         const { tracer, contextManager } = telemetry;
         const currentContext = contextManager.active();
         let parentContext;
-        if (srcPropagationMetadata) {
+        if (false) {
             parentContext = contextManager.fromMetadata(currentContext, srcPropagationMetadata);
         }
         const spanName = destination ? `${operation} ${destination}` : operation;
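
The patch above simply hard-codes the branch off. A configurable version of the same decision might look roughly like the sketch below. This is a self-contained model, not BullMQ's code: `Ctx`, `resolveParentContext`, and the `omitContext` parameter are illustrative names, and `fromMetadata` is only a stand-in for the context manager method that appears in the diff.

```typescript
// Simplified stand-in for a tracing context: only the trace ID matters here.
interface Ctx {
  traceId: string | undefined;
}

// Stand-in for contextManager.fromMetadata from the diff.
// Assumption: the metadata is a JSON string carrying the producer's trace ID.
function fromMetadata(current: Ctx, metadata: string): Ctx {
  const parsed = JSON.parse(metadata) as { traceId?: string };
  return { ...current, traceId: parsed.traceId };
}

// Decide which context the new span should be parented to.
// Returns undefined when the span should start a fresh trace.
function resolveParentContext(
  current: Ctx,
  srcPropagationMetadata: string | undefined,
  omitContext: boolean, // hypothetical flag, named after the option discussed in this thread
): Ctx | undefined {
  if (omitContext || !srcPropagationMetadata) {
    return undefined;
  }
  return fromMetadata(current, srcPropagationMetadata);
}

const active: Ctx = { traceId: undefined };
const metadata = JSON.stringify({ traceId: "abc123" });

// Default behavior: the span joins the producer's trace.
const linked = resolveParentContext(active, metadata, false);
// With the flag set: the stored metadata is ignored and a new trace begins.
const fresh = resolveParentContext(active, metadata, true);

console.log(linked?.traceId); // abc123
console.log(fresh); // undefined
```

With the flag set, every job start would get its own trace, which is the behavior the `if (false)` patch forces unconditionally.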

@manast
Contributor

manast commented Dec 2, 2024

@lovre-nc what about having an extra option in Queue.add, where you can specify whether the added job should propagate the trace context to the consumer or not?

@lovre-nc
Author

lovre-nc commented Dec 3, 2024

@manast In our case specifically, I can't think of a scenario where we would want to change this on the job level. In our system, different types of jobs always have their own, separate queues.
But considering that many other bullmq options can be set on both Queue and Job level, it may make sense.
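
If the option existed at both levels, the usual precedence would be job-level over queue-level. The following is only a sketch under that assumption: `TelemetryOptionsSketch` and `resolveOmitContext` are hypothetical names, not part of bullmq.

```typescript
// Hypothetical shape for a telemetry option that can be set per queue or per job.
interface TelemetryOptionsSketch {
  omitContext?: boolean;
}

// The job-level value wins when present, then the queue-level default, then false.
function resolveOmitContext(
  queueDefaults?: TelemetryOptionsSketch,
  jobOptions?: TelemetryOptionsSketch,
): boolean {
  return jobOptions?.omitContext ?? queueDefaults?.omitContext ?? false;
}

// A polling queue that opts out for every job by default.
const pollingQueueDefaults: TelemetryOptionsSketch = { omitContext: true };

console.log(resolveOmitContext(pollingQueueDefaults)); // true
console.log(resolveOmitContext(pollingQueueDefaults, { omitContext: false })); // false
console.log(resolveOmitContext()); // false
```

Using `??` rather than `||` matters here: an explicit `false` on the job must still override a queue default of `true`.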

Not sure if this is related enough, but it could be very useful to be able to supply a parent trace or context (not sure about the terminology) manually when adding a job. For example, when a few different jobs are part of a larger operation, this would allow us to have all these related jobs and other non-bullmq spans together in one trace.

Example trace
operation: Upload video

upload and store video
━━━━━━━-----------------
       process video (bullmq job)
-------━━━━━━━━━━━------
       process audio (bullmq job)
-------━━━━-------------
       generate subtitles (bullmq job)
-----------━━━-----------
                  generate thumbnail
------------------━━----
                    publish
--------------------━---
                     notify subscribers (bullmq job)
---------------------━━━
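
For jobs to share a trace like this, the parent operation's context has to travel with each job in serialized form. The sketch below assumes the W3C Trace Context `traceparent` format (`version-traceId-parentId-flags`); whether bullmq-otel stores its metadata exactly this way is an assumption, and the `telemetryMetadata` field and job objects are hypothetical stand-ins rather than real bullmq types.

```typescript
// Parsed form of a W3C `traceparent` value.
interface TraceParent {
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean;
}

function formatTraceParent(tp: TraceParent): string {
  const flags = tp.sampled ? "01" : "00";
  return `00-${tp.traceId}-${tp.parentId}-${flags}`;
}

function parseTraceParent(value: string): TraceParent | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value);
  if (!match) return undefined;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are invalid in the W3C format.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// The "Upload video" operation serializes its context once...
const uploadVideo: TraceParent = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  parentId: "00f067aa0ba902b7",
  sampled: true,
};
const carrier = JSON.stringify({ traceparent: formatTraceParent(uploadVideo) });

// ...and every job it spawns carries the same serialized context.
const jobs = ["process video", "process audio", "generate subtitles"].map(
  (name) => ({ name, telemetryMetadata: carrier }),
);

// On the worker side, each job restores the context and joins the same trace.
const restored = jobs.map((job) =>
  parseTraceParent(JSON.parse(job.telemetryMetadata).traceparent),
);
const sameTrace = restored.every((tp) => tp?.traceId === uploadVideo.traceId);
console.log(sameTrace); // true
```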

@manast
Contributor

manast commented Dec 3, 2024

@lovre-nc yes, if the service that started the whole process is on a different machine than the service that is adding the job, then I guess you would need to be able to specify the telemetry context so that you can get a trace spanning all these processes. In our tutorial we cover the case where an Express server takes HTTP requests, each of which results in a single trace for the whole process: https://blog.taskforce.sh/how-to-integrate-bullmqs-telemetry-on-a-newsletters-subscription-application-2/

@manast
Contributor

manast commented Dec 3, 2024

@lovre-nc actually, it is already possible to specify the context metadata when adding a job. We use this option internally precisely to keep all the spans in the same trace.

@lovre-nc
Author

lovre-nc commented Dec 4, 2024

@manast that's perfect, sorry I missed that.

The omitContext option from the linked PR would still be very nice to have to solve my original issue.


3 participants