Semantic conventions for batch jobs #1640

Open
mateuszrzeszutek opened this issue Jan 15, 2021 · 11 comments

Comments

@mateuszrzeszutek
Member

What are you trying to achieve?

I want to introduce some semantic conventions for batch jobs, since there are currently no conventions around that. I'm mostly interested in instrumenting Spring Batch applications.
There already is a Java JSR-352 spec that describes a batch job API, I was thinking of basing the trace spec on that - the concepts of job/step/chunk seem generic and language-agnostic enough (and there doesn't seem to be any other batch job specification).
Before diving into details, is there a place for this in the trace semantic conventions?

Additional context.

@fbogsany
Contributor

Good timing: we discussed this in the Ruby SIG meeting earlier this week. Background job queues are common in Ruby web applications, with the dominant implementations being Resque and Sidekiq. So far, we've instrumented those systems using the messaging semantic conventions, but they're really not a great fit for background/batch jobs. See open-telemetry/opentelemetry-ruby#547 (comment)

@Oberon00
Member

Oberon00 commented Jan 15, 2021

Wouldn't this be a case where you just use an INTERNAL span without any attributes? Not every span needs to conform to a semantic convention. What information do you want to have on batch job spans?

@fbogsany
Contributor

the concepts of job/step/chunk

^ this bit seems useful.

For Ruby batch job systems, relative to message systems:

  1. "receiver" spans are not particularly useful
  2. The job class name is usually much more interesting as part of the span name than the queue name (e.g. MyJob enqueue vs default send)
  3. The ... enqueue suffix is more in keeping with the domain language than ... send
  4. destination_kind is likely always queue, so probably isn't an interesting thing to specify (and certainly shouldn't be required).
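The naming difference in point 2 can be sketched as follows. This is only an illustration in plain Python: the helper names `enqueue_span_name` and `messaging_span_name` are hypothetical and not part of any OpenTelemetry API or of the Ruby instrumentation.

```python
def enqueue_span_name(job_class: str) -> str:
    """Job-centric name: the job class plus the domain verb 'enqueue'."""
    return f"{job_class} enqueue"


def messaging_span_name(queue: str) -> str:
    """Messaging-style name: the destination (queue) plus 'send'."""
    return f"{queue} send"


# The same enqueue operation, named both ways.
print(enqueue_span_name("MyJob"))      # job-centric naming
print(messaging_span_name("default"))  # messaging-convention naming
```

With many job classes sharing one queue, the messaging-style name is identical for all of them, while the job-centric name distinguishes each job class.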

@Oberon00
Member

@fbogsany
Contributor

For the "job class name": There are semantic conventions for code locations, see https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/span-general.md#source-code-attributes.

That's useful, but doesn't meet the expectations of users re: span names.

@mateuszrzeszutek
Member Author

Good timing: we discussed this in the Ruby SIG meeting earlier this week.

Nice! I'll take a look at sidekiq & your instrumentation and try to extract common parts - at first it looks like the job span would be the only common thing, but maybe there's more. And there's the whole queue logic that's not there in Spring Batch/JSR-352.

Wouldn't this be a case where you just use an INTERNAL span without any attributes? Not every span needs to conform to a semantic convention. What information do you want to have on batch job spans?

True, JSR-352 does not expose much information that we could store as attributes, but that does not mean there is none: there's the exit status of a job/step (an arbitrary string) and the job/step execution id (jid in sidekiq?). And, at least for JSR-352/Spring Batch, job steps expose several metrics (read count, write count, ...) that could be used.
And probably the most important piece of information is the span name.

@mateuszrzeszutek
Member Author

This is what I had in mind for my use case (Spring Batch):

  1. JobLauncher/JobOperator/enqueue span that marks the (possibly) asynchronous start of batch job processing.
    Name: Start Job <batch.job.name>
    Attributes:
    • batch.job.name: the name of the job (Spring Batch), or the job class name (resque/sidekiq), or the task name (celery);
    • batch.job.id: the job execution id (Spring Batch), job['jid'] in case of sidekiq.
  2. Job span that wraps the whole processing of a batch job.
    Name: Job <batch.job.name>
    Attributes:
    • batch.job.name;
    • batch.job.id;
    • batch.job.exit_status: a plain string containing the exit status of a job. JSR-352/Spring Batch jobs (and steps) can return an arbitrary user-defined string as the exit status (e.g. error message saying why the job has failed). Not sure how this translates to Ruby/Python frameworks.
  3. Step span that wraps the execution of a step. In JSR-352, a job consists of a series of steps: a step "encapsulates an independent, sequential phase of a batch job". In the simplest possible case, a job that only does one thing has only one step.
    Name: Job <batch.job.name>.<batch.step.name>
    Attributes:
    • batch.step.name: the name of the step;
    • batch.step.id: the step execution id;
    • batch.step.exit_status: a plain string containing the exit status of a step. The batch job may take action depending on the result of a step, e.g. send an email and stop further processing in case of failure.
  4. Chunk span. In JSR-352, steps that process numerous items (e.g. read a CSV file and write records into DB) may be split into chunks, which are pretty much equivalent to database transactions. For example, a step that writes thousands of records may be configured to commit every 50 items. There are no attributes that can be set on this span, but an exception can be recorded here.
  5. Item read, process and write spans. This is the lowest level we want to instrument. I think that this image and pseudocode fragment best describe what happens here. Again, there are no attributes here, but it's worth having spans on the item level because of the exception visibility and better trace structure.
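As a rough illustration of the first three spans in the list above, the names and attributes could be assembled like this. It is a sketch in plain Python with no OpenTelemetry SDK calls: `SpanSketch` and the three builder functions are hypothetical names, and the sample job and step values are made up. Chunk and item spans are left out because the proposal gives them no names or attributes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SpanSketch:
    """Hypothetical stand-in for a span: just a name and string attributes."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


def start_job_span(job_name: str, job_id: str) -> SpanSketch:
    # Span 1: marks the (possibly asynchronous) start of a job.
    return SpanSketch(
        f"Start Job {job_name}",
        {"batch.job.name": job_name, "batch.job.id": job_id},
    )


def job_span(job_name: str, job_id: str,
             exit_status: Optional[str] = None) -> SpanSketch:
    # Span 2: wraps the whole processing of a job.
    attrs = {"batch.job.name": job_name, "batch.job.id": job_id}
    if exit_status is not None:
        attrs["batch.job.exit_status"] = exit_status
    return SpanSketch(f"Job {job_name}", attrs)


def step_span(job_name: str, step_name: str, step_id: str,
              exit_status: Optional[str] = None) -> SpanSketch:
    # Span 3: wraps a single step; the name combines job and step names.
    attrs = {"batch.step.name": step_name, "batch.step.id": step_id}
    if exit_status is not None:
        attrs["batch.step.exit_status"] = exit_status
    return SpanSketch(f"Job {job_name}.{step_name}", attrs)


# Made-up sample values for illustration only.
for span in (
    start_job_span("importUsers", "17"),
    job_span("importUsers", "17", "COMPLETED"),
    step_span("importUsers", "loadCsv", "42", "COMPLETED"),
):
    print(span.name, span.attributes)
```

The exit status is optional here because it is only known once the job or step has finished, and because it is unclear from the discussion whether non-Java frameworks have an equivalent.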

@fbogsany I believe that the first two spans that I've briefly described here match your use case with both Ruby libs that you've mentioned. I'm not sure about the other three, sidekiq/resque do not seem to have this sort of rigid job structure that spring batch has.

@fbogsany
Contributor

I'm not sure about the other three, sidekiq/resque do not seem to have this sort of rigid job structure that spring batch has.

They don't have the Step span, but at Shopify, for example, we have a higher-level job execution framework that provides an equivalent of "chunks", so the Chunk span is relevant there. I'm not sure about the Item read, process and write spans.

@weyert

weyert commented Apr 6, 2021

Yeah, it doesn't look like Hangfire (.NET) and Bree (Node.js) have the same level of structure as Spring Batch (which I have never used); they look simpler.

[0] https://docs.hangfire.io/en/latest/background-processing/processing-background-jobs.html
[1] https://jobscheduler.net/

@Kiiv

Kiiv commented Nov 28, 2024

Hi,

It seems that there are now some properties to enable instrumentation of Spring Batch: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/spring/README.md#settings

Should it also work with a JSR-352 implementation? I have some batch jobs implemented on top of JSR-352 running in an OpenLiberty server (no Spring at all). I tried to enable "otel.instrumentation.spring-batch.item.enabled" but nothing appears. I just want to be sure that I didn't miss something.

@trask
Member

trask commented Nov 30, 2024

Should it also work with a JSR-352 implementation?

hi @Kiiv, I'd suggest asking your question in https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues or #otel-java

@danielgblanco danielgblanco transferred this issue from open-telemetry/opentelemetry-specification Dec 2, 2024
gyliu513 pushed a commit to gyliu513/semantic-conventions that referenced this issue Dec 6, 2024
I noticed developers adding their own attributes to this namespace
without going through the specification. We need to regulate this
namespace through the specification, just like we do it for other
semantic conventions.

6 participants